The embedding (colour) values must be close for all pairs of pixels within the same instance, parameterised by a margin $M$:
$$\lVert \Phi_u(\mathbf{x}) - \Phi_v(\mathbf{x})\rVert^2 \leq 1 - M$$
They should be further apart for all pairs of pixels from different instances,
$$\lVert \Phi_u(\mathbf{x}) - \Phi_v(\mathbf{x})\rVert^2 \geq 1 + M.$$
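These two constraints are typically enforced with hinge penalties. A minimal sketch, assuming per-pixel embeddings as numpy vectors (the function name and the margin value are illustrative, not from the source):

```python
import numpy as np

def margin_loss(phi_u, phi_v, same_instance, margin=0.3):
    """Hinge losses for the two margin constraints above.

    phi_u, phi_v: embedding vectors of a pair of pixels.
    same_instance: True if the pixels belong to the same instance.
    """
    d2 = np.sum((phi_u - phi_v) ** 2)
    if same_instance:
        # pull together: penalise squared distance exceeding 1 - M
        return max(0.0, d2 - (1.0 - margin))
    # push apart: penalise squared distance below 1 + M
    return max(0.0, (1.0 + margin) - d2)
```

In training these per-pair losses would be summed over sampled pixel pairs.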
The problem with using convolutional networks for this purpose is that they are translation invariant: two pixels with identical surrounding appearance receive identical embeddings regardless of their position, so distinct but identical-looking instances cannot be told apart.
To work around this, we define the position-dependent embedding $\Psi_u(\mathbf{x}) = \Phi_u(\mathbf{x}) + u$, where $u$ denotes the pixel coordinates. For two different pixels $u, v$, $u \neq v$, within the same instance we require $\Psi_u(\mathbf{x}) = \Psi_v(\mathbf{x})$.
Thus for all pixels $u$ in an instance $S_k$ it must hold that $\Psi_u(\mathbf{x}) = \Phi_u(\mathbf{x}) + u = c_k$.
We can think of $c_k$ as the centroid of the instance.
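The operator $\Psi$ is simply the convolutional output with the pixel coordinates added to (here) its first two channels. A minimal numpy sketch, assuming an $(H, W, D)$ embedding map (the function name is illustrative):

```python
import numpy as np

def semi_conv_embedding(phi):
    """Compute Psi_u(x) = Phi_u(x) + u by adding pixel coordinates
    to the first two embedding channels.

    phi: (H, W, D) array of convolutional embeddings, D >= 2.
    """
    H, W, _ = phi.shape
    psi = phi.copy()
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    psi[..., 0] += ys  # row coordinate
    psi[..., 1] += xs  # column coordinate
    return psi
```

With $\Phi \equiv 0$ every pixel's $\Psi$ is just its own coordinates; the network only has to predict the *offset* from each pixel to its instance centroid $c_k$.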
Pixels can then be compared with a Gaussian kernel on these embeddings:
$$K(u, v) = \exp\left(-\frac{\lVert\Psi_u(\mathbf{x}) - \Psi_v(\mathbf{x})\rVert^2}{2}\right)$$
Splitting the embedding into its first two (geometric) dimensions, which carry the positional offset, and the remaining $d-2$ (appearance) dimensions, the squared distance decomposes as
$$\lVert\Psi_u(\mathbf{x}) - \Psi_v(\mathbf{x})\rVert^2 = \sum_i\left(\Psi_{u,i}(\mathbf{x}) - \Psi_{v,i}(\mathbf{x})\right)^2 \\= \sum_{i=1}^{2}\left((\Phi_{u,i}(\mathbf{x}) + u_i) - (\Phi_{v,i}(\mathbf{x}) + v_i)\right)^2 + \sum_{i=3}^{d}\left(\Phi_{u,i}(\mathbf{x}) - \Phi_{v,i}(\mathbf{x})\right)^2$$
Writing $\Phi^g$ for the geometric part and $\Phi^a$ for the appearance part, the kernel factorises:
$$K(u, v) = \exp\left(-\frac{\lVert(\Phi_u^g(\mathbf{x}) + u) - (\Phi_v^g(\mathbf{x}) + v)\rVert^2}{2}\right)\exp\left(-\frac{\lVert\Phi_u^a(\mathbf{x}) - \Phi_v^a(\mathbf{x})\rVert^2}{2}\right)$$
If the geometric part of the embedding is identically zero, $\Phi_u^g(\mathbf{x}) = 0$, the first factor reduces to a Gaussian kernel on the pixel positions alone:
$$K(u, v) = \exp\left(-\frac{\lVert u - v\rVert^2}{2}\right)\exp\left(-\frac{\lVert\Phi_u^a(\mathbf{x}) - \Phi_v^a(\mathbf{x})\rVert^2}{2}\right)$$
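The factorisation can be checked numerically. A sketch assuming the embedding is the concatenation of a 2-d geometric part $\Phi^g$ and a $(d-2)$-d appearance part $\Phi^a$ (all names illustrative):

```python
import numpy as np

def kernel_full(psi_u, psi_v):
    """Gaussian kernel K(u, v) on the joint embedding Psi."""
    return np.exp(-np.sum((psi_u - psi_v) ** 2) / 2.0)

def kernel_factored(phi_g_u, u, phi_g_v, v, phi_a_u, phi_a_v):
    """Product of the geometric and appearance factors."""
    geo = np.exp(-np.sum(((phi_g_u + u) - (phi_g_v + v)) ** 2) / 2.0)
    app = np.exp(-np.sum((phi_a_u - phi_a_v) ** 2) / 2.0)
    return geo * app
```

Because the squared norm is a sum over independent coordinates, the exponential splits exactly into the two factors.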
In practice, however, we use the following Laplacian (rather than Gaussian) kernel, which uses the Euclidean distance rather than the squared Euclidean distance between the vectors:
$$K_{\sigma}(u, v) = \exp\left(-\frac{\lVert\Psi_u(\mathbf{x}) - \Psi_v(\mathbf{x})\rVert}{\sigma}\right)$$
Here the parameter $\sigma$ is learnable.
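A minimal numpy sketch of this kernel; in a real network $\sigma$ would be a learnable parameter updated by gradient descent, whereas here it is just a float argument (illustrative):

```python
import numpy as np

def laplacian_kernel(psi_u, psi_v, sigma=1.0):
    """Laplacian kernel on the (unsquared) Euclidean distance.

    sigma: bandwidth; learnable in the original formulation.
    """
    return np.exp(-np.linalg.norm(psi_u - psi_v) / sigma)
```

Compared with the Gaussian kernel, the Laplacian decays more slowly for large distances and has non-vanishing gradient near zero distance, which can make training more stable.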