The embedding (colour) values must be close for all pairs of pixels within the same instance, parameterised by a margin $M$:
$$\lVert \Phi_u(\mathbf{x}) - \Phi_v(\mathbf{x})\rVert^2 \leq 1 - M$$
They should be further apart for all pairs of pixels from different instances,
$$\lVert \Phi_u(\mathbf{x}) - \Phi_v(\mathbf{x})\rVert^2 \geq 1 + M.$$
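These two constraints are typically enforced with hinge penalties. A minimal sketch, assuming per-pixel embeddings as numpy vectors (the function name and the margin value are illustrative, not from the source):

```python
import numpy as np

def margin_loss(phi_u, phi_v, same_instance, margin=0.3):
    """Hinge losses for the two margin constraints above.

    phi_u, phi_v: embedding vectors of a pair of pixels.
    same_instance: True if the pixels belong to the same instance.
    """
    d2 = np.sum((phi_u - phi_v) ** 2)
    if same_instance:
        # pull together: penalise squared distance exceeding 1 - M
        return max(0.0, d2 - (1.0 - margin))
    # push apart: penalise squared distance below 1 + M
    return max(0.0, (1.0 + margin) - d2)
```

In training these per-pair losses would be summed over sampled pixel pairs.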
The problem with using convolutional networks for this purpose is that they are translation invariant: two pixels with identical surrounding appearance receive identical embeddings regardless of their position, so distinct but identical-looking instances cannot be told apart.
To work around this, we define the position-dependent embedding $\Psi_u(\mathbf{x}) = \Phi_u(\mathbf{x}) + u$, where $u$ denotes the pixel coordinates. For two different pixels $u, v$, $u \neq v$, within the same instance we require $\Psi_u(\mathbf{x}) = \Psi_v(\mathbf{x})$.
Thus for all pixels $u$ in an instance $S_k$ it must hold that $\Psi_u(\mathbf{x}) = \Phi_u(\mathbf{x}) + u = c_k$.
We can think of $c_k$ as the centroid of the instance.
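The operator $\Psi$ is simply the convolutional output with the pixel coordinates added to (here) its first two channels. A minimal numpy sketch, assuming an $(H, W, D)$ embedding map (the function name is illustrative):

```python
import numpy as np

def semi_conv_embedding(phi):
    """Compute Psi_u(x) = Phi_u(x) + u by adding pixel coordinates
    to the first two embedding channels.

    phi: (H, W, D) array of convolutional embeddings, D >= 2.
    """
    H, W, _ = phi.shape
    psi = phi.copy()
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    psi[..., 0] += ys  # row coordinate
    psi[..., 1] += xs  # column coordinate
    return psi
```

With $\Phi \equiv 0$ every pixel's $\Psi$ is just its own coordinates; the network only has to predict the *offset* from each pixel to its instance centroid $c_k$.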
Pixels can then be compared with a Gaussian kernel on these embeddings:
$$K(u, v) = \exp\left(-\frac{\lVert\Psi_u(\mathbf{x}) - \Psi_v(\mathbf{x})\rVert^2}{2}\right)$$
Splitting the embedding into its first two (geometric) dimensions, which carry the positional offset, and the remaining $d-2$ (appearance) dimensions, the squared distance decomposes as
$$\lVert\Psi_u(\mathbf{x}) - \Psi_v(\mathbf{x})\rVert^2 = \sum_i\left(\Psi_{u,i}(\mathbf{x}) - \Psi_{v,i}(\mathbf{x})\right)^2 \\= \sum_{i=1}^{2}\left((\Phi_{u,i}(\mathbf{x}) + u_i) - (\Phi_{v,i}(\mathbf{x}) + v_i)\right)^2 + \sum_{i=3}^{d}\left(\Phi_{u,i}(\mathbf{x}) - \Phi_{v,i}(\mathbf{x})\right)^2$$
Writing $\Phi^g$ for the geometric part and $\Phi^a$ for the appearance part, the kernel factorises:
$$K(u, v) = \exp\left(-\frac{\lVert(\Phi_u^g(\mathbf{x}) + u) - (\Phi_v^g(\mathbf{x}) + v)\rVert^2}{2}\right)\exp\left(-\frac{\lVert\Phi_u^a(\mathbf{x}) - \Phi_v^a(\mathbf{x})\rVert^2}{2}\right)$$
If the geometric part of the embedding is identically zero, $\Phi_u^g(\mathbf{x}) = 0$, the first factor reduces to a Gaussian kernel on the pixel positions alone:
$$K(u, v) = \exp\left(-\frac{\lVert u - v\rVert^2}{2}\right)\exp\left(-\frac{\lVert\Phi_u^a(\mathbf{x}) - \Phi_v^a(\mathbf{x})\rVert^2}{2}\right)$$
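The factorisation can be checked numerically. A sketch assuming the embedding is the concatenation of a 2-d geometric part $\Phi^g$ and a $(d-2)$-d appearance part $\Phi^a$ (all names illustrative):

```python
import numpy as np

def kernel_full(psi_u, psi_v):
    """Gaussian kernel K(u, v) on the joint embedding Psi."""
    return np.exp(-np.sum((psi_u - psi_v) ** 2) / 2.0)

def kernel_factored(phi_g_u, u, phi_g_v, v, phi_a_u, phi_a_v):
    """Product of the geometric and appearance factors."""
    geo = np.exp(-np.sum(((phi_g_u + u) - (phi_g_v + v)) ** 2) / 2.0)
    app = np.exp(-np.sum((phi_a_u - phi_a_v) ** 2) / 2.0)
    return geo * app
```

Because the squared norm is a sum over independent coordinates, the exponential splits exactly into the two factors.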
In practice, however, we use the following Laplacian (rather than Gaussian) kernel, which uses the Euclidean distance rather than the squared Euclidean distance between the vectors:
$$K_{\sigma}(u, v) = \exp\left(-\frac{\lVert\Psi_u(\mathbf{x}) - \Psi_v(\mathbf{x})\rVert}{\sigma}\right)$$
Here the parameter $\sigma$ is learnable.
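A minimal numpy sketch of this kernel; in a real network $\sigma$ would be a learnable parameter updated by gradient descent, whereas here it is just a float argument (illustrative):

```python
import numpy as np

def laplacian_kernel(psi_u, psi_v, sigma=1.0):
    """Laplacian kernel on the (unsquared) Euclidean distance.

    sigma: bandwidth; learnable in the original formulation.
    """
    return np.exp(-np.linalg.norm(psi_u - psi_v) / sigma)
```

Compared with the Gaussian kernel, the Laplacian decays more slowly for large distances and has non-vanishing gradient near zero distance, which can make training more stable.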