Home

Transductive Centroid Projections - Part 1

Classifier weights as normals of decision hyperplane

  • A deep neural network can be regarded as a classifier attached to a feature extractor
  • The feature extractor consists of all the layers except the final dense layer and outputs an embedding $f$ for an example $x$.
  • The final classifier layer takes the embedding $f$ as input and outputs the prediction $\hat{y} = W^Tf$.
  • Each element of the prediction is given by $\hat{y}_n = (W^Tf)_n = w_n^Tf$, i.e. the dot product of the $n$-th column of $W$ with the embedding $f$.
  • The predicted class is the index $n'$ at which this dot product is highest, meaning that out of all the $w_n$, the embedding $f$ is most aligned with $w_{n'}$.
  • This means that the columns of the weight matrix $W$ of the final dense layer point along the normal vectors of the decision hyperplanes learned by the model.
  • We call these the anchors of each class.
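The view of classifier columns as anchors can be sketched numerically. This is a minimal illustration with made-up dimensions, not the paper's setup: the logits are exactly the per-class dot products, so the predicted class is the anchor most aligned with the embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 5, 8                    # illustrative sizes: 5 classes, 8-dim embeddings
W = rng.normal(size=(d, M))    # final dense layer: one column ("anchor") per class
f = rng.normal(size=d)         # embedding of one example from the feature extractor

y_hat = W.T @ f                # logits: y_hat[n] = w_n . f
predicted = int(np.argmax(y_hat))

# The logits are exactly the dot products with each class anchor
dots = np.array([W[:, n] @ f for n in range(M)])
assert np.allclose(y_hat, dots)
```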

How does the model work

  • Unlabelled examples are clustered using some clustering algorithm.
  • A minibatch consists of labelled data $\mathcal{X}_p^L \subset \mathcal{X}^L$ and unlabelled data $\mathcal{X}_q^U \subset \mathcal{X}^U$.
  • The labelled part of the minibatch is constructed as usual by selecting $\mathcal{X}_p^L$ at random.
  • However, $\mathcal{X}_q^U$ is constructed by randomly selecting $l$ unlabelled clusters with $o$ samples from each cluster, such that $q = l \times o$.
  • The layer prior to the classification layer outputs the vectors $f_1,...,f_N$ for a batch of size $N$, which are split into two groups of vectors $[f^L, f^U]$.
  • Similarly, the weight matrix can be split into two matrices $W^M$ and $W^l$.
  • $W^M$ consists of $M$ column vectors corresponding to anchors for each of the $M$ classes, whilst $W^l$ has $l$ column vectors corresponding to centroids of the $l$ clusters.
  • The centroids for the unlabelled data are obtained as follows:

    $$c_i^U = \alpha \sum_{\iota=1}^o \frac{f_{i,\iota}^U}{\lVert {f_{i,\iota}^U} \rVert_2}, \qquad \alpha = \frac{1}{M} \sum_{j=1}^M \lVert {c_j^L} \rVert_2$$
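The centroid computation can be sketched as follows. This is a minimal sketch with illustrative sizes and random stand-ins for the labelled anchors: each unlabelled centroid is the sum of its cluster's L2-normalised features, rescaled by $\alpha$, the average norm of the labelled anchors.

```python
import numpy as np

rng = np.random.default_rng(1)

d, M = 8, 5          # embedding dim and number of labelled classes (illustrative)
l, o = 3, 4          # l unlabelled clusters, o samples per cluster

# Labelled anchors c^L_j (random stand-ins for the columns of W^M)
c_L = rng.normal(size=(d, M))
# alpha: average L2 norm of the labelled anchors
alpha = np.linalg.norm(c_L, axis=0).mean()

# Unlabelled features, grouped by cluster: shape (l, o, d)
f_U = rng.normal(size=(l, o, d))

# Each centroid: alpha times the sum of L2-normalised features in its cluster
f_unit = f_U / np.linalg.norm(f_U, axis=2, keepdims=True)
c_U = alpha * f_unit.sum(axis=1)          # shape (l, d): one centroid per cluster
```

The rescaling by $\alpha$ keeps the unlabelled centroids at a scale comparable to the labelled anchors, so both halves of the weight matrix produce logits of similar magnitude.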

Why use centroids

  • They show that the anchors, i.e. the columns of $W$, converge to the centroids of the features $f$ of the layer prior to the classification layer, across different datasets and different dimensions of $f$.
  • The weight update for $w_n$, the $n$-th column of $W$ (i.e. $-\eta$ times the loss gradient with respect to $w_n$, for learning rate $\eta$), can be shown to be:

    $$\Delta w_n = -\eta\nabla_{w_n}l = \eta\sum_{f \in I_n}(1-p_n)f - \eta\sum_{f \notin I_n}p_nf $$

    $$ p_n = \frac{\exp y_n}{\sum_{m=1}^{M}{\exp y_{m}}} \text{ }\text{(i.e. the predicted probability that the class of the example is $n$)}$$

    $$ y = W^Tf$$

  • The first term involves a weighted sum of the features of the examples belonging to class $n$.
  • We can think of this term as approximately pointing along the direction of the centroid.
  • One way to see this is to note that for the examples with high predicted probability for class $n$, the dot product between $w_n$ and $f$ must already be large and positive.
  • So initially $w_n$ is already well aligned with the features of these examples.
  • However, the weights in the gradient update are $1 - p_n$, so the update moves $w_n$ closer to the features of the examples of class $n$ it is less aligned with, pulling it towards the class centroid.
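The weight-update formula above can be checked numerically against the standard softmax cross-entropy gradient. This is a sketch with illustrative sizes: it computes $\Delta w_n$ both from the two-term formula and from the usual $\partial l/\partial y = p - \text{onehot}$ identity, and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, B = 6, 4, 16      # embedding dim, classes, batch size (illustrative)
eta = 0.1

W = rng.normal(size=(d, M))
F = rng.normal(size=(B, d))                 # embeddings f for a batch
labels = rng.integers(0, M, size=B)

logits = F @ W                              # y = W^T f per example
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)           # softmax probabilities p

n = 0                                       # inspect the update for anchor w_n
in_n = labels == n

# Update from the formula: attract features of class n, repel the rest
delta_formula = (eta * ((1 - P[in_n, n]) @ F[in_n])
                 - eta * (P[~in_n, n] @ F[~in_n]))

# Same update via the softmax cross-entropy identity dl/dy = p - onehot
G = P.copy()
G[np.arange(B), labels] -= 1.0              # gradient of summed loss w.r.t. logits
delta_grad = -eta * (F.T @ G)[:, n]

assert np.allclose(delta_formula, delta_grad)
```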