Home

Transductive Centroid Projections - Part 1

Classifier weights as normals of decision hyperplane

  • A deep neural network can be regarded as a classifier attached to a feature extractor
  • The feature extractor consists of all the layers except the final dense layer and outputs an embedding $f$ for an example $x$.
  • The final classifier layer takes the embedding $f$ as input and outputs the prediction $\hat{y} = W^Tf$.
  • Each element of the prediction is given by $\hat{y}_n = (W^Tf)_n = w_n^Tf$, i.e. the dot product of the $n$-th column of $W$ with the embedding $f$.
  • The predicted class is the index $n'$ at which this dot product is highest, meaning that out of all the $w_n$, the embedding $f$ is most aligned with $w_{n'}$.
  • This means that the columns of the weight matrix $W$ of the final dense layer point along the normal vectors of the decision hyperplanes learned by the model.
  • We call these the anchors of each class.
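The view of classifier columns as anchors can be sketched numerically. This is a minimal illustration with made-up dimensions, not the paper's setup: the logits are exactly the per-class dot products, so the predicted class is the anchor most aligned with the embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 5, 8                    # illustrative sizes: 5 classes, 8-dim embeddings
W = rng.normal(size=(d, M))    # final dense layer: one column ("anchor") per class
f = rng.normal(size=d)         # embedding of one example from the feature extractor

y_hat = W.T @ f                # logits: y_hat[n] = w_n . f
predicted = int(np.argmax(y_hat))

# The logits are exactly the dot products with each class anchor
dots = np.array([W[:, n] @ f for n in range(M)])
assert np.allclose(y_hat, dots)
```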

How does the model work

  • Unlabelled examples are clustered using some clustering algorithm.
  • A minibatch consists of labelled data $\mathcal{X}_p^L \subset \mathcal{X}^L$ and unlabelled data $\mathcal{X}_q^U \subset \mathcal{X}^U$.
  • The labelled part of the minibatch is constructed as usual by selecting $\mathcal{X}_p^L$ at random.
  • However, $\mathcal{X}_q^U$ is constructed by randomly selecting $l$ unlabelled clusters with $o$ samples from each cluster, such that $q = l \times o$.
  • The layer prior to the classification layer outputs the vectors $f_1,...,f_N$ for a batch of size $N$, which are split into two groups of vectors $[f^L, f^U]$.
  • Similarly, the weight matrix can be split into two matrices $W^M$ and $W^l$.
  • $W^M$ consists of $M$ column vectors corresponding to anchors for each of the $M$ classes, whilst $W^l$ has $l$ column vectors corresponding to centroids of the $l$ clusters.
  • The centroids for the unlabelled data are obtained as follows:

    $$c_i^U = \alpha \sum_{\iota=1}^o \frac{f_{i,\iota}^U}{\lVert {f_{i,\iota}^U} \rVert_2}, \qquad \alpha = \frac{1}{M} \sum_{j=1}^M \lVert {c_j^L} \rVert_2$$
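The centroid computation can be sketched as follows. This is a minimal sketch with illustrative sizes and random stand-ins for the labelled anchors: each unlabelled centroid is the sum of its cluster's L2-normalised features, rescaled by $\alpha$, the average norm of the labelled anchors.

```python
import numpy as np

rng = np.random.default_rng(1)

d, M = 8, 5          # embedding dim and number of labelled classes (illustrative)
l, o = 3, 4          # l unlabelled clusters, o samples per cluster

# Labelled anchors c^L_j (random stand-ins for the columns of W^M)
c_L = rng.normal(size=(d, M))
# alpha: average L2 norm of the labelled anchors
alpha = np.linalg.norm(c_L, axis=0).mean()

# Unlabelled features, grouped by cluster: shape (l, o, d)
f_U = rng.normal(size=(l, o, d))

# Each centroid: alpha times the sum of L2-normalised features in its cluster
f_unit = f_U / np.linalg.norm(f_U, axis=2, keepdims=True)
c_U = alpha * f_unit.sum(axis=1)          # shape (l, d): one centroid per cluster
```

The rescaling by $\alpha$ keeps the unlabelled centroids at a scale comparable to the labelled anchors, so both halves of the weight matrix produce logits of similar magnitude.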

Why use centroids

  • They show that the anchors, i.e. the columns of $W$, converge to the centroids of the features $f$ of the layer prior to the classification layer, across different datasets and different dimensions of $f$.
  • The weight update for $w_n$, the $n$-th column of $W$ (i.e. $-\eta$ times the loss gradient with respect to $w_n$, for learning rate $\eta$), can be shown to be:

    $$\Delta w_n = -\eta\nabla_{w_n}l = \eta\sum_{f \in I_n}(1-p_n)f - \eta\sum_{f \notin I_n}p_nf $$

    $$ p_n = \frac{\exp y_n}{\sum_{m=1}^{M}{\exp y_{m}}} \text{ }\text{(i.e. the predicted probability that the class of the example is $n$)}$$

    $$ y = W^Tf$$

  • The first term involves a weighted sum of the features of the examples belonging to class $n$.
  • We can think of this term as approximately pointing along the direction of the centroid.
  • One way to see this is to note that for the examples with high predicted probability for class $n$, the dot product between $w_n$ and $f$ must already be large and positive.
  • So initially $w_n$ is already well aligned with the features of these examples.
  • However, the weights in the gradient update are $1 - p_n$, so the update moves $w_n$ closer to the features of the examples of class $n$ it is less aligned with, pulling it towards the class centroid.
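The weight-update formula above can be checked numerically against the standard softmax cross-entropy gradient. This is a sketch with illustrative sizes: it computes $\Delta w_n$ both from the two-term formula and from the usual $\partial l/\partial y = p - \text{onehot}$ identity, and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, B = 6, 4, 16      # embedding dim, classes, batch size (illustrative)
eta = 0.1

W = rng.normal(size=(d, M))
F = rng.normal(size=(B, d))                 # embeddings f for a batch
labels = rng.integers(0, M, size=B)

logits = F @ W                              # y = W^T f per example
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)           # softmax probabilities p

n = 0                                       # inspect the update for anchor w_n
in_n = labels == n

# Update from the formula: attract features of class n, repel the rest
delta_formula = (eta * ((1 - P[in_n, n]) @ F[in_n])
                 - eta * (P[~in_n, n] @ F[~in_n]))

# Same update via the softmax cross-entropy identity dl/dy = p - onehot
G = P.copy()
G[np.arange(B), labels] -= 1.0              # gradient of summed loss w.r.t. logits
delta_grad = -eta * (F.T @ G)[:, n]

assert np.allclose(delta_formula, delta_grad)
```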