What is different about this approach?

  • A key idea here is that the classifier deals differently with the labelled and unlabelled examples.
  • The classifier weights are not influenced by the unlabelled data.
  • In order for the loss on the unlabelled examples to be low, they must lie close to their assigned centroids, which in turn means they must lie close to each other.
  • Thus the feature extractor is encouraged to push similar examples close to each other, which also helps pull together the embeddings of similar labelled examples.
  • The model performs better than the following semi-/unsupervised approaches:
    • Clustering the unlabelled data and training the model using the cluster centres as labels
    • Pseudo-labelling: training on the labelled data, obtaining predicted labels for the unlabelled data from the trained model, and using these as labels for subsequent training
    • Using a triplet loss with the unlabelled data
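The pull-towards-centroid behaviour for the unlabelled examples can be sketched as below. Everything here is an illustrative assumption rather than the paper's exact procedure: a few plain k-means iterations stand in for the clustering step, and a mean squared distance to the assigned centroid stands in for the unlabelled loss.

```python
import numpy as np

def unlabelled_centroid_loss(embeddings, n_clusters, n_iters=10, seed=0):
    """Mean squared distance of each embedding to its assigned cluster
    centroid after a few k-means iterations. A low value means the
    unlabelled embeddings sit tightly around their centroids, i.e. close
    to each other. (Illustrative loss, not the method's exact formulation.)
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen embeddings.
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each embedding to its nearest centroid.
        d2 = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = np.stack([
            embeddings[assign == k].mean(axis=0) if (assign == k).any() else centroids[k]
            for k in range(n_clusters)
        ])
    return float(((embeddings - centroids[assign]) ** 2).sum(axis=1).mean())
```

Because this loss only depends on the embeddings, gradients from it would flow into the feature extractor alone, matching the point above that the classifier weights are untouched by the unlabelled data.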

Need for smaller batch size

  • There will tend to be more clusters than classes, so it is possible that groups of unlabelled examples sharing the same true label are assigned to different clusters by the clustering algorithm.
  • This only matters if more than one such group is present in the mini-batch, as conflicting messages will then be sent to the feature extractor about how to handle a class whose examples span more than one cluster.
  • Assuming each class contributes an equal number of clusters, the larger the batch, the more likely it is to contain more than one cluster belonging to the same class.
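The batch-size argument can be made concrete with a toy probability model; the model itself (classes with equal cluster counts, a batch touching a uniformly random set of distinct clusters) is an assumption of this sketch, not something stated in the source.

```python
from math import comb

def collision_prob(n_classes, clusters_per_class, batch_clusters):
    """Probability that a batch touching `batch_clusters` distinct clusters,
    drawn uniformly without replacement from n_classes * clusters_per_class
    clusters, includes two clusters from the same class."""
    total = n_classes * clusters_per_class
    if batch_clusters > n_classes:
        # Pigeonhole: more clusters than classes forces a same-class pair.
        return 1.0
    # Ways to pick batch_clusters clusters all from distinct classes,
    # divided by all ways to pick batch_clusters clusters.
    no_collision = (comb(n_classes, batch_clusters)
                    * clusters_per_class ** batch_clusters
                    / comb(total, batch_clusters))
    return 1.0 - no_collision
```

Under this model, with 10 classes and 3 clusters each, the collision probability rises steeply as the batch spans more clusters, which is the intuition behind preferring smaller batches.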