A transfer learning method that uses quality data from a larger system to improve a specialist's capabilities (Hinton et al., 2015). It has become even more popular with the rise of R1.
Unlike a Mixture of Experts, these specialists only include activations related to their specific fields. Conceptually, most of the parameters within a neural network are unused.
Usually, a NN produces class probabilities via a softmax layer that converts the logit computed for each class into a probability by comparing it with the other logits:
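$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{1}$$

(equation numbering follows Hinton et al. (2015); $z_i$ is the logit for class $i$ and $q_i$ the resulting probability)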
where the temperature $T$ is normally set to 1.
We “distill” the knowledge by training the specialist on a transfer set, using the soft target distribution that the larger system produces for each case.
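As a rough sketch, assuming a PyTorch setup, such a soft-target loss could be implemented as below; the function name and default temperature are illustrative, and the $T^2$ scaling follows Hinton et al. (2015) to keep soft-target gradients on the same scale as hard-label gradients:

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     T: float = 4.0) -> torch.Tensor:
    """Match the specialist's softened distribution to the larger
    model's soft targets, both computed at temperature T."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)      # p_i from the larger model
    log_student = F.log_softmax(student_logits / T, dim=-1)   # log q_i from the specialist
    # KL(p || q) has the same gradient w.r.t. the student as cross-entropy;
    # multiplying by T^2 compensates for the 1/T^2 factor in those gradients.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
```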
Each case in the transfer set contributes a cross-entropy gradient with respect to each logit of the distilled model.
If the larger model has logits $v_i$ that produce soft target probabilities $p_i$ and the transfer training is done at temperature $T$, the gradient is:
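$$\frac{\partial C}{\partial z_i} = \frac{1}{T}\,(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right) \tag{2}$$

where $C$ is the cross-entropy cost.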
Note
If the temperature is high compared to the magnitude of the logits, then we can approximate the following:
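$$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right) \tag{3}$$

where $N$ is the number of classes (this uses $e^{x} \approx 1 + x$ for small $x$).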
Eq. 3 simplifies further if we assume the logits have been zero-meaned separately for each transfer case, so that $\sum_j z_j = \sum_j v_j = 0$:
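$$\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^{2}}\,(z_i - v_i) \tag{4}$$

In this high-temperature limit, distillation is equivalent to matching logits, i.e. minimizing $\tfrac{1}{2}(z_i - v_i)^2$ (Hinton et al., 2015).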