A transfer learning method that uses quality data from a larger system to improve a specialist's capabilities (Hinton et al., 2015). It has become even more popular with the rise of R1.
Unlike a Mixture of Experts, these specialists only include activations related to their specific fields. Conceptually, most of the parameters within a neural network are unused.
Usually, a NN produces class probabilities via a softmax layer that converts the logit computed for each class into a probability by comparing it with the other logits:
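$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{1}$$

(equation numbering follows Hinton et al. (2015); $z_i$ is the logit for class $i$ and $q_i$ the resulting probability)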
where the temperature $T$ is normally set to 1.
We “distill” the knowledge by training the specialist on a transfer set, using the soft target distribution that the larger system produces for each case.
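As a rough sketch, assuming a PyTorch setup, such a soft-target loss could be implemented as below; the function name and default temperature are illustrative, and the $T^2$ scaling follows Hinton et al. (2015) to keep soft-target gradients on the same scale as hard-label gradients:

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     T: float = 4.0) -> torch.Tensor:
    """Match the specialist's softened distribution to the larger
    model's soft targets, both computed at temperature T."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)      # p_i from the larger model
    log_student = F.log_softmax(student_logits / T, dim=-1)   # log q_i from the specialist
    # KL(p || q) has the same gradient w.r.t. the student as cross-entropy;
    # multiplying by T^2 compensates for the 1/T^2 factor in those gradients.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
```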
Each case in the transfer set contributes a cross-entropy gradient with respect to each logit of the distilled model.
If the larger model has logits $v_i$ that produce soft target probabilities $p_i$ and the transfer training is done at temperature $T$, the gradient is:
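$$\frac{\partial C}{\partial z_i} = \frac{1}{T}\,(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right) \tag{2}$$

where $C$ is the cross-entropy cost.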
Note
If the temperature is high compared to the magnitude of the logits, then we can approximate the following:
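$$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right) \tag{3}$$

where $N$ is the number of classes (this uses $e^{x} \approx 1 + x$ for small $x$).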
Eq. 3 simplifies further if we assume the logits have been zero-meaned separately for each transfer case, so that $\sum_j z_j = \sum_j v_j = 0$:
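$$\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^{2}}\,(z_i - v_i) \tag{4}$$

In this high-temperature limit, distillation is equivalent to matching logits, i.e. minimizing $\tfrac{1}{2}(z_i - v_i)^2$ (Hinton et al., 2015).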