a transfer learning method that uses high-quality data from a larger system to improve the capabilities of a smaller specialist model (Hinton et al., 2015). Even more popular now with the rise of R1.

Unlike Mixture of Experts, these specialists only include activations related to their specific field. Conceptually, most parameters within a neural network are unused.

conceptually

usually, a NN produces class probabilities via a softmax layer that converts the logit $z_i$ computed for each class into a probability $q_i$ by comparing $z_i$ with the other logits:

$$q_i = \frac{\exp(z_i/T)}{\sum_{j} \exp(z_j/T)} \tag{1}$$

where the temperature $T$ is often set to 1.
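
As a quick illustration of Eq. 1, here is a minimal numpy sketch (the helper name `softmax_T` is mine): raising $T$ softens the distribution, spreading probability mass onto the non-target classes.

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Eq. 1: q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    e = np.exp((z - np.max(z)) / T)  # max-shift keeps exp from overflowing
    return e / e.sum()

logits = np.array([5.0, 2.0, -1.0])
print(softmax_T(logits, T=1.0))  # peaked:   ~[0.95, 0.05, 0.00]
print(softmax_T(logits, T=5.0))  # softened: ~[0.54, 0.30, 0.16]
```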

We “distill” the knowledge by training the specialist (distilled) model on a soft target distribution over the transfer set.
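
A minimal sketch of that training signal for a single transfer case, assuming the soft targets $p$ come from the larger system's logits $v$ (function names here are illustrative):

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp((z - np.max(z)) / T)
    return e / e.sum()

def soft_target_loss(z_student, v_teacher, T=4.0):
    """Cross-entropy C = -sum_i p_i log q_i between the teacher's softened
    distribution p (the soft targets) and the student's distribution q."""
    p = softmax_T(v_teacher, T)
    q = softmax_T(z_student, T)
    return -(p * np.log(q)).sum()
```

In the paper this soft loss is typically combined with the usual hard-target cross-entropy, with the soft term scaled by $T^2$ so its gradient magnitude stays comparable (cf. Eq. 4).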

Each case in the transfer set contributes a cross-entropy gradient $\partial C/\partial z_i$ with respect to each logit $z_i$ of the distilled model.

If the larger model has logits $v_i$ which produce soft target probabilities $p_i$ and the transfer training is done at temperature $T$, the gradient is given by:

$$\frac{\partial C}{\partial z_i} = \frac{1}{T} (q_i - p_i) = \frac{1}{T} \left( \frac{e^{z_i/T}}{\sum_{j} e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_{j} e^{v_j/T}} \right) \tag{2}$$
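
A quick numerical check of Eq. 2, comparing the analytic gradient $(q_i - p_i)/T$ against a central finite difference of $C = -\sum_i p_i \log q_i$ (all names are illustrative):

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp((z - np.max(z)) / T)
    return e / e.sum()

rng = np.random.default_rng(0)
z, v, T = rng.normal(size=5), rng.normal(size=5), 3.0  # student / teacher logits
p, q = softmax_T(v, T), softmax_T(z, T)

analytic = (q - p) / T  # Eq. 2

def C(zz):  # soft-target cross-entropy at temperature T
    return -(p * np.log(softmax_T(zz, T))).sum()

eps = 1e-6
numeric = np.array([
    (C(z + eps * np.eye(5)[i]) - C(z - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

assert np.allclose(analytic, numeric, atol=1e-6)
```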

Remark

  1. If the temperature is high compared to the magnitude of the logits, then, using $e^{x} \approx 1 + x$, we can approximate the following:

    $$\frac{\partial C}{\partial z_i} \approx \frac{1}{T} \left( \frac{1 + z_i/T}{N + \sum_{j} z_j/T} - \frac{1 + v_i/T}{N + \sum_{j} v_j/T} \right) \tag{3}$$

    We can further simplify Eq. 3 by assuming the logits have been zero-meaned separately for each transfer case, so that $\sum_j z_j = \sum_j v_j = 0$:

    $$\frac{\partial C}{\partial z_i} \approx \frac{1}{NT^2} (z_i - v_i) \tag{4}$$
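
Eq. 4 says that in the high-temperature limit, distillation reduces to matching the logits: $(z_i - v_i)/(NT^2)$ is exactly the gradient of $\frac{1}{2NT^2}\sum_i (z_i - v_i)^2$. A small sketch to see the two gradients agree (values and names are illustrative):

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp((z - np.max(z)) / T)
    return e / e.sum()

rng = np.random.default_rng(1)
N, T = 5, 100.0                        # T large relative to logit magnitude
z = rng.normal(size=N); z -= z.mean()  # zero-meaned per transfer case
v = rng.normal(size=N); v -= v.mean()

exact = (softmax_T(z, T) - softmax_T(v, T)) / T  # Eq. 2
approx = (z - v) / (N * T**2)                    # Eq. 4

print(np.abs(exact - approx).max())  # O(1/T^3): negligible next to the O(1/T^2) gradient
```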