huggingface/open-r1, model, pdf (DeepSeek-AI et al., 2025)

reasoning and distill variants trained on high-quality RL data

scales inference-time compute; builds on DeepSeek-V3 and employs GRPO (Shao et al., 2024)

Three major components:

  • R1-Zero: uses GRPO (Group Relative Policy Optimization) from Shao et al. (2024) 1; a group-relative advantage sketch follows this list
  • R1
  • Distill
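In GRPO there is no learned value baseline: for each prompt a group of G completions is sampled, and each completion's advantage is its reward standardised against the group statistics. A minimal sketch under that reading (the names `grpo_advantages` and `grpo_loss` are mine; the per-token ratios and the KL-to-reference penalty of Shao et al. (2024) are left out):

    import torch

    def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # rewards: (G,) scalar rewards for G sampled completions of one prompt.
        # Group statistics replace the critic: standardise within the group.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-4)

    def grpo_loss(logp_new, logp_old, adv, clip_eps=0.2):
        # PPO-style clipped surrogate over per-completion (summed) log-probs.
        ratio = (logp_new - logp_old).exp()
        clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
        return -torch.min(ratio * adv, clipped * adv).mean()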


DeepSeek-V3

a mixture-of-experts model that uses Multi-head Latent Attention (MLA):

  • auxiliary-loss-free strategy for load balancing
  • a multi-token prediction training objective
  • DualPipe algorithm for efficient pipeline parallelism
  • near-zero all-to-all communication kernels to fully utilise InfiniBand and NVLink bandwidths
  • finer-grained experts, with some experts isolated as shared ones (DeepSeek-AI et al., 2025, sec. 2.1.2)
figure1: DeepSeek-V3 architecture with MLA and MoE
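The point of MLA is to cache one small latent per token instead of full per-head keys and values. A minimal sketch of that compression, with my own class name and dimensions, and leaving out the paper's query compression and decoupled RoPE keys:

    import torch.nn as nn

    class LatentKV(nn.Module):
        def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=64):
            super().__init__()
            self.down = nn.Linear(d_model, d_latent, bias=False)           # compress to latent
            self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand keys
            self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand values

        def forward(self, h):
            # h: (batch, seq, d_model). Only c needs to live in the KV cache;
            # full-size keys and values are re-expanded from it on the fly.
            c = self.down(h)
            return self.up_k(c), self.up_v(c)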

Multi-token prediction

(Gloeckle et al., 2024)

figure2: MTP implementation in DeepSeek, which keeps the causal chain for the prediction of each token at each depth

tl;dr: predict n tokens at once, via a shared trunk and n dedicated attention heads 2

Note that during inference only one head, the next-token head, is employed.
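A sketch of the trunk-plus-heads layout and of the inference shortcut; the class and its heads (plain linear heads here, for brevity) are my own simplification:

    import torch.nn as nn

    class MultiTokenPredictor(nn.Module):
        def __init__(self, trunk: nn.Module, d_model: int, vocab: int, n: int = 4):
            super().__init__()
            self.shared = trunk                                 # shared transformer trunk
            self.heads = nn.ModuleList(
                [nn.Linear(d_model, vocab) for _ in range(n)]   # head i predicts token t+i+1
            )

        def forward(self, x, inference: bool = False):
            z = self.shared(x)                                  # (batch, seq, d_model)
            if inference:
                return self.heads[0](z)                         # next-token head only
            return [head(z) for head in self.heads]             # all n offsets for training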


Remarks

  1. Most of the innovation with respect to RL can be found in the DeepSeekMath paper (Shao et al., 2024)

  2. Gloeckle et al. (2024) employ n=4. The order of the forward and backward passes in an n-token prediction model with n=4 heads on the shared trunk works as follows:

    z = model.shared(x)           # single forward through the shared trunk
    d = z.detach()                # detach so each head's backward stops here
    d.requires_grad = True        # let the heads' gradients accumulate in d.grad

    for i in range(n):
        p = model.heads[i](d)         # forward through head i
        loss(p, y[i]).backward()      # backward for head i; gradient accumulates in d.grad

    z.backward(d.grad)            # one trunk backward with the accumulated gradient
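    The detach is the memory trick: each head's backward frees that head's activations before the next head runs, so peak memory stays close to ordinary next-token training instead of growing with n, which is the motivation given in Gloeckle et al. (2024).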