huggingface/open-r1, model, pdf (DeepSeek-AI et al., 2025)

reasoning and distill variants trained on high-quality RL data

scales inference-time compute; builds on DeepSeek-V3 and employs GRPO (Shao et al., 2024)

Three major components:

  • R1-Zero: uses GRPO (Group Relative Policy Optimization) from Shao et al. (2024) 1; a group-relative advantage sketch follows this list
  • R1
  • Distill
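In GRPO there is no learned value baseline: for each prompt a group of G completions is sampled, and each completion's advantage is its reward standardised against the group statistics. A minimal sketch under that reading (the names `grpo_advantages` and `grpo_loss` are mine; the per-token ratios and the KL-to-reference penalty of Shao et al. (2024) are left out):

    import torch

    def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # rewards: (G,) scalar rewards for G sampled completions of one prompt.
        # Group statistics replace the critic: standardise within the group.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-4)

    def grpo_loss(logp_new, logp_old, adv, clip_eps=0.2):
        # PPO-style clipped surrogate over per-completion (summed) log-probs.
        ratio = (logp_new - logp_old).exp()
        clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
        return -torch.min(ratio * adv, clipped * adv).mean()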


DeepSeek-V3

a mixture-of-experts model that uses Multi-head Latent Attention (MLA):

  • auxiliary-loss-free strategy for load balancing
  • a multi-token prediction training objective
  • DualPipe algorithm for efficient pipeline parallelism
  • near-zero all-to-all communication kernels to fully utilise InfiniBand and NVLink bandwidths
  • finer-grained experts, with some experts isolated as shared ones (DeepSeek-AI et al., 2025, sec. 2.1.2)
figure1: DeepSeek-V3 architecture with MLA and MoE
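The point of MLA is to cache one small latent per token instead of full per-head keys and values. A minimal sketch of that compression, with my own class name and dimensions, and leaving out the paper's query compression and decoupled RoPE keys:

    import torch.nn as nn

    class LatentKV(nn.Module):
        def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=64):
            super().__init__()
            self.down = nn.Linear(d_model, d_latent, bias=False)           # compress to latent
            self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand keys
            self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand values

        def forward(self, h):
            # h: (batch, seq, d_model). Only c needs to live in the KV cache;
            # full-size keys and values are re-expanded from it on the fly.
            c = self.down(h)
            return self.up_k(c), self.up_v(c)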

Multi-token prediction

(Gloeckle et al., 2024)

figure2: MTP implementation in DeepSeek, which keeps the causal chain for the prediction of each token at each depth

tl;dr: predict n tokens at once, via a shared trunk and n dedicated attention heads 2

Note that during inference only one head, the next-token head, is employed.
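A sketch of the trunk-plus-heads layout and of the inference shortcut; the class and its heads (plain linear heads here, for brevity) are my own simplification:

    import torch.nn as nn

    class MultiTokenPredictor(nn.Module):
        def __init__(self, trunk: nn.Module, d_model: int, vocab: int, n: int = 4):
            super().__init__()
            self.shared = trunk                                 # shared transformer trunk
            self.heads = nn.ModuleList(
                [nn.Linear(d_model, vocab) for _ in range(n)]   # head i predicts token t+i+1
            )

        def forward(self, x, inference: bool = False):
            z = self.shared(x)                                  # (batch, seq, d_model)
            if inference:
                return self.heads[0](z)                         # next-token head only
            return [head(z) for head in self.heads]             # all n offsets for training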


Remarks

  1. Most of the innovation with respect to RL can be found in the DeepSeekMath paper (Shao et al., 2024)

  2. Gloeckle et al. (2024) employ n=4. The order of the forward and backward passes in an n-token prediction model with n=4 heads on the shared trunk works as follows:

    z = model.shared(x)           # single forward through the shared trunk
    d = z.detach()                # detach so each head's backward stops here
    d.requires_grad = True        # let the heads' gradients accumulate in d.grad

    for i in range(n):
        p = model.heads[i](d)         # forward through head i
        loss(p, y[i]).backward()      # backward for head i; gradient accumulates in d.grad

    z.backward(d.grad)            # one trunk backward with the accumulated gradient
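    The detach is the memory trick: each head's backward frees that head's activations before the next head runs, so peak memory stays close to ordinary next-token training instead of growing with n, which is the motivation given in Gloeckle et al. (2024).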