Group Relative Policy Optimization

and RL.

Tag: ml
Published: Apr 26, 2025
Modified: May 31, 2025
Reading time: 1 min (50 words)
Source: llms.txt

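GRPO is an RL policy optimization method that forgoes the critic model, which is typically the same size as the policy model, and instead estimates the baseline from the scores of a group of sampled outputs.

For each question $q$, it samples a group of outputs $\{ o_{1}, o_{2}, \dots, o_{G} \}$ from the old policy $\pi_{\theta_{\text{old}}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing: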

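\begin{aligned} \mathcal{J}_{\text{GRPO}}(\theta) &= \mathbf{E}\left[ q \sim P(Q), \{ o_{i} \}^{G}_{i=1} \sim \pi_{\theta_{\text{old}}}(O|q) \right] \\ &\quad \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_{\theta}(o_{i}|q)}{\pi_{\theta_{\text{old}}}(o_{i}|q)} A_{i},\ \text{clip}\left( \frac{\pi_{\theta}(o_{i}|q)}{\pi_{\theta_{\text{old}}}(o_{i}|q)}, 1-\varepsilon, 1+\varepsilon \right) A_{i} \right) - \beta\, \mathbb{D}_{\text{KL}}\left( \pi_{\theta} \,\|\, \pi_{\text{ref}} \right) \right) \end{aligned}

where $A_{i} = \frac{r_{i} - \text{mean}(\{ r_{1}, \dots, r_{G} \})}{\text{std}(\{ r_{1}, \dots, r_{G} \})}$ is the group-relative advantage computed from the rewards $r_{i}$ of the sampled outputs, $\varepsilon$ is the clipping range, and $\beta$ weights the KL penalty toward a reference policy $\pi_{\text{ref}}$.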
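To make the group-based baseline concrete, here is a minimal sketch of how one sampled group could be scored. It assumes the commonly published form of the GRPO objective: a PPO-style clipped probability ratio weighted by a group-normalized advantage, minus a KL penalty toward a reference policy. The function names, the sequence-level log-probabilities, the use of the population standard deviation, the small `eps` guard, and the default values of `clip_eps` and `beta` are all illustrative assumptions, not something this note specifies.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own group.

    Assumption: population std plus a small eps so that a group with
    identical rewards yields zero advantages instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_objective(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, beta=0.04):
    """Estimate the objective for one question from G sampled outputs.

    logp_new, logp_old, logp_ref: per-output log-probabilities under the
    current, old, and reference policies (hypothetical inputs; a real
    trainer would usually work per token).
    """
    advantages = group_relative_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old,
                                           logp_ref, advantages):
        # Probability ratio pi_theta / pi_theta_old.
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * adv, clipped * adv)
        # Non-negative KL estimate: x - log(x) - 1 with x = pi_ref / pi_theta.
        log_x = lp_ref - lp_new
        kl = math.exp(log_x) - log_x - 1.0
        total += surrogate - beta * kl
    return total / len(rewards)
```

No value network appears anywhere in the sketch: the baseline is simply the mean reward of the group, which is why no critic is needed. In practice the negated objective would be used as a loss inside an autodiff framework.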
