Self-Distilled Policy Gradient

Abstract

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. SDPG instantiates this signal as an auxiliary full-vocabulary student-to-teacher reverse KL loss and combines it with group-relative verifier advantages normalized by the group standard deviation and with reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR (reinforcement learning with verifiable rewards) and self-distillation baselines.

Method

Verifier rewards plus privileged self-distillation

SDPG trains one deployable policy under two views of the same model: an ordinary student view that sees only the problem, and a privileged teacher view that additionally sees answer-side context. The method keeps the verifier as the final arbiter, while using the privileged distribution to shape token-level credit assignment on useful rollouts.

1. Outcome policy gradient

SDPG keeps the binary verifier objective used in RLVR and computes group-relative advantages over sampled responses, preserving the selection pressure that helps the policy discover correct solutions.

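2. Full-vocabulary OPD

The same model serves as student and privileged teacher. On sampled prefixes, SDPG minimizes the on-policy distillation (OPD) loss $D_{\mathrm{KL}}(p_t\|\mathrm{SG}[q_t])$, where $p_t$ is the student's next-token distribution, $q_t$ is the privileged teacher's next-token distribution, and $\mathrm{SG}$ denotes stop-gradient. This gives dense token-level guidance without a separate larger teacher.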

3. Policy anchor and gates

Reference-policy KL, positive-advantage gating, and a warmup-decay schedule for $\beta(k)$ keep the privileged signal useful without over-constraining the reasoning policy.

Local policy-gradient view

With the privileged branch detached, reverse-KL OPD has the same fixed-prefix student-side gradient as a detached-sampling update with a centered log teacher/student ratio advantage.

$\nabla_\theta D_{\mathrm{KL}}(p_t\|\mathrm{SG}[q_t]) \;\Longleftrightarrow\; \nabla_\theta\mathbb E_{a\sim p_t}\left[-\log p_t(a)\,\mathrm{SG}\!\left(\bar D_t-\log\dfrac{\bar p_t(a)}{\bar q_t(a)}\right)\right]$
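The equivalence can be checked numerically for a single next-token distribution. The sketch below is an illustration under stated assumptions, not the authors' code: the student is taken to be a softmax over logits `z`, the teacher `q` is a fixed (detached) vector, the barred quantities are read as detached values, and $\bar D_t$ is read as the detached KL value that centers the log ratio. It compares a finite-difference gradient of the KL with the score-function gradient of the surrogate.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def reverse_kl(z, q):
    """KL(p || q) with p = softmax(z) and q held fixed (detached teacher)."""
    p = softmax(z)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_grad_finite_diff(z, q, h=1e-6):
    """Central finite-difference gradient of KL(p || q) w.r.t. the logits."""
    grad = []
    for j in range(len(z)):
        zp = list(z)
        zm = list(z)
        zp[j] += h
        zm[j] -= h
        grad.append((reverse_kl(zp, q) - reverse_kl(zm, q)) / (2 * h))
    return grad

def surrogate_grad(z, q):
    """Gradient of E_{a~p}[-log p(a) * A(a)] with sampling and A detached.

    A(a) = D - log(p(a)/q(a)) is the centered log teacher/student ratio.
    For a softmax, d log p(a) / d z_j = 1[a == j] - p_j.
    """
    p = softmax(z)
    log_ratio = [math.log(pi / qi) for pi, qi in zip(p, q)]
    d = sum(pi * lr for pi, lr in zip(p, log_ratio))  # detached KL value
    adv = [d - lr for lr in log_ratio]
    n = len(z)
    grad = [0.0] * n
    for a in range(n):
        for j in range(n):
            dlogp = (1.0 if a == j else 0.0) - p[j]
            grad[j] += p[a] * (-dlogp) * adv[a]
    return grad

z = [0.3, -1.2, 0.8, 0.1]             # hypothetical student logits
q = softmax([1.0, 0.0, -0.5, 0.4])    # hypothetical teacher distribution
fd = kl_grad_finite_diff(z, q)
pg = surrogate_grad(z, q)
max_gap = max(abs(a - b) for a, b in zip(fd, pg))
print(max_gap)  # should be at finite-difference noise level
```

The two gradients agree because the constant term in the derivative of $p\log p$ vanishes when summed over the vocabulary, which is also why subtracting $\bar D_t$ acts as a baseline and does not change the expected gradient.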

Objective Analysis

What each term contributes

The SDPG loss is deliberately decomposed into sparse selection, dense privileged guidance, and policy anchoring. This makes the optimization behavior easier to reason about than treating self-distillation as a black-box auxiliary loss.

1. Group-relative outcome advantage

$A_{\mathrm{out}}^{(i)} = \dfrac{R(x,y^{(i)})-\mu_G}{\sigma_G+\epsilon_{\mathrm{std}}}$

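The verifier supplies sequence-level rewards $R(x,y^{(i)})$; $\mu_G$ and $\sigma_G$ are the mean and standard deviation of the rewards within a sampled group. Normalizing within the group keeps the update comparative: correct rollouts are promoted, incorrect rollouts are suppressed, and a group whose sampled rewards all match contributes nothing because every advantage is zero.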
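As a minimal sketch of this normalization, the function below computes the advantages for one group of rewards. The function name, the default `eps_std`, and the use of the population standard deviation are assumptions made for illustration; the source does not specify them.

```python
import math

def group_relative_advantages(rewards, eps_std=1e-6):
    """Return (R - mean) / (std + eps_std) for each reward in one group."""
    n = len(rewards)
    mu = sum(rewards) / n
    # Population standard deviation (assumption; a sample estimate is also plausible).
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    return [(r - mu) / (sigma + eps_std) for r in rewards]

# Mixed group: correct rollouts get positive advantages, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Uniform group: every numerator is zero, so the group carries no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed, uniform)
```

The `eps_std` term only matters for uniform groups, where it prevents a division by zero while leaving the advantages at exactly zero.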

2. Exact full-vocabulary OPD

$\ell^{\mathrm{OPD}}_{i,t} = \sum_{a\in\mathcal V}p_{i,t}(a) \log\dfrac{p_{i,t}(a)}{\mathrm{SG}[q_{i,t}(a)]}$

Instead of distilling only the sampled token, SDPG compares the full next-token distributions. This gives dense supervision over every vocabulary item at a sampled reasoning prefix.
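The per-token loss can be sketched directly from the formula. This is an illustrative value-only computation in plain Python, so the stop-gradient on the teacher is implicit; the function name and the example distributions are hypothetical, and teacher probabilities are assumed to be strictly positive.

```python
import math

def opd_token_loss(student_probs, teacher_probs):
    """Reverse KL sum_a p(a) * log(p(a) / q(a)) over the full vocabulary."""
    loss = 0.0
    for p, q in zip(student_probs, teacher_probs):
        if p > 0.0:  # terms with p(a) = 0 contribute zero by convention
            loss += p * math.log(p / q)
    return loss

# Hypothetical three-token vocabulary at one sampled prefix.
student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.2, 0.7]
loss = opd_token_loss(student, teacher)
print(loss)
```

Every vocabulary item with nonzero student probability contributes to the sum, which is what distinguishes this loss from a sampled-token estimate that would only look at the single token drawn at that position.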

3. Positive-advantage gate

$m_i=\mathbf 1[A_{\mathrm{out}}^{(i)}>0],\quad \mathcal L_{\mathrm{OPD}}^+ =\mathbb E\left[\sum_{i,t}m_i\ell^{\mathrm{OPD}}_{i,t}\right]$

Privileged context can still produce plausible continuations on a globally wrong trajectory. The gate applies OPD only when the verifier prefers the rollout within its group.
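A small sketch of the gate, assuming the per-token OPD losses have already been computed for each rollout. The function name and the toy numbers are hypothetical, and the expectation over prompts is omitted so that only the inner gated sum is shown.

```python
def gated_opd_loss(advantages, token_losses):
    """Sum OPD token losses over rollouts whose outcome advantage is positive.

    advantages[i]    outcome advantage of rollout i
    token_losses[i]  list of per-token OPD losses for rollout i
    """
    total = 0.0
    for adv, losses in zip(advantages, token_losses):
        gate = 1.0 if adv > 0 else 0.0  # m_i = 1[A_out > 0]
        total += gate * sum(losses)
    return total

# Only the first rollout passes the gate; zero and negative advantages are masked.
gated = gated_opd_loss([1.0, -1.0, 0.0], [[0.2, 0.3], [5.0], [7.0]])
print(gated)
```

Because the indicator is strict, rollouts from groups with identical rewards (all advantages zero) receive no distillation signal either.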

4. Warmup-decay distillation weight

$\beta(k)=\beta_{\mathrm{base}} \min(1,k/T_{\mathrm{warm}}) \min(1,(T-k)/T_{\mathrm{decay}})$

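Here $k$ is the current training step, $T$ is the total number of steps, and $T_{\mathrm{warm}}$ and $T_{\mathrm{decay}}$ are the lengths of the warmup and decay windows. Early warmup avoids trusting a noisy privileged target too soon. Late decay releases the model from information that is unavailable at inference after the useful signal has been internalized.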
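The schedule is easy to tabulate. In the sketch below, `t_total=400` matches the 400 training steps reported in the experiments, while `beta_base`, `t_warm`, and `t_decay` are hypothetical values chosen only to make the shape visible.

```python
def beta_schedule(k, beta_base, t_warm, t_decay, t_total):
    """beta(k) = beta_base * min(1, k / T_warm) * min(1, (T - k) / T_decay)."""
    warm = min(1.0, k / t_warm)
    decay = min(1.0, (t_total - k) / t_decay)
    return beta_base * warm * decay

# Hypothetical settings: linear ramp over 50 steps, linear decay over the last 100.
for k in (0, 25, 200, 350, 400):
    print(k, beta_schedule(k, beta_base=1.0, t_warm=50, t_decay=100, t_total=400))
```

With these settings the weight is zero at step 0, reaches `beta_base` once warmup ends, stays flat in the middle of training, and returns to zero at step 400. The formula as written goes negative for `k > t_total`, so a training loop that runs past `T` would need an explicit clamp.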

Two KL anchors used in SDPG

SDPG evaluates unnormalized KL regularization against a fixed reference policy. The reverse form penalizes squared log drift, while the forward form uses an inverse-ratio plus log-ratio term. Both variants keep the student close enough to the reference model that dense distillation does not dominate the reward objective.

URKL surrogate: $\mathcal L_{\mathrm{URKL}} = \mathcal L_{\mathrm{R\&D}}+ \alpha\,\mathbb E\left[\frac{1}{2}\log^2 \frac{\pi_\theta(y_t\mid s_t)}{\pi_{\mathrm{ref}}(y_t\mid s_t)}\right]$

UFKL surrogate: $\mathcal L_{\mathrm{UFKL}} = \mathcal L_{\mathrm{R\&D}}+ \alpha\,\mathbb E\left[ \frac{\pi_{\mathrm{ref}}(y_t\mid s_t)}{\pi_\theta(y_t\mid s_t)} +\log\frac{\pi_\theta(y_t\mid s_t)}{\pi_{\mathrm{ref}}(y_t\mid s_t)}\right]$
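To make the difference between the two anchors concrete, the sketch below evaluates the bracketed per-token terms from log-probabilities. The function names are hypothetical, and only the penalty inside the expectation is shown, without $\alpha$ or $\mathcal L_{\mathrm{R\&D}}$.

```python
import math

def urkl_penalty(logp_theta, logp_ref):
    """0.5 * log^2(pi_theta / pi_ref) for one sampled token."""
    return 0.5 * (logp_theta - logp_ref) ** 2

def ufkl_penalty(logp_theta, logp_ref):
    """pi_ref / pi_theta + log(pi_theta / pi_ref) for one sampled token."""
    log_ratio = logp_theta - logp_ref
    return math.exp(-log_ratio) + log_ratio

# Equal log-probabilities: URKL is 0, UFKL sits at its minimum value of 1.
print(urkl_penalty(-1.0, -1.0), ufkl_penalty(-1.0, -1.0))
# Policy below the reference (log ratio -1) vs above it (log ratio +1).
print(ufkl_penalty(-2.0, -1.0), ufkl_penalty(-1.0, -2.0))
print(urkl_penalty(-2.0, -1.0), urkl_penalty(-1.0, -2.0))
```

Both terms are minimized when the policy matches the reference on the sampled token. The URKL term is symmetric in the log ratio, whereas the UFKL term as written grows faster when the policy assigns less probability than the reference, and its minimum value is 1 rather than 0, a constant offset that does not affect gradients.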

The distillation coefficient first warms up, then decays. This makes privileged OPD strongest after initial exploration has found useful trajectories, and weaker near the end of training.

Training loop

1. Sample prompts with privileged contexts $(x,c)$ from the training set.
2. Generate a group of responses from the unprivileged policy $\pi_\theta(\cdot\mid x)$.
3. Score each response with the binary verifier and compute group-relative advantages.
4. For positively advantaged rollouts, compute full-vocabulary OPD against the detached privileged distribution.
5. Update the policy with outcome loss, gated OPD, and reference-policy KL regularization.
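The loss assembled in the last three steps can be sketched for a single group as follows. This is a value-only illustration in plain Python, not the authors' implementation: sampling and the verifier are replaced by given inputs, the outcome term is written as a plain REINFORCE-style surrogate (the source does not state whether clipping or importance ratios are used), the anchor uses the URKL form, and token-level averaging is omitted. All names are hypothetical.

```python
import math

def sdpg_group_loss(rewards, student_probs, teacher_probs, sampled_tokens,
                    ref_logps, beta, alpha, eps_std=1e-6):
    """Loss value for one group of rollouts (no autograd; values only).

    rewards[i]            binary verifier reward of rollout i
    student_probs[i][t]   student next-token distribution at step t
    teacher_probs[i][t]   privileged teacher distribution (treated as detached)
    sampled_tokens[i][t]  index of the token actually sampled at step t
    ref_logps[i][t]       reference-policy log-probability of that token
    """
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    outcome = opd = anchor = 0.0
    for i in range(n):
        adv = (rewards[i] - mu) / (sigma + eps_std)  # group-relative advantage
        for t, tok in enumerate(sampled_tokens[i]):
            p, q = student_probs[i][t], teacher_probs[i][t]
            logp = math.log(p[tok])
            outcome += -adv * logp  # assumed REINFORCE-style outcome surrogate
            if adv > 0:  # positive-advantage gate on full-vocabulary OPD
                opd += sum(pa * math.log(pa / qa)
                           for pa, qa in zip(p, q) if pa > 0)
            anchor += 0.5 * (logp - ref_logps[i][t]) ** 2  # URKL-style anchor
    return outcome + beta * opd + alpha * anchor

# Toy group: two one-token rollouts over a two-token vocabulary.
uniform = [[[0.5, 0.5]], [[0.5, 0.5]]]
teacher = [[[0.9, 0.1]], [[0.5, 0.5]]]  # teacher disagrees on the correct rollout
tokens = [[0], [1]]
ref = [[math.log(0.5)], [math.log(0.5)]]
loss = sdpg_group_loss([1.0, 0.0], uniform, teacher, tokens, ref,
                       beta=0.1, alpha=0.01)
print(loss)
```

In this toy case the outcome and anchor terms cancel or vanish, so the loss is entirely the gated OPD term from the correct rollout. Changing the teacher distribution on the incorrect rollout leaves the loss unchanged, which is the effect of the gate.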

Experiments

Stable gains on mathematical reasoning

Experiments use Qwen3 models trained for 400 steps on DAPO-Math-17k and evaluated on AIME2024, AIME2025, and AMC23. The reported metric is pass@1 averaged over 32 sampled responses per problem (mean@32).
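For readers unfamiliar with the metric, the sketch below shows one common reading of mean@k: the fraction of correct samples per problem, averaged over problems. This reading, the function name, and the toy data are assumptions; the source does not spell out the exact evaluation protocol.

```python
def mean_at_k(correct):
    """Average per-problem accuracy over k sampled responses.

    correct[j][s] is 1 if sample s for problem j passed the verifier, else 0.
    """
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)

# Toy example with k = 4 samples for two problems (the experiments use k = 32).
score = mean_at_k([[1, 1, 0, 0], [1, 0, 0, 0]])
print(score)
```

Averaging over many samples reduces the variance of single-sample pass@1, which matters on small benchmarks such as AIME where each test set has few problems.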

Early lift: The accuracy gap between SDPG and GRPO opens within roughly the first 50 steps on the 4B run.

Entropy stability: SDPG-UFKL keeps actor entropy substantially higher, while RLSD collapses toward zero around step 250.

Shorter reasoning: SDPG response lengths settle between terse collapse and GRPO's more verbose outputs.

Qwen3-4B performance after 400 training steps
Method	AIME24 Last	AIME24 Best	AIME25 Last	AIME25 Best	AMC23 Last	AMC23 Best
GRPO	0.280	0.316	0.242	0.279	0.714	0.739
RLSD	0.378	0.395	0.300	0.304	0.813	0.813
SDPG-URKL	0.380	0.401	0.307	0.308	0.863	0.863
SDPG-UFKL	0.380	0.408	0.327	0.335	0.858	0.870

Experiments on Qwen3-4B

The top row tracks held-out benchmark accuracy. SDPG-URKL and SDPG-UFKL stay above GRPO and RLSD for most of training on AIME24, AIME25, and AMC23. The bottom row explains why: reward rises quickly, entropy remains healthy, and response length does not collapse.

Qwen3-4B training dynamics: SDPG variants improve AIME and AMC accuracy, reach reward plateaus earlier, and avoid the entropy collapse seen in RLSD.

Experiments on Qwen3-1.7B

Smaller models make pure self-distillation more fragile. OPCD degrades after step 250, while SDPG keeps the verifier objective and policy anchor active, preventing the sharp response-length and reward collapse seen in the self-distillation-only baseline.

Qwen3-1.7B results show the same pattern: SDPG outperforms GRPO, RLSD, and a pure self-distillation baseline across benchmarks.

Ablation

Why the pieces matter

Removing the OPD term loses the early-training accuracy advantage on AIME24 and AIME25, confirming that privileged full-vocabulary distillation is the main source of fast convergence on hard tasks.

Removing the reference-policy KL can keep accuracy competitive, but leads to shortened responses and rising entropy, indicating weaker control over coherent reasoning patterns.

Qwen3-4B ablation over OPD and KL regularization. The top row shows benchmark accuracy; the bottom row shows reward, entropy, and response length. Setting $\beta=0$ removes the dense OPD signal, while setting $\alpha=0$ removes the reference-policy anchor.

Citation

@article{liu2026self,
  title   = {Self-Distilled Policy Gradient},
  author  = {Liu, Yifeng and Zhang, Shiyuan and Zhang, Yifan and Gu, Quanquan},
  journal = {arXiv preprint arXiv:2606.04036},
  year    = {2026}
}