Abstract
Trust-Region behavior Blending improves on-policy distillation by replacing early poor-quality student rollouts with teacher-like behavior within a KL trust region during warmup.
On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
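The per-prefix trust-region step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: the behavior policy is taken to be a linear mixture of the student and teacher next-token distributions, the constraint is KL(behavior || student) <= budget, and the budget decays linearly over warmup. All function names (`blend`, `largest_feasible_beta`, `linear_budget`) are hypothetical. Under these assumptions the KL is nondecreasing in the mixing weight beta, so the largest feasible beta can be found by bisection.

```python
import numpy as np


def kl(q, p, tiny=1e-12):
    """KL(q || p) for two probability vectors over the same vocabulary."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q == 0 contribute nothing
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(np.maximum(p[mask], tiny)))))


def blend(student, teacher, beta):
    """Assumed behavior family: linear mixture, beta=0 is student, beta=1 is teacher."""
    student = np.asarray(student, dtype=float)
    teacher = np.asarray(teacher, dtype=float)
    return (1.0 - beta) * student + beta * teacher


def largest_feasible_beta(student, teacher, budget, iters=40):
    """Bisection for the largest beta with KL(blend || student) <= budget.

    Relies on KL(blend(beta) || student) being nondecreasing in beta,
    which holds for the linear mixture assumed here.
    """
    if budget <= 0.0:
        return 0.0  # no budget: pure student rollouts
    if kl(blend(student, teacher, 1.0), student) <= budget:
        return 1.0  # teacher itself lies inside the trust region
    lo, hi = 0.0, 1.0  # invariant: lo is feasible, hi is infeasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def linear_budget(step, warmup_steps, initial_budget):
    """Assumed schedule: KL budget decays linearly to zero at the end of warmup."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)


if __name__ == "__main__":
    student = np.array([0.7, 0.2, 0.1])
    teacher = np.array([0.1, 0.2, 0.7])
    for step in (0, 50, 100):
        eps = linear_budget(step, warmup_steps=100, initial_budget=0.1)
        beta = largest_feasible_beta(student, teacher, eps)
        print(step, eps, beta, blend(student, teacher, beta))
```

In this sketch the next token would be sampled from `blend(student, teacher, beta)` during warmup, while the reverse-KL loss on each prefix stays unchanged. Once the budget reaches zero, beta is zero and sampling reduces to pure student rollouts.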
Community
Trust-Region Behavior Blending for On-Policy Distillation
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation (2026)
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training (2026)
- ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation (2026)
- Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning (2026)
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models (2026)
- On-Policy Distillation with Best-of-N Teacher Rollout Selection (2026)
- When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning (2026)
Made an audio walkthrough of this paper for anyone who wants to skim it on the go:
https://researchpod.app/episode/17f89787-4413-46ff-add5-ba4d0ad3f9ca
Generated automatically by ResearchPod — happy to take feedback from the authors.
the per-prefix solver that searches for the largest feasible interpolation beta inside the KL budget is a neat trick, but i wonder how sensitive it is to the top-k truncation used to estimate the teacher signal. monotone beta family makes the binary search cheap, yet early in training the kl budget can be tiny; does that push trb too close to pure teacher guidance and hurt the student's own exploration? an ablation varying the warmup horizon or the annealing schedule would help show whether the gains rely on a long grace period or just a couple of early steps. btw the arxivlens breakdown helped me parse the method details, it unwinds the per-prefix decision and the top-k cue nicely: https://arxivlens.com/PaperView/Details/trust-region-behavior-blending-for-on-policy-distillation-4613-f51b7d34. overall, trb seems like a principled bridge rather than a hack, and i can see it being handy for other on-policy setups where teacher signals are noisy early on.
Get this paper in your agent:
hf papers read 2605.31159
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash