arxiv:2605.31159

Trust-Region Behavior Blending for On-Policy Distillation

Published on May 29 · Submitted by Alexey Gorbatovski on Jun 1 · #2 Paper of the day

Abstract

Trust-Region behavior Blending improves on-policy distillation by replacing early poor-quality student rollouts with teacher-like behavior within a KL trust region during warmup.

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
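As a rough illustration of the warmup mechanism, the sketch below picks, for one prefix, the most teacher-like next-token distribution that stays within a KL budget of the student, then anneals that budget to zero. It is a sketch under stated assumptions, not the paper's implementation: the abstract does not specify the interpolation family, the KL direction, or the schedule. Here the behavior policy is assumed to be a geometric blend of student and teacher, the constraint is assumed to be KL(behavior || student), and the budget is assumed to decay linearly. All function names are hypothetical.

```python
import math

def geometric_blend(student, teacher, beta):
    """Assumed family: q_beta proportional to student^(1-beta) * teacher^beta.

    Both inputs are next-token distributions over the same vocabulary with
    strictly positive entries. beta=0 gives the student, beta=1 the teacher.
    """
    logits = [(1.0 - beta) * math.log(s) + beta * math.log(t)
              for s, t in zip(student, teacher)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]

def kl(q, p):
    """KL(q || p) for two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def largest_feasible_beta(student, teacher, budget, iters=40):
    """Binary search for the largest beta with KL(q_beta || student) <= budget.

    Along this geometric path the KL to the student is non-decreasing in
    beta, which is what makes a simple bisection valid.
    """
    if budget <= 0.0:
        return 0.0  # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return 1.0  # the teacher itself is inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def annealed_budget(step, warmup_steps, initial_budget):
    """Assumed linear schedule: the budget reaches zero at the end of warmup."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)

# Toy example on a 3-token vocabulary.
student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.2, 0.7]
for step in (0, 50, 100):
    eps = annealed_budget(step, warmup_steps=100, initial_budget=0.2)
    beta = largest_feasible_beta(student, teacher, eps)
    behavior = geometric_blend(student, teacher, beta)
    print(step, round(eps, 3), round(beta, 3), [round(p, 3) for p in behavior])
```

In this sketch, rollout tokens during warmup would be sampled from `behavior` rather than from `student`, while the distillation loss on each visited prefix stays the same. Once the budget hits zero, `beta` is zero and the behavior policy coincides with the student.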

Community

Paper author Paper submitter
edited 2 days ago

Trust-Region Behavior Blending for On-Policy Distillation

https://x.com/AMyashka/status/2061431653797425288

Made an audio walkthrough of this paper for anyone who wants to skim it on the go:
https://researchpod.app/episode/17f89787-4413-46ff-add5-ba4d0ad3f9ca

Generated automatically by ResearchPod — happy to take feedback from the authors.

the per-prefix solver that searches for the largest feasible interpolation beta inside the KL budget is a neat trick, but i wonder how sensitive it is to the top-k truncation used to estimate the teacher signal. monotone beta family makes the binary search cheap, yet early in training the kl budget can be tiny; does that push trb too close to pure teacher guidance and hurt the student's own exploration? an ablation varying the warmup horizon or the annealing schedule would help show whether the gains rely on a long grace period or just a couple of early steps. btw the arxivlens breakdown helped me parse the method details, it unwinds the per-prefix decision and the top-k cue nicely: https://arxivlens.com/PaperView/Details/trust-region-behavior-blending-for-on-policy-distillation-4613-f51b7d34. overall, trb seems like a principled bridge rather than a hack, and i can see it being handy for other on-policy setups where teacher signals are noisy early on.
