arxiv:2605.31159

Trust-Region Behavior Blending for On-Policy Distillation

Published on May 29 · Submitted by Alexey Gorbatovski on Jun 1

#2 Paper of the day

T-Tech


Authors: Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Daniil Gavrilov

Abstract

Trust-Region behavior Blending (TRB) improves on-policy distillation during warmup by sampling rollouts from the closest-to-teacher behavior policy inside a student-centered KL trust region, rather than from the still-weak student alone.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.


Community

Myashka

Paper author · Paper submitter · 3 days ago · edited 2 days ago

Trust-Region Behavior Blending for On-Policy Distillation

https://x.com/AMyashka/status/2061431653797425288


noahml

1 day ago

Made an audio walkthrough of this paper for anyone who wants to skim it on the go:
https://researchpod.app/episode/17f89787-4413-46ff-add5-ba4d0ad3f9ca

Generated automatically by ResearchPod — happy to take feedback from the authors.

avahal

9 minutes ago

the per-prefix solver that searches for the largest feasible interpolation beta inside the KL budget is a neat trick, but i wonder how sensitive it is to the top-k truncation used to estimate the teacher signal. monotone beta family makes the binary search cheap, yet early in training the kl budget can be tiny; does that push trb too close to pure teacher guidance and hurt the student's own exploration? an ablation varying the warmup horizon or the annealing schedule would help show whether the gains rely on a long grace period or just a couple of early steps. btw the arxivlens breakdown helped me parse the method details, it unwinds the per-prefix decision and the top-k cue nicely: https://arxivlens.com/PaperView/Details/trust-region-behavior-blending-for-on-policy-distillation-4613-f51b7d34. overall, trb seems like a principled bridge rather than a hack, and i can see it being handy for other on-policy setups where teacher signals are noisy early on.



