arxiv:2606.18216

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Published on Jun 16 · Submitted by Byung-Kwan Lee on Jun 17
#2 Paper of the day

Abstract

Zone of Proximal Policy Optimization (ZPPO) improves knowledge distillation by using reformulated prompts that help students learn from both correct and incorrect responses, enhancing performance especially at smaller model sizes.

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage, so the question is silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
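The prompt replay buffer described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `PromptReplayBuffer`, its methods, and the status strings are hypothetical, and the admission rule (a question enters only when every rollout fails) is an assumption drawn from the abstract's description of hard questions. The graduation threshold of 0.5 and the FIFO eviction come from the abstract.

```python
from collections import OrderedDict


class PromptReplayBuffer:
    """Illustrative FIFO buffer of hard questions (hypothetical, not the paper's code)."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity        # assumed to be at least 1
        self.graduate_at = graduate_at  # "reaches half" in the abstract
        self._items = OrderedDict()     # question id -> question text, oldest first

    def update(self, qid, question, rollout_correct):
        """Record one group of rollouts for a question and return its new status."""
        accuracy = sum(bool(ok) for ok in rollout_correct) / len(rollout_correct)
        if qid in self._items:
            if accuracy >= self.graduate_at:
                del self._items[qid]    # graduated: the question leaves the buffer
                return "graduated"
            return "kept"               # still hard: keep recirculating it
        if accuracy > 0.0:
            # Assumption: at least one rollout succeeded, so ordinary RL has a signal.
            return "skipped"
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # FIFO eviction of the oldest question
        self._items[qid] = question
        return "admitted"

    def pending(self):
        """Question ids currently waiting to be replayed, oldest first."""
        return list(self._items)
```

In a training loop under these assumptions, `pending()` would supply the questions for which BCQ and NCQ prompts are rebuilt at each step, and `update()` would be called once per question after its rollouts are scored.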

Community

Our research question is:

"For hard questions, how can we transfer the teacher's knowledge to the student without imitating the teacher's logits or injecting the teacher's response directly into the student's gradient? In other words, how can we make the student solve the hard question without policy drift (which degrades generalization)?"

ZPPO is a new RL post-training recipe and our answer to that research question: teacher in prompts, not gradients. For hard questions, BCQ lets the student compare a correct teacher response with a wrong student response, while NCQ shows the student its own wrong rollouts as negatives. This recovers a learning signal where ordinary RL gets zero advantage, while keeping the policy gradient on-policy.
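To make the zero-advantage problem and the two reformulated prompts concrete, here is a small sketch. It assumes a GRPO-style group-normalized advantage, and the helper names (`group_advantages`, `build_bcq`, `build_ncq`) and the prompt wording are hypothetical; the paper's actual templates are not given on this page. Shuffling the two BCQ candidates is likewise an assumption about how anonymization might be done.

```python
import random


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward minus the group mean, divided by the group std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def build_bcq(question, teacher_correct, student_wrong, rng):
    """Binary Candidate-included Question: two anonymized candidates to discriminate."""
    candidates = [teacher_correct, student_wrong]
    rng.shuffle(candidates)  # assumption: random order so position does not reveal the source
    return (
        f"{question}\n\n"
        f"Candidate A:\n{candidates[0]}\n\n"
        f"Candidate B:\n{candidates[1]}\n\n"
        "Exactly one candidate is correct. Decide which one, then answer the question."
    )


def build_ncq(question, wrong_rollouts):
    """Negative Candidate-included Question: all wrong rollouts gathered in one prompt."""
    attempts = "\n\n".join(
        f"Incorrect attempt {i}:\n{text}" for i, text in enumerate(wrong_rollouts, start=1)
    )
    return (
        f"{question}\n\n{attempts}\n\n"
        "All attempts above are incorrect. Identify what they get wrong, then answer the question."
    )


# When every rollout fails, all rewards are equal and every advantage is zero,
# so the original question contributes no gradient.
all_failed = group_advantages([0, 0, 0, 0])
```

The student then generates its own rollouts on the BCQ and NCQ prompts, so the teacher's text only ever appears in the prompt and the gradient is still computed on the student's own samples.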

https://www.linkedin.com/feed/update/urn:li:activity:7472847184627986432/


