arxiv:2606.17682

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Published on Jun 16 · Submitted by Chen chao on Jun 18

Abstract

A framework automates environment redesign in reinforcement learning for large language models by having the policy analyze failures and suggest configuration changes, achieving superior performance over larger proprietary models and fixed-environment baselines.

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
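
The stage-by-stage loop described in the abstract might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the configuration fields (`grid_size`, `num_agents`, `hole_ratio`), the `StageContext` structure, and the functions `toy_train_and_evaluate` and `toy_engineer` are all hypothetical stand-ins. In the actual framework the engineer is the current policy checkpoint prompted with the structured context, and the MAPF-FrozenLake configuration space is whatever its generator exposes.

```python
from dataclasses import dataclass, fields, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical knobs; the real testbed's dimensions are not listed here.
    grid_size: int = 4
    num_agents: int = 2
    hole_ratio: float = 0.1


@dataclass
class StageContext:
    # Structured summary handed to the environment engineer.
    success_rate: float
    failure_cases: List[str]
    env_stats: Dict[str, float]


Engineer = Callable[[EnvConfig, StageContext], Dict[str, object]]
Trainer = Callable[[EnvConfig], StageContext]


def apply_proposal(config: EnvConfig, proposal: Dict[str, object]) -> EnvConfig:
    """Apply only recognized fields; everything not mentioned is preserved."""
    known = {f.name for f in fields(config)}
    changes = {k: v for k, v in proposal.items() if k in known}
    return replace(config, **changes)


def run_stages(config: EnvConfig, train_and_evaluate: Trainer,
               engineer: Engineer, num_stages: int) -> List[EnvConfig]:
    """Alternate between training on a config and redesigning it."""
    history = [config]
    for _ in range(num_stages):
        context = train_and_evaluate(config)   # RL stage + failure summary
        proposal = engineer(config, context)   # next-stage modifications
        config = apply_proposal(config, proposal)
        history.append(config)
    return history


def toy_train_and_evaluate(config: EnvConfig) -> StageContext:
    # Placeholder for an RL stage: pretend larger grids are harder.
    success = max(0.0, 1.0 - 0.1 * config.grid_size)
    failures = ["agent collision"] if config.num_agents > 1 and success < 0.7 else []
    return StageContext(success, failures, {"grid_size": float(config.grid_size)})


def toy_engineer(config: EnvConfig, context: StageContext) -> Dict[str, object]:
    # Stand-in for prompting the policy model; the rule itself is arbitrary.
    if context.failure_cases:
        return {"num_agents": max(1, config.num_agents - 1)}
    return {"grid_size": config.grid_size + 1}


if __name__ == "__main__":
    for stage, cfg in enumerate(run_stages(EnvConfig(), toy_train_and_evaluate,
                                           toy_engineer, num_stages=2)):
        print(stage, cfg)
```

Having the engineer return a partial proposal rather than a full configuration is one simple way to mirror the abstract's observation that successful updates preserve settings that already work: any field the proposal does not mention carries over unchanged.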

Community

🚀 From Trainee to Trainer.

No manual environment redesign.
No guessing where the model still struggles. 🤔

Let the model diagnose its own failures,
and design the next training environment by itself. 🧠

Like a learner who gradually becomes its own teacher,
the policy looks at where it fails, understands what still blocks learning,
and reshapes the RL environment for the next stage.

We propose LLM-as-Environment-Engineer,
a new framework where LLMs not only learn in environments,
but also learn to engineer the environments that teach them better. ✨

On multi-agent reasoning tasks, a Qwen3-4B model trained with our framework surpasses stronger baselines, including larger proprietary LLMs.

The trainee becomes the trainer.
And training becomes self-improving. 🚀


Cool paper - I liked the way "From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning" frames the problem without making it feel too abstract.

Curious if you think this would still work once the setup gets messier in the wild?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/ac722eed-9ab9-45b8-8d66-e550751729cd
