arxiv:2606.17682

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Published on Jun 16

Submitted by Chen chao on Jun 18

LARK Lab@HKUST (GZ)

Abstract

A framework automates environment redesign in reinforcement learning for large language models by having the policy analyze failures and suggest configuration changes, achieving superior performance over larger proprietary models and fixed-environment baselines.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
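As a rough illustration of the redesign step described above, the sketch below summarizes rollouts into a structured context, passes it to an environment engineer, and applies the proposed changes to the next-stage configuration. Everything named here is an assumption made for illustration, not the paper's implementation: the `EnvConfig` fields, the `BOUNDS` table, the episode record format, and the `engineer` callable (a stand-in for prompting the current policy checkpoint) are all hypothetical. Keys the engineer does not mention are left unchanged, in the spirit of preserving configurations that already work.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical knobs; the real MAPF-FrozenLake generator parameters
    # are not listed in the abstract.
    grid_size: int = 4
    num_agents: int = 2
    hole_density: float = 0.1


# Hypothetical legal range for each knob, used to clamp proposals.
BOUNDS = {"grid_size": (3, 10), "num_agents": (1, 6), "hole_density": (0.0, 0.5)}


def build_context(config, episodes):
    """Summarize policy behavior, failure cases, and the current config."""
    failures = [e for e in episodes if not e["success"]]
    successes = len(episodes) - len(failures)
    return {
        "config": asdict(config),
        "success_rate": successes / len(episodes) if episodes else 0.0,
        "failure_reasons": dict(Counter(e["failure_reason"] for e in failures)),
    }


def apply_proposal(config, raw):
    """Merge a JSON proposal into the config, keeping unmentioned keys as-is."""
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError:
        return config  # unparseable output: keep the current environment
    if not isinstance(proposal, dict):
        return config
    updates = {}
    for key, (lo, hi) in BOUNDS.items():
        if key not in proposal:
            continue
        kind = type(getattr(config, key))  # int or float, per field
        try:
            updates[key] = min(max(kind(proposal[key]), lo), hi)
        except (TypeError, ValueError):
            continue  # skip values that cannot be coerced
    return replace(config, **updates)


def next_stage_config(config, episodes, engineer):
    """One redesign step: context -> engineer proposal -> validated config."""
    context = build_context(config, episodes)
    return apply_proposal(config, engineer(json.dumps(context)))
```

In this sketch, `engineer` is any function that takes the serialized context and returns a JSON string. Clamping and ignoring unknown keys is a defensive choice for handling free-form model output, not something the abstract specifies.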

Community

FYYDCC

Paper submitter, about 9 hours ago (edited about 8 hours ago)

🚀 From Trainee to Trainer.

No manual environment redesign.
No guessing where the model still struggles. 🤔

Let the model diagnose its own failures,
and design the next training environment by itself. 🧠

Like a learner who gradually becomes its own teacher,
the policy looks at where it fails, understands what still blocks learning,
and reshapes the RL environment for the next stage. 🔁

We propose LLM-as-Environment-Engineer,
a new framework where LLMs not only learn in environments,
but also learn to engineer the environments that teach them better. 🌍✨

On multi-agent reasoning tasks, a Qwen3-4B model trained with our framework surpasses fixed-environment training baselines as well as larger proprietary LLMs.

The trainee becomes the trainer.
And training becomes self-improving. 🚀

noahml

about 1 hour ago

Cool paper - I liked the way "From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning" frames the problem without making it feel too abstract.

Curious if you think this would still work once the setup gets messier in the wild?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/ac722eed-9ab9-45b8-8d66-e550751729cd

