Abstract
PaW is a co-training framework that combines policy learning and world modeling using on-policy reinforcement learning rollouts to improve language agent training without additional computational overhead.
Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
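The observation that each rollout transition already pairs an action with its resulting next observation can be illustrated with a small sketch. This is not the authors' implementation: the `Transition` and `WMExample` types, the prompt template, the `action_token_probs` field, and the `keep_fraction` parameter are all hypothetical names introduced here. In particular, the abstract does not say how action entropy is used for selection, so keeping the highest-entropy steps is an assumption.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Transition:
    observation: str        # what the agent saw before acting
    action: str             # text action sampled from the policy
    next_observation: str   # what the environment returned after the action
    # Per-token probability distributions for the action (hypothetical field).
    action_token_probs: Sequence[Sequence[float]]


@dataclass
class WMExample:
    prompt: str   # observation and action, used as the model input
    target: str   # next observation, used as the prediction target


def mean_action_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over the tokens of one action."""
    if not token_probs:
        return 0.0
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0) for dist in token_probs
    ]
    return sum(entropies) / len(entropies)


def build_wm_examples(
    rollout: List[Transition], keep_fraction: float = 0.5
) -> List[WMExample]:
    """Turn rollout transitions into world-modeling examples.

    Assumption: steps with the highest action entropy are kept. The
    selection direction is a guess, not taken from the paper.
    """
    if not rollout:
        return []
    scored = [
        (mean_action_entropy(t.action_token_probs), i, t)
        for i, t in enumerate(rollout)
    ]
    # Highest entropy first; ties broken by original step index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    n_keep = max(1, math.ceil(keep_fraction * len(scored)))
    # Restore the original step order among the kept transitions.
    kept = sorted(scored[:n_keep], key=lambda item: item[1])
    return [
        WMExample(
            prompt=(
                f"Observation: {t.observation}\n"
                f"Action: {t.action}\n"
                "Next observation:"
            ),
            target=t.next_observation,
        )
        for _, _, t in kept
    ]
```

In a co-training setup of the kind the abstract describes, examples like these would feed an auxiliary next-observation prediction loss on the same policy model, alongside the RL objective. No extra environment interaction is needed because the data comes from rollouts that RL already collects.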
Community
Standard RL rewards LLM agents for good actions, but ignores what those actions change. Our proposed PaW turns every RL rollout into world-modeling supervision, helping agents predict next observations and act more reliably in long-horizon tasks, without adding any deployment cost.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GAGPO: Generalized Advantage Grouped Policy Optimization (2026)
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction (2026)
- Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy (2026)
- Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning (2026)
- GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering (2026)
- AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning (2026)
- ECHO: Terminal Agents Learn World Models for Free (2026)
Made an audio walkthrough of this paper for anyone who wants to skim it on the go:
https://researchpod.app/episode/c41f5ec6-2784-484c-acc6-ab5a9e759cad
Generated automatically by ResearchPod. Happy to take feedback from the authors.