arxiv:2604.08706

Efficient RL Training for LLMs with Experience Replay

Published on Apr 9 · Submitted by Vivien Cabannes on Apr 14
Abstract

Experience replay techniques for large language model post-training balance staleness-induced variance, sample diversity, and computational cost while maintaining performance and policy entropy.

AI-generated summary

While experience replay (the practice of storing rollouts and reusing them multiple times during training) is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading final model performance (and in some cases even improving it), while preserving policy entropy.

Community

Paper submitter

Experience replay can cut LLM RL training compute by up to ~40%, without hurting final accuracy and sometimes even improving it.

Experience replay (reusing past rollouts) is a staple of classical RL, but it is still underexplored in LLM post-training, where the default is to stay as on-policy as possible.
In modern LLM RL pipelines, rollout generation can be >80% of total GPU time. Reusing rollouts even a little can save a lot of compute.

We studied a minimal, easy-to-drop-in replay buffer for async RL:
  • inference workers continuously push trajectories into a FIFO buffer
  • trainers sample uniformly from the buffer (sampling does not remove items)
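The two bullets above can be sketched in a few lines of Python. This is a minimal illustration of the design (a bounded FIFO buffer where sampling never evicts), not the paper's actual implementation; the `ReplayBuffer` class, its `capacity` knob, and the trajectory format are all illustrative choices.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer for async RL (illustrative sketch).

    Inference workers call push(); trainers call sample(). Sampling is
    uniform and non-destructive, so each rollout can be reused across
    many gradient steps until it ages out of the FIFO.
    """

    def __init__(self, capacity: int):
        # deque(maxlen=...) gives FIFO eviction for free: once full,
        # appending a new trajectory drops the oldest one.
        self.buffer = deque(maxlen=capacity)

    def push(self, trajectory) -> None:
        # Called by inference workers as rollouts finish.
        self.buffer.append(trajectory)

    def sample(self, batch_size: int) -> list:
        # Uniform sampling with replacement; items stay in the buffer.
        return random.choices(self.buffer, k=batch_size)
```

Buffer capacity and the trainer's sampling rate together set the effective replay ratio (how many times each rollout is reused on average).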

Main result: replay can slightly hurt performance per gradient step, but improves performance per unit of compute.
On MATH with Qwen2.5-7B, a well-chosen buffer reaches the same accuracy with up to ~40% less compute.

We also see a “slow-but-stable” effect: larger buffers learn more slowly, but training becomes more stable and can sometimes reach higher peak accuracy.
Replay can also help preserve output diversity → better pass@k for k>1.

Intuition: replay changes the effective training distribution. Mixing in older samples makes that distribution more diverse over time than purely on-policy training, which helps stabilize learning.

We also explored extensions beyond uniform replay:

  • alternative losses beyond GRPO
  • alternative sampling (e.g., biasing toward positive/correct trajectories)
Early results look promising.
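As one example of alternative sampling, biasing toward positive trajectories can be sketched as reward-weighted sampling. This is a hypothetical variant, not the paper's exact scheme: the `pos_weight` knob and the `(prompt, response, reward)` trajectory format are assumptions for illustration.

```python
import random


def sample_biased(buffer, batch_size, pos_weight=3.0):
    """Sample with a bias toward positive (reward > 0) trajectories.

    Hypothetical sketch: each trajectory is assumed to be a
    (prompt, response, reward) tuple, and positive trajectories are
    drawn pos_weight times more often than the rest.
    """
    weights = [pos_weight if reward > 0 else 1.0
               for (_prompt, _response, reward) in buffer]
    # Weighted sampling with replacement; the buffer is untouched.
    return random.choices(buffer, weights=weights, k=batch_size)
```

Setting `pos_weight=1.0` recovers uniform replay, so the bias can be tuned or annealed without changing the training loop.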

Theory: SGD with replay can converge faster as a function of compute by optimizing the trade-off between:

  • expensive rollout generation
  • staleness-induced variance
  • sample correlations / diversity
The theory connects practical knobs (buffer size and replay ratio) to those costs.
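The compute side of that trade-off can be illustrated with a toy cost model (my own sketch, not the paper's analysis). Assume a fresh batch of rollouts costs `c_gen` and each gradient step costs `c_train`; if each fresh batch is reused for `replay_ratio` gradient steps, generation cost is amortized across the reuses, at the price of more staleness.

```python
def compute_per_step(c_gen: float, c_train: float, replay_ratio: float) -> float:
    """Toy cost model: amortized compute per gradient step.

    replay_ratio = average number of gradient steps each fresh batch of
    rollouts is reused for. replay_ratio == 1 is strict on-policy training.
    The staleness-variance penalty that grows with replay_ratio is not
    modeled here.
    """
    return c_gen / replay_ratio + c_train
```

For example, if generation is 80% of total time (say `c_gen = 4`, `c_train = 1`), strict on-policy training costs 5.0 per step, while reusing each batch twice costs 3.0, a 40% reduction. These numbers are illustrative, not the paper's measurements.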


