arxiv:2604.08706

Efficient RL Training for LLMs with Experience Replay

Published on Apr 9 · Submitted by Vivien Cabannes on Apr 14
Abstract

Experience replay techniques for large language model post-training balance staleness-induced variance, sample diversity, and computational cost while maintaining performance and policy entropy.

AI-generated summary

While experience replay (the practice of storing rollouts and reusing them multiple times during training) is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading final model performance (and in some cases even improving it), while preserving policy entropy.

Community

Paper submitter

Experience replay can cut LLM RL training compute by up to ~40%, without hurting final accuracy and sometimes even improving it.

Experience replay (reusing past rollouts) is a staple of classical RL, but it is still underexplored in LLM post-training, where the default is to stay as on-policy as possible.
In modern LLM RL pipelines, rollout generation can be >80% of total GPU time. Reusing rollouts even a little can save a lot of compute.

We studied a minimal, easy-to-drop-in replay buffer for async RL:
  • inference workers continuously push trajectories into a FIFO buffer
  • trainers sample uniformly from the buffer (sampling does not remove items)
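The two bullets above can be sketched in a few lines of Python. This is a minimal illustration of the design (a bounded FIFO buffer where sampling never evicts), not the paper's actual implementation; the `ReplayBuffer` class, its `capacity` knob, and the trajectory format are all illustrative choices.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer for async RL (illustrative sketch).

    Inference workers call push(); trainers call sample(). Sampling is
    uniform and non-destructive, so each rollout can be reused across
    many gradient steps until it ages out of the FIFO.
    """

    def __init__(self, capacity: int):
        # deque(maxlen=...) gives FIFO eviction for free: once full,
        # appending a new trajectory drops the oldest one.
        self.buffer = deque(maxlen=capacity)

    def push(self, trajectory) -> None:
        # Called by inference workers as rollouts finish.
        self.buffer.append(trajectory)

    def sample(self, batch_size: int) -> list:
        # Uniform sampling with replacement; items stay in the buffer.
        return random.choices(self.buffer, k=batch_size)
```

Buffer capacity and the trainer's sampling rate together set the effective replay ratio (how many times each rollout is reused on average).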

Main result: replay can slightly hurt performance per gradient step, but improves performance per unit of compute.
On MATH with Qwen2.5-7B, a well-chosen buffer reaches the same accuracy with up to ~40% less compute.

We also see a “slow-but-stable” effect: larger buffers learn more slowly, but training becomes more stable and can sometimes reach higher peak accuracy.
Replay can also help preserve output diversity → better pass@k for k>1.

Intuition: replay changes the effective training distribution. Mixing in older samples makes that distribution more diverse over time than purely on-policy training, which helps stabilize learning.

We also explored extensions beyond uniform replay:

  • alternative losses beyond GRPO
  • alternative sampling (e.g., biasing toward positive/correct trajectories)
Early results look promising.
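As one example of alternative sampling, biasing toward positive trajectories can be sketched as reward-weighted sampling. This is a hypothetical variant, not the paper's exact scheme: the `pos_weight` knob and the `(prompt, response, reward)` trajectory format are assumptions for illustration.

```python
import random


def sample_biased(buffer, batch_size, pos_weight=3.0):
    """Sample with a bias toward positive (reward > 0) trajectories.

    Hypothetical sketch: each trajectory is assumed to be a
    (prompt, response, reward) tuple, and positive trajectories are
    drawn pos_weight times more often than the rest.
    """
    weights = [pos_weight if reward > 0 else 1.0
               for (_prompt, _response, reward) in buffer]
    # Weighted sampling with replacement; the buffer is untouched.
    return random.choices(buffer, weights=weights, k=batch_size)
```

Setting `pos_weight=1.0` recovers uniform replay, so the bias can be tuned or annealed without changing the training loop.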

Theory: SGD with replay can converge faster as a function of compute by optimizing the trade-off between:

  • expensive rollout generation
  • staleness-induced variance
  • sample correlations / diversity
The theory connects practical knobs (buffer size and replay ratio) to those costs.
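The compute side of that trade-off can be illustrated with a toy cost model (my own sketch, not the paper's analysis). Assume a fresh batch of rollouts costs `c_gen` and each gradient step costs `c_train`; if each fresh batch is reused for `replay_ratio` gradient steps, generation cost is amortized across the reuses, at the price of more staleness.

```python
def compute_per_step(c_gen: float, c_train: float, replay_ratio: float) -> float:
    """Toy cost model: amortized compute per gradient step.

    replay_ratio = average number of gradient steps each fresh batch of
    rollouts is reused for. replay_ratio == 1 is strict on-policy training.
    The staleness-variance penalty that grows with replay_ratio is not
    modeled here.
    """
    return c_gen / replay_ratio + c_train
```

For example, if generation is 80% of total time (say `c_gen = 4`, `c_train = 1`), strict on-policy training costs 5.0 per step, while reusing each batch twice costs 3.0, a 40% reduction. These numbers are illustrative, not the paper's measurements.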


