Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
Abstract
RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to better align few-step image generators with human preferences.
Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
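The decomposition claimed in the abstract can be sketched in a few lines. This is a minimal derivation under assumed notation, not the paper's exact formulation: $p_\theta$ is the generator distribution, $p_T$ the teacher distribution, $r$ the reward, $\beta > 0$ a temperature, and $Z$ the normalizing constant. None of these symbols appear in the abstract, and the paper may weight or parameterize the terms differently.

```latex
% Assumed form of the reward-tilted teacher (notation is illustrative):
p^{*}(x) = \frac{1}{Z}\, p_T(x)\, \exp\!\big(r(x)/\beta\big),
\qquad
Z = \mathbb{E}_{x \sim p_T}\!\big[\exp\!\big(r(x)/\beta\big)\big].

% Expanding the KL divergence from the generator to the tilted teacher:
\begin{aligned}
D_{\mathrm{KL}}\big(p_\theta \,\|\, p^{*}\big)
  &= \mathbb{E}_{x \sim p_\theta}\!\big[\log p_\theta(x) - \log p_T(x) - r(x)/\beta + \log Z\big] \\
  &= \underbrace{D_{\mathrm{KL}}\big(p_\theta \,\|\, p_T\big)}_{\text{distribution matching}}
   \;-\; \underbrace{\tfrac{1}{\beta}\, \mathbb{E}_{x \sim p_\theta}\!\big[r(x)\big]}_{\text{reward maximization}}
   \;+\; \log Z .
\end{aligned}
```

Since $\log Z$ does not depend on $\theta$, minimizing the left-hand side amounts to minimizing the distribution matching term while maximizing the expected reward, with $\beta$ controlling the trade-off.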
Community
We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a
two-stage framework that unifies distribution-matching distillation with
reward-guided RL for few-step flow generators. Minimizing the KL divergence to
a reward-tilted teacher distribution decomposes naturally into a
distribution-matching term and a reward-maximization term, instantiated
as Ambient-Consistent DMD (AC-DMD) for the cold start and a hybrid policy
gradient (SubGRPO + final-step reward back-propagation) for the RL stage.
With 4 NFE, RTDMD reaches new SOTA on SD3-M / SD3.5-M / FLUX.2 4B; the
distilled FLUX.2 4B even beats the full FLUX.2 9B teacher (50 NFE) on most
rewards.
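The KL decomposition described above can be checked numerically on a toy discrete distribution. The sketch below is illustrative only: the three-outcome distributions, the reward values, the temperature `beta`, and the helper names `kl` and `tilt` are all assumptions made for this example and are not taken from the paper or its code.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tilt(teacher, rewards, beta):
    """Return the reward-tilted distribution p* ∝ p_T * exp(r / beta) and its normalizer Z."""
    weights = [t * math.exp(r / beta) for t, r in zip(teacher, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

# Toy values, chosen arbitrarily for illustration.
teacher = [0.5, 0.3, 0.2]    # stand-in for the teacher distribution
rewards = [0.0, 1.0, 2.0]    # stand-in for per-outcome rewards
generator = [0.2, 0.3, 0.5]  # stand-in for the few-step generator distribution
beta = 0.5                   # assumed temperature

tilted, z = tilt(teacher, rewards, beta)

# Left-hand side: KL from the generator to the reward-tilted teacher.
lhs = kl(generator, tilted)

# Right-hand side: distribution matching term minus scaled expected reward, plus log Z.
dm_term = kl(generator, teacher)
reward_term = sum(g * r for g, r in zip(generator, rewards)) / beta
rhs = dm_term - reward_term + math.log(z)

# The two sides should agree up to floating-point error.
print(abs(lhs - rhs) < 1e-9)
```

Tilting moves probability mass toward high-reward outcomes (here the third outcome), which is why matching the tilted teacher both keeps the generator close to the teacher and raises its expected reward.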
The core idea of tilting the teacher with a reward and then splitting the KL into a distribution-matching term and a reward-maximization term is clean and practically appealing. Stage I's ambient-consistent distribution matching and Stage II's hybrid gradient with step-subset GRPO look like they stabilize training in a tight 4-step regime. The arxivlens breakdown helped me parse the method details, especially how the consistency regularizer keeps the fake score aligned as the generator shifts (https://arxivlens.com/PaperView/Details/reinforcing-few-step-generators-via-reward-tilted-distribution-matching-8719-f7e85876). My one question is about ablations on the AC-DMD subintervals: how sensitive is performance to the number of subintervals, and did you try adaptive or learned partitioning rather than fixed blocks? This seems like a solid blueprint for fast, preference-aligned generation, with a practical angle for real-world deployment.
Hi, we directly adopt 4 subintervals for a 4-step generator, which is a natural choice. Other trials will be left for future research. Thanks for your appreciation.
Models citing this paper 2
Harahan/SD35M-RTDMD