Flash-WAM: Modality-Aware Distillation for World Action Models
Abstract
Flash-WAM introduces a modality-aware step-distillation framework for world-action models that achieves real-time inference by adapting consistency functions to different noise regimes in video and action streams.
World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% RoboTwin 2.0, 95.7% LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.
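To make the "consistency boundary condition" concrete, here is a minimal Python sketch of the generic consistency-model form f(x, t) = c_skip(t)·x + c_out(t)·F(x, t), which must reduce to the identity at the clean-data end of the schedule. Everything in it is an illustrative assumption rather than the parametrization used in Flash-WAM: the convention that t = 0 is clean data, the coefficient pairs `linear_coeffs` and `vp_coeffs`, and all function names are hypothetical stand-ins for the two families the abstract mentions. The last helper only checks the reported latency ratio.

```python
import math


def linear_coeffs(t):
    """Hypothetical linear pair: c_skip = 1 - t, c_out = t (assumed, not from the paper)."""
    return 1.0 - t, t


def vp_coeffs(t):
    """Hypothetical variance-preserving pair: c_skip^2 + c_out^2 = 1 (assumed)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def consistency_fn(x, t, network_out, coeffs):
    """Generic consistency function f(x, t) = c_skip(t) * x + c_out(t) * F(x, t).

    `network_out` stands in for the network prediction F(x, t); here it is
    just a list of floats with the same length as `x`.
    """
    c_skip, c_out = coeffs(t)
    return [c_skip * xi + c_out * ni for xi, ni in zip(x, network_out)]


def speedup(baseline_seconds, distilled_seconds):
    """Ratio of baseline latency to distilled latency."""
    return baseline_seconds / distilled_seconds


if __name__ == "__main__":
    x = [1.0, -2.0]
    arbitrary_net_out = [5.0, 5.0]  # the boundary condition must hold for any F
    # At t = 0 (assumed clean-data end) both pairs return x unchanged.
    print(consistency_fn(x, 0.0, arbitrary_net_out, linear_coeffs))
    print(consistency_fn(x, 0.0, arbitrary_net_out, vp_coeffs))
    # Latency figures from the abstract: 8.1 s versus 348 ms per chunk.
    print(round(speedup(8.1, 0.348), 1))
```

Both pairs satisfy c_skip(0) = 1 and c_out(0) = 0, so the boundary condition holds regardless of what the network outputs; they differ in how the two coefficients trade off as t grows, which is the degree of freedom the abstract describes choosing per modality.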
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising (2026)
- DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving (2026)
- MotuBrain: An Advanced World Action Model for Robot Control (2026)
- NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models (2026)
- Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models (2026)
- SANTS: A State-Adaptive Scheduler for World Action Models (2026)
- RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO (2026)