RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
Yanzuo Lu · Ronglai Zuo · Jiankang Deng — Imperial College London
Project page: https://yanzuo.lu/raven
Overview
RAVEN is a causal autoregressive text-to-video generation model built on Wan2.1-T2V-1.3B. It is designed for real-time streaming video generation by extrapolating future video chunks from previously generated content.
The release contains the RAVEN checkpoint plus three interchangeable CM-GRPO variants:
| File | Description |
|---|---|
| `raven_model.pt` | Full RAVEN backbone for causal autoregressive text-to-video generation. |
| `cmgrpo_raven_lora.safetensors` | CM-GRPO LoRA adapter only. Load `raven_model.pt` as the base weight and this file through the LoRA path. |
| `cmgrpo_raven_full.pt` | RAVEN base and CM-GRPO LoRA adapter packed into one PEFT-wrapped state dict. Load this file through the LoRA path without a separate base weight. |
| `cmgrpo_raven_merge.pt` | Full CM-GRPO backbone with the adapter already merged into RAVEN. Load this file as the base weight, with no LoRA block. |
RAVEN trains a causal video generator with a training-time test framework that repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This aligns the model's training attention pattern with inference-time autoregressive extrapolation, so that downstream chunk losses supervise the historical representations used for future predictions.
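The repacking idea can be illustrated with a minimal sketch. All names here are illustrative, not the actual RAVEN API; we assume chunks of `chunk_size=3` latent frames and a single noise scale `sigma` for the noisy denoising states.

```python
import torch

def repack_rollout(latents, chunk_size=3, sigma=0.5, generator=None):
    """Split a latent video [T, C, H, W] into chunks and build the interleaved
    training sequence: each chunk after the first contributes a noised copy
    (the denoising state) followed by its clean version (the historical
    endpoint that future chunks attend to)."""
    T = latents.shape[0]
    chunks = [latents[i:i + chunk_size] for i in range(0, T, chunk_size)]
    sequence = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # noisy denoising state for the current chunk
            noise = torch.randn(chunk.shape, generator=generator)
            sequence.append(chunk + sigma * noise)
        # clean historical endpoint kept in the causal context
        sequence.append(chunk)
    return sequence
```

With 9 latent frames and `chunk_size=3` this yields five entries: one clean first chunk, then a (noisy, clean) pair for each of the two following chunks, matching the causal extrapolation pattern used at inference time.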
We also release CM-GRPO weights. CM-GRPO formulates a consistency-model sampling step as a conditional Gaussian transition and applies online Group Relative Policy Optimization directly to this kernel.
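The two ingredients of CM-GRPO can be sketched as follows, under stated assumptions: the consistency-sampler step is treated as a conditional Gaussian with a scalar standard deviation `sigma`, and advantages are normalized within each group of rollouts sharing a prompt. Function names are hypothetical, not part of the released code.

```python
import torch
from torch.distributions import Normal

def cm_step_logprob(x_next, mean, sigma):
    """Log-density of one consistency-sampler step, modeled as a conditional
    Gaussian transition x_next ~ N(mean, sigma^2 I); summed over all
    non-batch dimensions to give one log-prob per sample."""
    return Normal(mean, sigma).log_prob(x_next).flatten(1).sum(-1)

def grpo_advantages(rewards, eps=1e-6):
    """Group Relative Policy Optimization advantages: normalize each reward
    by the mean and std of its group (one row per prompt group)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

The per-step log-probs would enter a standard policy-gradient (ratio-clipped) objective weighted by these advantages; the Gaussian view is what makes the consistency-model step amenable to that machinery.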
Model details
- Base architecture: Wan2.1-T2V-1.3B DiT
- Task: text-to-video generation
- Generation mode: causal autoregressive video extrapolation
- Resolution used in released configs: 480 x 832
- Frames: 81
- FPS: 16
- Sampling steps: 4
- Sampler: consistency sampler
- Schedule: linear interpolation schedule, `v_lerp` prediction type
- Classifier-free guidance: not used; the `guidance_scale=3.0` value in the configs is a placeholder for interface compatibility
- Causal chunking: `chunk_size=3`, `independent_first_chunk=3`, `sink=0`, `window_size=null`
- VAE stride: `[4, 8, 8]`
- Latent channels: 16
- DiT config: dim 1536, 30 layers, 12 heads, FFN dim 8960, text length 512
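The latent tensor shape follows from the resolution, frame count, and VAE stride above. A small sketch, assuming the Wan-style causal-VAE convention that the first frame is kept and the remaining frames are strided in time (this convention is an assumption, not stated in the configs):

```python
def latent_shape(frames=81, height=480, width=832,
                 vae_stride=(4, 8, 8), latent_channels=16):
    """Compute the latent video shape (C, T, H, W) from pixel-space
    settings and the VAE stride [t, h, w]."""
    t = (frames - 1) // vae_stride[0] + 1  # causal VAE: first frame kept
    return (latent_channels, t,
            height // vae_stride[1], width // vae_stride[2])
```

With the released settings this gives a 16-channel latent of 21 frames at 60 × 104, which is what the 4-step consistency sampler operates on per chunk schedule.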
Usage
This repository only hosts the released model weights. Please use the RAVEN codebase for inference and evaluation:
```shell
git clone https://github.com/YanzuoLu/RAVEN.git
cd RAVEN
```
Set up the environment:
```shell
conda env create -f tools/environment.yaml
conda activate raven
bash tools/prepare_venv.sh
source venv/bin/activate
```
Download this model repository:
```shell
hf download mvp-lab/RAVEN --local-dir /path/to/RAVEN-weights
```
Then point the relevant config files to the downloaded checkpoints. RAVEN itself (`raven_model.pt`) is a single full backbone:

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/raven_model.pt"
}
```
CM-GRPO can be loaded in any of three equivalent forms:
Adapter only (`cmgrpo_raven_lora.safetensors`):

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/raven_model.pt",
    "lora": {
        "enabled": true,
        "weight": "/path/to/RAVEN-weights/cmgrpo_raven_lora.safetensors"
    }
}
```
Base + LoRA bundle (`cmgrpo_raven_full.pt`):

```jsonc
"backbone": {
    "lora": {
        "enabled": true,
        "weight": "/path/to/RAVEN-weights/cmgrpo_raven_full.pt"
    }
}
```
Merged backbone (`cmgrpo_raven_merge.pt`):

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/cmgrpo_raven_merge.pt"
}
```
The released CM-GRPO configs use the base + LoRA bundle form by default.
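The three forms are equivalent because merging a LoRA adapter into a base weight is a simple linear update. A minimal sketch of the standard merge rule (hypothetical helper, not the RAVEN loader; PEFT performs the equivalent operation internally):

```python
import torch

def merge_lora(base_weight, lora_A, lora_B, scale=1.0):
    """Fold a LoRA adapter into a base linear weight:
    W' = W + scale * (B @ A), where A is [rank, in] and B is [out, rank]."""
    return base_weight + scale * (lora_B @ lora_A)
```

This is why `cmgrpo_raven_merge.pt` needs no LoRA block at load time: the low-rank update has already been added into every adapted weight matrix.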
Reference configs:
configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc
configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc
configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc
configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc
Run qualitative generation:
```shell
bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc
bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc
```
Run VBench prompt-suite sampling:
```shell
bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc
bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc
```
Requirements
The released configs depend on the RAVEN codebase and the upstream Wan2.1-T2V-1.3B components, including:
- Wan2.1-T2V-1.3B diffusion backbone / DiT config
- Wan2.1 VAE
- UMT5-XXL tokenizer and text encoder
- Python 3.10
- CUDA 12.8
- PyTorch 2.11 + cu128
- flash-attention 2/3 and magi-attention, as built by `tools/prepare_venv.sh`
See the code repository README for full setup and evaluation instructions.
License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the LICENSE file in the code repository for details.
The upstream Wan2.1 components are subject to their own licenses and terms. Users are responsible for complying with all applicable licenses for the base model, code, data, and dependencies.
Citation
If you find this work useful, please cite RAVEN:

```bibtex
@article{lu2026raven,
    title   = {RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO},
    author  = {Lu, Yanzuo and Zuo, Ronglai and Deng, Jiankang},
    year    = 2026,
    journal = {arXiv preprint arXiv:2605.15190}
}
```
Model tree for mvp-lab/RAVEN
Base model: Wan-AI/Wan2.1-T2V-1.3B