yunyangge/OSP-Next · Hugging Face

Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Open-Sora Plan · Next Generation

A scalable sparse text-to-video diffusion model, introducing Skiparse-2D Attention, Sparse Sequence Parallelism (SSP), HiF8 quantization, and Mix-GRPO + LoRA RL post-training.

🧠 Model Summary

OSP-Next is a 14B-parameter text-to-video diffusion model built on top of the Wan 2.1 text encoder / VAE backbone, with four tightly co-designed contributions:

Component	What it is	Why it matters
🧩 Skiparse-2D Attention	Fixed-rule 2D sparse attention applied along H/W.	Approaches 3D full attention in quality, natively FlashAttention compatible.
🔗 Sparse Sequence Parallelism (SSP)	A parallel strategy natively co-designed with Skiparse-2D.	−75% inter-rank comm, per-block comm rounds 4 → 1.
🪶 HiF8 Quantization (NPU only)	Dynamic-precision 8-bit (exponent / mantissa allocation).	First joint 8-bit + sparse fine-tuning — up to 2.27× speedup on a single Ascend 950PR with only −0.4 pt on VBench.
🎯 Mix-GRPO + LoRA RL	RL post-training on top of the sparse model.	First RL pipeline for sparse video diffusion.
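
To make the "fixed-rule" idea concrete, here is a minimal sketch of one possible 2D skip pattern. It is an illustrative assumption, not the actual Skiparse-2D rule, which is defined in the paper and the code repo. Tokens on an H×W grid are split into k×k strided groups by `(h % k, w % k)`, and attention would run densely inside each group, which is why a pattern of this kind can reuse an unmodified dense attention kernel. The names `skip_groups_2d`, `attention_pairs` and the parameter `k` are hypothetical.

```python
from collections import defaultdict


def skip_groups_2d(height: int, width: int, k: int) -> dict[tuple[int, int], list[int]]:
    """Partition an H x W token grid into k*k strided groups.

    Hypothetical rule for illustration only: token (h, w) joins group
    (h % k, w % k). Each group is a regular sub-grid, so it can be gathered
    into one contiguous sequence and passed to a dense attention kernel.
    """
    groups = defaultdict(list)
    for h in range(height):
        for w in range(width):
            groups[(h % k, w % k)].append(h * width + w)  # flat token index
    return dict(groups)


def attention_pairs(groups: dict[tuple[int, int], list[int]]) -> int:
    """Number of query-key pairs when attention is restricted to each group."""
    return sum(len(tokens) ** 2 for tokens in groups.values())


groups = skip_groups_2d(height=4, width=6, k=2)
full_pairs = (4 * 6) ** 2              # dense attention over all 24 tokens
sparse_pairs = attention_pairs(groups)
print(len(groups), sparse_pairs / full_pairs)  # prints: 4 0.25
```

Under this assumed rule the number of query-key pairs drops by a factor of k², while every group remains an ordinary dense attention problem.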

📊 End-to-end speed-ups (vs. Wan 2.1 baseline, 5 s · 81-frame video)

Hardware	720P (padded)	768P (native)
⚡ NVIDIA H200 (BF16 · FA3 · `torch.compile`)	1.53× / 1.42× (1× / 8× GPU)	1.64× / 1.52×
🟣 Ascend 950PR (BF16 · SDPA)	1.27× (1× NPU)	1.76×
🪶 Ascend 950PR (HiF8 · 8-bit · SDPA)	1.69×	2.27×

🏆 OSP-Next reaches VBench total = 83.73% (Wan 2.1 baseline 83.69%); OSP-Next-HiF8 keeps 83.29% with only a 0.4 pt drop. Full benchmark tables, ablations and qualitative comparisons live in the paper.
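
A few derived figures follow directly from the numbers quoted above. The sketch below only recomputes ratios from this card. It assumes the two Ascend 950PR rows share the same baseline latency, so that dividing their speed-ups isolates the extra gain from HiF8. Variable names are ours.

```python
# Speed-ups vs. the Wan 2.1 baseline, copied from the table above (single NPU).
ascend_bf16 = {"720P": 1.27, "768P": 1.76}
ascend_hif8 = {"720P": 1.69, "768P": 2.27}

# Extra gain from HiF8 on top of the sparse BF16 model on the same hardware.
hif8_over_bf16 = {
    res: round(ascend_hif8[res] / ascend_bf16[res], 2) for res in ascend_bf16
}

# VBench totals quoted above, in percent.
vbench = {"Wan 2.1": 83.69, "OSP-Next": 83.73, "OSP-Next-HiF8": 83.29}
hif8_drop = round(vbench["OSP-Next"] - vbench["OSP-Next-HiF8"], 2)

print(hif8_over_bf16)  # {'720P': 1.33, '768P': 1.29}
print(hif8_drop)       # 0.44
```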

📦 What's in this repository

File / folder	Description
`OSP-Next-14B/`	OSP-Next 14B BF16 diffusion weights (FSDP `model.pt` + config)
`OSP-Next-HiF8-14B/`	HiF8-quantized 14B weights (NPU inference)
`config.json`	OSP-Next model architecture metadata

ℹ️ OSP-Next reuses Wan 2.1's T5 (UMT5-XXL) text encoder and WAN VAE verbatim. We do not re-host them — see Wan-AI/Wan2.1-T2V-14B for the upstream weights.

🚀 Quick Start

OSP-Next ships as a standalone training & inference repository rather than a pip-installable model class — the sparse attention / SSP comm / HiF8 kernels all live inside the project. The typical flow:

# 1. Clone the code repo
git clone https://github.com/PKU-YuanGroup/OSP-Next.git
cd OSP-Next
conda create -n ospnext python=3.10 -y && conda activate ospnext
pip install -e .

# 2a. Download OSP-Next weights from this Hugging Face repo
huggingface-cli download yunyangge/OSP-Next --local-dir ./checkpoints/osp_next_14b

# 2b. Download Wan 2.1's T5 text encoder and WAN VAE (the components we reuse)
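# (explicit filenames and --include go in separate calls: the CLI ignores
#  --include when filenames are also given)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
    models_t5_umt5-xxl-enc-bf16.pth \
    Wan2.1_VAE.pth \
    --local-dir ./checkpoints/Wan2.1-T2V-14B
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
    --include "google/umt5-xxl/*" \
    --local-dir ./checkpoints/Wan2.1-T2V-14B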

# 3. Point the inference config at the downloaded checkpoint paths
$EDITOR configs/infer/gpu/osp_14b.yaml

# 4. Run inference
bash scripts/infer/gpu/infer_osp_14b.sh
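
If you prefer to script steps 2a and 2b, the same downloads can be sketched in Python with `huggingface_hub.snapshot_download`. This is a minimal sketch under the assumption that the file names and local directories match the shell commands above. The helper names `wan_download_kwargs` and `download_all` are hypothetical and not part of the OSP-Next repo.

```python
def wan_download_kwargs(local_dir: str = "./checkpoints/Wan2.1-T2V-14B") -> dict:
    """Arguments for fetching only the Wan 2.1 files that OSP-Next reuses."""
    return {
        "repo_id": "Wan-AI/Wan2.1-T2V-14B",
        "local_dir": local_dir,
        "allow_patterns": [
            "models_t5_umt5-xxl-enc-bf16.pth",  # T5 (UMT5-XXL) encoder weights
            "Wan2.1_VAE.pth",                   # WAN VAE weights
            "google/umt5-xxl/*",                # tokenizer files
        ],
    }


def download_all() -> None:
    """Download the OSP-Next weights and the reused Wan 2.1 components."""
    # Imported lazily so the helper above can be inspected without the package.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="yunyangge/OSP-Next",
        local_dir="./checkpoints/osp_next_14b",
    )
    snapshot_download(**wan_download_kwargs())
```

Calling `download_all()` performs both downloads. `allow_patterns` restricts the Wan 2.1 snapshot to the three entries listed, so the Wan diffusion weights are not fetched.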

In the inference YAML you'll fill in:

model_config:
  pretrained_model_dir_or_checkpoint: "./checkpoints/osp_next_14b"
vae_config:
  vae_path: "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth"
text_encoder_config:
  checkpoint_path: "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth"
  text_tokenizer_path: "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/"
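
Before launching inference it can help to confirm that every configured path exists. The sketch below assumes the YAML has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`) and that it uses the four keys shown above. The helper name `missing_checkpoint_paths` is hypothetical.

```python
from pathlib import Path


def missing_checkpoint_paths(cfg: dict) -> list[str]:
    """Return the configured checkpoint paths that do not exist on disk."""
    wanted = [
        cfg["model_config"]["pretrained_model_dir_or_checkpoint"],
        cfg["vae_config"]["vae_path"],
        cfg["text_encoder_config"]["checkpoint_path"],
        cfg["text_encoder_config"]["text_tokenizer_path"],
    ]
    return [p for p in wanted if not Path(p).exists()]


# Same values as the YAML fragment above, written as a dict.
cfg = {
    "model_config": {
        "pretrained_model_dir_or_checkpoint": "./checkpoints/osp_next_14b",
    },
    "vae_config": {
        "vae_path": "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth",
    },
    "text_encoder_config": {
        "checkpoint_path": "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth",
        "text_tokenizer_path": "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/",
    },
}

for path in missing_checkpoint_paths(cfg):
    print(f"missing: {path}")
```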

🟣 On Ascend NPU? Follow the NPU setup in the code repo (CANN 8.5.0 + `pip install -e .[npu]` + source-build decord), then run `scripts/infer/npu/infer_osp_14b.sh` instead.

🐍 Programmatic loading

The diffusion model itself can also be loaded as a regular OSPNextModel:

import torch
from ospnext.modules.osp_next import OSPNextModel

model = OSPNextModel.from_pretrained("./checkpoints/osp_next_14b")
# Module.to() expects a torch.dtype object, not the string "bfloat16"
model = model.to("cuda", dtype=torch.bfloat16).eval()

For the full text-to-video pipeline (T5 encoding → diffusion → VAE decoding), use `ospnext.pipelines.t2v_pipeline.T2VPipeline`; see `infer/infer_osp.py` for a complete example.

🏋️ Training & RL Post-Training

OSP-Next supports both SFT (`train/train_osp.py`) and Mix-GRPO + LoRA RL post-training (`train/train_osp_RL.py`), both using the same FSDP2 + Sparse-SP backbone. Highlights of the RL pipeline:

LoRA-only updates on the frozen base model.
Mix-GRPO — mixed ODE/SDE flow-matching RL with a configurable SDE step count, KL penalty and group advantage clipping.
VideoAlign as the multi-axis reward model.
RL checkpoints store only the LoRA adapter (no base model duplication), plus an EMA-LoRA companion for inference. Merge the LoRA weights back into the base model with `merge_lora_weights.py` before running inference.
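
Two of the ideas in this list can be sketched generically. The code below is not taken from `train/train_osp_RL.py` or `merge_lora_weights.py`. It shows the standard GRPO-style group-relative advantage with clipping, and the standard LoRA merge W' = W + (alpha / rank) · B·A, under the assumption that the repo follows these common formulations. The function names, the default `clip` value and the nested-list layout are hypothetical, and the mixed ODE/SDE sampling and KL penalty are not modelled.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], clip: float = 5.0, eps: float = 1e-8) -> list[float]:
    """Group-relative advantages for the samples generated from one prompt.

    Each reward is normalised by the group's mean and standard deviation,
    then clipped to [-clip, clip]. The default clip value is a placeholder.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [max(-clip, min(clip, (r - mu) / (sigma + eps))) for r in rewards]


def merge_lora(weight, lora_a, lora_b, alpha: float, rank: int):
    """Fold a LoRA update into a frozen weight: W' = W + (alpha / rank) * B @ A.

    weight is (out, in), lora_b is (out, rank), lora_a is (rank, in),
    all given as nested lists to keep the sketch dependency-free.
    """
    scale = alpha / rank
    return [
        [
            w + scale * sum(lora_b[i][r] * lora_a[r][j] for r in range(rank))
            for j, w in enumerate(row)
        ]
        for i, row in enumerate(weight)
    ]


# Rewards for three videos sampled from the same prompt.
print(group_advantages([1.0, 2.0, 3.0]))

# 2x2 identity weight with a rank-1 adapter.
merged = merge_lora(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 2.0]],
    lora_b=[[0.5], [0.0]],
    alpha=2.0,
    rank=1,
)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After a merge of this kind the adapter is no longer needed at inference time, which is consistent with storing only the LoRA weights in RL checkpoints.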

Full training / RL recipes, config reference, sequence-parallel sizing tables and troubleshooting tips are in the code repository README.

🧪 Intended Use & Limitations

Intended uses

Research on sparse video diffusion: Skiparse-2D, Sparse Sequence Parallelism, joint sparse + 8-bit quantization, sparse-model RL.
Text-to-video generation for non-commercial creative / educational use.

Out of scope

Generating photo-realistic or identifiable likenesses of real individuals.
Generating illegal, deceptive, harmful, sexually explicit, or copyright-infringing content.

Known limitations

14B model: single-GPU inference needs an 80 GB-class accelerator (H100 / H200 / A100 80GB / Ascend 910B / 950PR). Multi-GPU is supported and recommended via the included SSP / FSDP2 launch scripts.
HiF8 weights are tuned for the Ascend NPU custom kernel; the BF16 model is the recommended starting point on NVIDIA GPUs.
Multi-NPU 950PR numbers are not yet reported — current 950PR results in the paper / model card are single-NPU only.
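
As a rough sanity check on the memory requirement, the sketch below estimates the footprint of the diffusion weights alone. It is back-of-the-envelope arithmetic using the 14B parameter count from the model summary. Activations for an 81-frame video, the T5 text encoder and the VAE all add to this, and their sizes are not stated in this card, so they are left out. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory for the raw weights only, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


params = 14e9  # "14B" from the model summary

print(f"BF16 : {weight_memory_gib(params, 2):.1f} GiB")  # BF16 : 26.1 GiB
print(f"8-bit: {weight_memory_gib(params, 1):.1f} GiB")  # 8-bit: 13.0 GiB
```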

📚 Training Data

OSP-Next is trained on the same large-scale text-video corpus used by the Open-Sora-Plan lineage, processed with internal data filtering / re-captioning pipelines (see the paper for details). No personally identifiable information is intentionally included, and sensitive content is filtered prior to training to the best of our ability.

The RL post-training uses a text-only prompt corpus scored by VideoAlign.

📝 Citation

If you find OSP-Next useful in your research, please cite:

@article{ge2026ospnext,
  title        = {OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning},
  author       = {Ge, Yunyang and He, Xianyi and Zhang, Zezhong and Lin, Bin and Zhu, Bin and Cheng, Xinhua and Yuan, Li},
  journal      = {arXiv preprint arXiv:<ARXIV_ID>},
  year         = {2026},
}

This work builds on:

@article{wan2025wan,
  title={Wan: Open and advanced large-scale video generative models},
  author={Wan, Team and Wang, Ang and Ai, Baole and Wen, Bin and Mao, Chaojie and Xie, Chen-Wei and Chen, Di and Yu, Feiwu and Zhao, Haiming and Yang, Jianxiao and others},
  journal={arXiv preprint arXiv:2503.20314},
  year={2025}
}

@article{lin2024open,
  title={Open-sora plan: Open-source large video generation model},
  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
  journal={arXiv preprint arXiv:2412.00131},
  year={2024}
}

@article{li2025mixgrpo,
  title={Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde},
  author={Li, Junzhe and Cui, Yutao and Huang, Tao and Ma, Yinping and Fan, Chun and Cheng, Yiming and Yang, Miles and Zhong, Zhao and Bo, Liefeng},
  journal={arXiv preprint arXiv:2507.21802},
  year={2025}
}

🙏 Acknowledgements

🌊 Wan — WAN-VAE and T5 backbone.
🎬 Open-Sora-Plan — the ecosystem this project extends.
🏅 VideoAlign — reward model for RL post-training.
🎯 Mix-GRPO — mixed ODE-SDE flow-matching RL.

📄 License

Released under Apache 2.0 — see LICENSE.txt in the code repository.

The reused Wan 2.1 T5 / VAE weights are governed by their own licenses at Wan-AI/Wan2.1-T2V-14B.
