CapRiCorn-1K: A Comprehensive Benchmark for Video Captioning and Subject Referential Consistency Across Temporal Scales Paper • 2606.21949 • Published 13 days ago
VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding Paper • 2606.05259 • Published 30 days ago • 39
LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing Paper • 2606.06042 • Published 29 days ago • 24
DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory Paper • 2605.31336 • Published May 29 • 12
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models Paper • 2605.30263 • Published May 28 • 59
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV Paper • 2605.26244 • Published May 25 • 38
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning Paper • 2605.22012 • Published May 21 • 46