MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents Paper • 2601.12346 • Published Jan 18 • 52
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought Paper • 2506.04277 • Published Jun 4, 2025
VEU-Bench: Towards Comprehensive Understanding of Video Editing Paper • 2504.17828 • Published Apr 24, 2025
SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding Paper • 2603.16124 • Published Mar 17 • 3
ClawBench: Can AI Agents Complete Everyday Online Tasks? Paper • 2604.08523 • Published 8 days ago • 255
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding Paper • 2604.05015 • Published 11 days ago • 233
Watch Before You Answer: Learning from Visually Grounded Post-Training Paper • 2604.05117 • Published 11 days ago • 35
OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis Paper • 2603.20278 • Published about 1 month ago • 94
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents Paper • 2601.16973 • Published Jan 23 • 40
AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning Paper • 2601.18631 • Published Jan 26 • 48
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models Paper • 2601.20354 • Published Jan 28 • 112
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search Paper • 2509.25454 • Published Sep 29, 2025 • 148