Multimodal Reasoning
InfiR : Crafting Effective Small Language Models and Multimodal Small
Language Models in Reasoning
Paper
• 2502.11573
• Published • 9
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Paper
• 2502.02339
• Published • 23
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper
• 2502.11775
• Published • 9
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
• 2412.18319
• Published • 39
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
• 2411.10440
• Published • 129
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
• 2502.16033
• Published • 18
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Paper
• 2502.19634
• Published • 62
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper
• 2503.01785
• Published • 86
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale
Reinforcement Learning
Paper
• 2503.07365
• Published • 61
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large
Language Models
Paper
• 2503.06749
• Published • 31
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
• 2503.07536
• Published • 88
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
• 2412.17451
• Published • 42
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large
Language Models
Paper
• 2411.14432
• Published • 25
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with
Reinforcing Learning
Paper
• 2503.05379
• Published • 38
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
• 2503.10291
• Published • 36
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
• 2503.10615
• Published • 17
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper
• 2503.12605
• Published • 35
R1-VL: Learning to Reason with Multimodal Large Language Models via
Step-wise Group Relative Policy Optimization
Paper
• 2503.12937
• Published • 30
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Paper
• 2503.13444
• Published • 20
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published • 32
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning
via Iterative Self-Improvement
Paper
• 2503.17352
• Published • 24
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical
Problems
Paper
• 2503.16549
• Published • 15
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
• 2503.18013
• Published • 20
Video-R1: Reinforcing Video Reasoning in MLLMs
Paper
• 2503.21776
• Published • 79
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
• 2503.21620
• Published • 62
OThink-MR1: Stimulating multimodal generalized reasoning capabilities
via dynamic reinforcement learning
Paper
• 2503.16081
• Published • 28
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Paper
• 2504.00883
• Published • 67
Rethinking RL Scaling for Vision Language Models: A Transparent,
From-Scratch Framework and Comprehensive Evaluation Scheme
Paper
• 2504.02587
• Published • 32
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
(v1)
Paper
• 2504.03151
• Published • 15
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper
• 2504.05599
• Published • 86
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement
Fine-Tuning
Paper
• 2504.06958
• Published • 13
OmniCaptioner: One Captioner to Rule Them All
Paper
• 2504.07089
• Published • 20
Paper
• 2504.07491
• Published • 139
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published • 308
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models
with Reinforcement Learning
Paper
• 2504.08837
• Published • 44
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper
• 2504.09641
• Published • 16
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Paper
• 2504.09130
• Published • 12
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Paper
• 2504.13055
• Published • 19
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to
Deliberative Reasoners
Paper
• 2504.14239
• Published • 14
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Paper
• 2504.16656
• Published • 58
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement
Fine-Tuning
Paper
• 2505.03318
• Published • 94
Perception, Reason, Think, and Plan: A Survey on Large Multimodal
Reasoning Models
Paper
• 2505.04921
• Published • 187
X-Reasoner: Towards Generalizable Reasoning Across Modalities and
Domains
Paper
• 2505.03981
• Published • 15
Seed1.5-VL Technical Report
Paper
• 2505.07062
• Published • 157
Skywork-VL Reward: An Effective Reward Model for Multimodal
Understanding and Reasoning
Paper
• 2505.07263
• Published • 30
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Paper
• 2505.09439
• Published • 10
OpenThinkIMG: Learning to Think with Images via Visual Tool
Reinforcement Learning
Paper
• 2505.08617
• Published • 42
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Paper
• 2505.11049
• Published • 61
Visual Planning: Let's Think Only with Images
Paper
• 2505.11409
• Published • 57
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable
Step-Level Supervision
Paper
• 2505.13427
• Published • 26
VisionReasoner: Unified Visual Perception and Reasoning via
Reinforcement Learning
Paper
• 2505.12081
• Published • 18
Emerging Properties in Unified Multimodal Pretraining
Paper
• 2505.14683
• Published • 134
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via
Reinforcement Learning to Rank
Paper
• 2505.14460
• Published • 33
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with
Reinforcement Learning
Paper
• 2505.14677
• Published • 15
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement
Learning
Paper
• 2505.14231
• Published • 53
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with
Curiosity-Driven Reinforcement Learning
Paper
• 2505.15966
• Published • 53
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation
with Reinforcement Learning
Paper
• 2505.17022
• Published • 27
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Paper
• 2505.17018
• Published • 15
Think or Not? Selective Reasoning via Reinforcement Learning for
Vision-Language Models
Paper
• 2505.16854
• Published • 11
GRIT: Teaching MLLMs to Think with Images
Paper
• 2505.15879
• Published • 13
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based
on Speech and Audio Information
Paper
• 2505.13237
• Published • 1
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced
Multimodal Chain-of-Thought
Paper
• 2505.16192
• Published • 12
Training-Free Reasoning and Reflection in MLLMs
Paper
• 2505.16151
• Published • 9
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System
Collaboration
Paper
• 2505.20256
• Published • 19
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language
Model via Reinforcement Learning
Paper
• 2505.13426
• Published • 13
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Paper
• 2505.15804
• Published • 10
Jodi: Unification of Visual Generation and Understanding via Joint
Modeling
Paper
• 2505.19084
• Published • 20
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided
Iterative Policy Optimization
Paper
• 2505.19000
• Published • 42
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Paper
• 2505.21374
• Published • 28
Active-O3: Empowering Multimodal Large Language Models with Active
Perception via GRPO
Paper
• 2505.21457
• Published • 16
Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with
Minimalist Rule-Based RL
Paper
• 2505.17952
• Published • 20
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large
Language Models via Share-GRPO
Paper
• 2505.16673
• Published • 2
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Paper
• 2505.22651
• Published • 48
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper
• 2505.22453
• Published • 46
Advancing Multimodal Reasoning via Reinforcement Learning with Cold
Start
Paper
• 2505.22334
• Published • 36
Fostering Video Reasoning via Next-Event Prediction
Paper
• 2505.22457
• Published • 29
Thinking with Generated Images
Paper
• 2505.22525
• Published • 15
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial
Intelligence
Paper
• 2505.23747
• Published • 69
UniRL: Self-Improving Unified Multimodal Models via Supervised and
Reinforcement Learning
Paper
• 2505.23380
• Published • 22
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement
Learning
Paper
• 2505.22914
• Published • 39
Grounded Reinforcement Learning for Visual Reasoning
Paper
• 2505.23678
• Published • 2
More Thinking, Less Seeing? Assessing Amplified Hallucination in
Multimodal Reasoning Models
Paper
• 2505.21523
• Published • 13
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware
Reinforcement Learning
Paper
• 2506.01713
• Published • 48
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged
Reinforcement Learning
Paper
• 2506.04207
• Published • 48
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual
Counting for MLLMs
Paper
• 2506.05328
• Published • 21
Perceptual Decoupling for Scalable Multi-modal Reasoning via
Reward-Optimized Captioning
Paper
• 2506.04559
• Published • 2
Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error
Diagnosis in GUI Automation
Paper
• 2506.04614
• Published • 19
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Paper
• 2506.09790
• Published • 53
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware
Regressive GRPO
Paper
• 2506.07464
• Published • 14
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
Paper
• 2506.13654
• Published • 43
VGR: Visual Grounded Reasoning
Paper
• 2506.11991
• Published • 20
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
Paper
• 2506.16962
• Published • 10
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal
Reasoning
Paper
• 2506.16141
• Published • 27
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language
Models for Audio Generation and Editing
Paper
• 2506.21448
• Published • 9
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable
Reinforcement Learning
Paper
• 2507.01006
• Published • 253
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Paper
• 2506.21277
• Published • 14
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published • 132
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and
Future Frontiers
Paper
• 2506.23918
• Published • 90
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based
Reinforcement Learning
Paper
• 2507.05920
• Published • 12
Perception-Aware Policy Optimization for Multimodal Reasoning
Paper
• 2507.06448
• Published • 48
Skywork-R1V3 Technical Report
Paper
• 2507.06167
• Published • 74
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for
Visual Reasoning
Paper
• 2507.05255
• Published • 75
VisionThink: Smart and Efficient Vision Language Model via Reinforcement
Learning
Paper
• 2507.13348
• Published • 79
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Paper
• 2507.16746
• Published • 35
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent
Planning
Paper
• 2507.16815
• Published • 42
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking
Reasoning
Paper
• 2507.16814
• Published • 21
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced
Multimodal Reasoning
Paper
• 2507.22607
• Published • 47
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
• 2503.15558
• Published • 50
MolmoAct: Action Reasoning Models that can Reason in Space
Paper
• 2508.07917
• Published • 45
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual
Mathematical Reasoning
Paper
• 2508.10433
• Published • 146
HumanSense: From Multimodal Perception to Empathetic Context-Aware
Responses through Reasoning MLLMs
Paper
• 2508.10576
• Published • 8
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs
via Bi-Mode Annealing and Reinforce Learning
Paper
• 2508.21113
• Published • 110
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
Paper
• 2509.00676
• Published • 85
Planning with Reasoning using Vision Language World Model
Paper
• 2509.02722
• Published • 24
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
Paper
• 2509.06461
• Published • 20
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language
Models
Paper
• 2509.12132
• Published • 7
Multimodal Reasoning for Science: Technical Report and 1st Place
Solution to the ICML 2025 SeePhys Challenge
Paper
• 2509.06079
• Published • 6
BaseReward: A Strong Baseline for Multimodal Reward Model
Paper
• 2509.16127
• Published • 21
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
Paper
• 2509.15566
• Published • 14
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods,
Results, Discussion, and Outlook
Paper
• 2509.14142
• Published • 10
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and
Open Resources
Paper
• 2509.21268
• Published • 104
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified
Self-Play
Paper
• 2509.25541
• Published • 141
More Thought, Less Accuracy? On the Dual Nature of Reasoning in
Vision-Language Models
Paper
• 2509.25848
• Published • 81
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Paper
• 2510.01623
• Published • 12
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large
Multimodal Models
Paper
• 2510.05034
• Published • 51
UniVideo: Unified Understanding, Generation, and Editing for Videos
Paper
• 2510.08377
• Published • 81
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Paper
• 2510.06783
• Published • 13
Generative Universal Verifier as Multimodal Meta-Reasoner
Paper
• 2510.13804
• Published • 27
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal
Evidence
Paper
• 2510.20579
• Published • 56
Directional Reasoning Injection for Fine-Tuning MLLMs
Paper
• 2510.15050
• Published • 12
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement
Learning
Paper
• 2510.23473
• Published • 86
SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In
Text-only LLMs
Paper
• 2510.25092
• Published • 8
Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with
Free-Form Preferences
Paper
• 2510.23451
• Published • 28
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for
Visual Chain-of-Thought
Paper
• 2511.02779
• Published • 60
Thinking with Video: Video Generation as a Promising Multimodal
Reasoning Paradigm
Paper
• 2511.04570
• Published • 242
V-Thinker: Interactive Thinking with Images
Paper
• 2511.04460
• Published • 98
MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning
Paper
• 2511.06805
• Published • 13
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Paper
• 2511.13026
• Published • 26
VisPlay: Self-Evolving Vision-Language Models from Images
Paper
• 2511.15661
• Published • 44
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Paper
• 2511.16671
• Published • 16
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Paper
• 2511.18373
• Published • 7
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Paper
• 2511.16334
• Published • 96
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Paper
• 2511.15705
• Published • 98
SPHINX: A Synthetic Environment for Visual Perception and Reasoning
Paper
• 2511.20814
• Published • 2
Think Visually, Reason Textually: Vision-Language Synergy in ARC
Paper
• 2511.15703
• Published • 9
MIRA: Multimodal Iterative Reasoning Agent for Image Editing
Paper
• 2511.21087
• Published • 10
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Paper
• 2511.22625
• Published • 48
Geometrically-Constrained Agent for Spatial Reasoning
Paper
• 2511.22659
• Published • 41
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Paper
• 2511.22134
• Published • 22
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Paper
• 2512.02395
• Published • 51
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Paper
• 2511.22586
• Published • 7
Artemis: Structured Visual Reasoning for Perception Policy Learning
Paper
• 2512.01988
• Published • 2
CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
Paper
• 2511.19661
• Published • 3
OneThinker: All-in-one Reasoning Model for Image and Video
Paper
• 2512.03043
• Published • 34
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Paper
• 2512.03746
• Published • 17
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Paper
• 2512.05111
• Published • 50
Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning
Paper
• 2512.03667
• Published • 6
Rethinking Chain-of-Thought Reasoning for Videos
Paper
• 2512.09616
• Published • 19
VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
Paper
• 2512.06373
• Published • 9
Thinking with Images via Self-Calling Agent
Paper
• 2512.08511
• Published • 23
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
Paper
• 2512.17532
• Published • 68
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
Paper
• 2512.12623
• Published • 4
MMGR: Multi-Modal Generative Reasoning
Paper
• 2512.14691
• Published • 121
Latent Implicit Visual Reasoning
Paper
• 2512.21218
• Published • 70
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Paper
• 2512.22120
• Published • 15
InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
Paper
• 2512.18745
• Published • 12
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Paper
• 2601.05175
• Published • 36
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
Paper
• 2601.06803
• Published • 10
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Paper
• 2601.09536
• Published • 5
Urban Socio-Semantic Segmentation with Vision-Language Reasoning
Paper
• 2601.10477
• Published • 156
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Paper
• 2601.10129
• Published • 13
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
Paper
• 2601.13976
• Published • 22
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Paper
• 2601.14750
• Published • 17
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Paper
• 2601.15224
• Published • 12
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
Paper
• 2601.21821
• Published • 62
VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning
Paper
• 2601.22069
• Published • 7
Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling
Paper
• 2602.02453
• Published • 36
Training Data Efficiency in Multimodal Process Reward Models
Paper
• 2602.04145
• Published • 79
SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs
Paper
• 2602.06040
• Published • 10