Multimodal Reasoning

btjhjeon's Collections

updated Feb 7

InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning

Paper • 2502.11573 • Published Feb 17, 2025 • 9
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

Paper • 2502.02339 • Published Feb 4, 2025 • 23
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

Paper • 2502.11775 • Published Feb 17, 2025 • 9
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Paper • 2412.18319 • Published Dec 24, 2024 • 39
LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 129
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

Paper • 2502.16033 • Published Feb 22, 2025 • 18
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

Paper • 2502.19634 • Published Mar 19, 2025 • 62
Visual-RFT: Visual Reinforcement Fine-Tuning

Paper • 2503.01785 • Published Mar 3, 2025 • 86
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning

Paper • 2503.07365 • Published Mar 10, 2025 • 61
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Paper • 2503.06749 • Published Mar 9, 2025 • 31
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Paper • 2503.07536 • Published Mar 10, 2025 • 88
Diving into Self-Evolving Training for Multimodal Reasoning

Paper • 2412.17451 • Published Dec 23, 2024 • 42
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Paper • 2411.14432 • Published Nov 21, 2024 • 25
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning

Paper • 2503.05379 • Published Mar 7, 2025 • 38
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Paper • 2503.10291 • Published Mar 13, 2025 • 36
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Paper • 2503.10615 • Published Mar 13, 2025 • 17
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Paper • 2503.12605 • Published Mar 16, 2025 • 35
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Paper • 2503.12937 • Published Mar 17, 2025 • 30
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Paper • 2503.13444 • Published Mar 17, 2025 • 20
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Paper • 2503.12797 • Published Mar 17, 2025 • 32
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Paper • 2503.17352 • Published Mar 21, 2025 • 24
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

Paper • 2503.16549 • Published Mar 19, 2025 • 15
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Paper • 2503.18013 • Published Mar 23, 2025 • 20
Video-R1: Reinforcing Video Reasoning in MLLMs

Paper • 2503.21776 • Published Mar 27, 2025 • 79
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

Paper • 2503.21620 • Published Mar 27, 2025 • 62
OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning

Paper • 2503.16081 • Published Mar 20, 2025 • 28
Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Paper • 2504.00883 • Published Apr 1, 2025 • 67
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Paper • 2504.02587 • Published Apr 3, 2025 • 32
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

Paper • 2504.03151 • Published Apr 4, 2025 • 15
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

Paper • 2504.05599 • Published Apr 8, 2025 • 86
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Paper • 2504.06958 • Published Apr 9, 2025 • 13
OmniCaptioner: One Captioner to Rule Them All

Paper • 2504.07089 • Published Apr 9, 2025 • 20
Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10, 2025 • 139
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 308
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Paper • 2504.08837 • Published Apr 10, 2025 • 44
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Paper • 2504.09641 • Published Apr 13, 2025 • 16
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Paper • 2504.09130 • Published Apr 12, 2025 • 12
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

Paper • 2504.13055 • Published Apr 17, 2025 • 19
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Paper • 2504.14239 • Published Apr 19, 2025 • 14
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

Paper • 2504.16656 • Published Apr 23, 2025 • 58
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Paper • 2505.03318 • Published May 6, 2025 • 94
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Paper • 2505.04921 • Published May 8, 2025 • 187
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

Paper • 2505.03981 • Published May 6, 2025 • 15
Seed1.5-VL Technical Report

Paper • 2505.07062 • Published May 11, 2025 • 157
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

Paper • 2505.07263 • Published May 12, 2025 • 30
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Paper • 2505.09439 • Published May 14, 2025 • 10
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Paper • 2505.08617 • Published May 13, 2025 • 42
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

Paper • 2505.11049 • Published May 16, 2025 • 61
Visual Planning: Let's Think Only with Images

Paper • 2505.11409 • Published May 16, 2025 • 57
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

Paper • 2505.13427 • Published May 19, 2025 • 26
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

Paper • 2505.12081 • Published May 17, 2025 • 18
Emerging Properties in Unified Multimodal Pretraining

Paper • 2505.14683 • Published May 20, 2025 • 134
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

Paper • 2505.14460 • Published May 20, 2025 • 33
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

Paper • 2505.14677 • Published May 20, 2025 • 15
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

Paper • 2505.14231 • Published May 20, 2025 • 53
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Paper • 2505.15966 • Published May 21, 2025 • 53
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Paper • 2505.17022 • Published May 22, 2025 • 27
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Paper • 2505.17018 • Published May 22, 2025 • 15
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Paper • 2505.16854 • Published May 22, 2025 • 11
GRIT: Teaching MLLMs to Think with Images

Paper • 2505.15879 • Published May 21, 2025 • 13
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

Paper • 2505.13237 • Published May 19, 2025 • 1
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

Paper • 2505.16192 • Published May 22, 2025 • 12
Training-Free Reasoning and Reflection in MLLMs

Paper • 2505.16151 • Published May 22, 2025 • 9
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Paper • 2505.20256 • Published May 26, 2025 • 19
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Paper • 2505.13426 • Published May 19, 2025 • 13
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

Paper • 2505.15804 • Published May 21, 2025 • 10
Jodi: Unification of Visual Generation and Understanding via Joint Modeling

Paper • 2505.19084 • Published May 25, 2025 • 20
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

Paper • 2505.19000 • Published May 25, 2025 • 42
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Paper • 2505.21374 • Published May 27, 2025 • 28
Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Paper • 2505.21457 • Published May 27, 2025 • 16
Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Paper • 2505.17952 • Published May 23, 2025 • 20
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

Paper • 2505.16673 • Published May 22, 2025 • 2
Sherlock: Self-Correcting Reasoning in Vision-Language Models

Paper • 2505.22651 • Published May 28, 2025 • 48
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Paper • 2505.22453 • Published May 28, 2025 • 46
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Paper • 2505.22334 • Published May 28, 2025 • 36
Fostering Video Reasoning via Next-Event Prediction

Paper • 2505.22457 • Published May 28, 2025 • 29
Thinking with Generated Images

Paper • 2505.22525 • Published May 28, 2025 • 15
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Paper • 2505.23747 • Published May 29, 2025 • 69
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Paper • 2505.23380 • Published May 29, 2025 • 22
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

Paper • 2505.22914 • Published May 28, 2025 • 39
Grounded Reinforcement Learning for Visual Reasoning

Paper • 2505.23678 • Published May 29, 2025 • 2
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Paper • 2505.21523 • Published May 23, 2025 • 13
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Paper • 2506.01713 • Published Jun 2, 2025 • 48
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Paper • 2506.04207 • Published Jun 4, 2025 • 48
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Paper • 2506.05328 • Published Jun 5, 2025 • 21
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

Paper • 2506.04559 • Published Jun 5, 2025 • 2
Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

Paper • 2506.04614 • Published Jun 5, 2025 • 19
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Paper • 2506.09790 • Published Jun 11, 2025 • 53
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Paper • 2506.07464 • Published Jun 9, 2025 • 14
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

Paper • 2506.13654 • Published Jun 16, 2025 • 43
VGR: Visual Grounded Reasoning

Paper • 2506.11991 • Published Jun 13, 2025 • 20
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs

Paper • 2506.16962 • Published Jun 20, 2025 • 10
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

Paper • 2506.16141 • Published Jun 19, 2025 • 27
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Paper • 2506.21448 • Published Jun 26, 2025 • 9
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Paper • 2507.01006 • Published Jul 1, 2025 • 253
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

Paper • 2506.21277 • Published Jun 26, 2025 • 14
Kwai Keye-VL Technical Report

Paper • 2507.01949 • Published Jul 2, 2025 • 132
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Paper • 2506.23918 • Published Jun 30, 2025 • 90
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Paper • 2507.05920 • Published Jul 8, 2025 • 12
Perception-Aware Policy Optimization for Multimodal Reasoning

Paper • 2507.06448 • Published Jul 8, 2025 • 48
Skywork-R1V3 Technical Report

Paper • 2507.06167 • Published Jul 8, 2025 • 74
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Paper • 2507.05255 • Published Jul 7, 2025 • 75
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Paper • 2507.13348 • Published Jul 17, 2025 • 79
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Paper • 2507.16746 • Published Jul 22, 2025 • 35
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Paper • 2507.16815 • Published Jul 22, 2025 • 42
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

Paper • 2507.16814 • Published Jul 22, 2025 • 21
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Paper • 2507.22607 • Published Jul 30, 2025 • 47
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Paper • 2503.15558 • Published Mar 18, 2025 • 50
MolmoAct: Action Reasoning Models that can Reason in Space

Paper • 2508.07917 • Published Aug 11, 2025 • 45
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning

Paper • 2508.10433 • Published Aug 14, 2025 • 146
HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

Paper • 2508.10576 • Published Aug 14, 2025 • 8
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

Paper • 2508.21113 • Published Aug 28, 2025 • 110
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Paper • 2509.00676 • Published Aug 31, 2025 • 85
Planning with Reasoning using Vision Language World Model

Paper • 2509.02722 • Published Sep 2, 2025 • 24
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Paper • 2509.06461 • Published Sep 8, 2025 • 20
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Paper • 2509.12132 • Published Sep 15, 2025 • 7
Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

Paper • 2509.06079 • Published Sep 7, 2025 • 6
BaseReward: A Strong Baseline for Multimodal Reward Model

Paper • 2509.16127 • Published Sep 19, 2025 • 21
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Paper • 2509.15566 • Published Sep 19, 2025 • 14
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

Paper • 2509.14142 • Published Sep 17, 2025 • 10
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Paper • 2509.21268 • Published Sep 25, 2025 • 104
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Paper • 2509.25541 • Published Sep 29, 2025 • 141
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Paper • 2509.25848 • Published Sep 30, 2025 • 81
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Paper • 2510.01623 • Published Oct 2, 2025 • 12
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Paper • 2510.05034 • Published Oct 6, 2025 • 51
UniVideo: Unified Understanding, Generation, and Editing for Videos

Paper • 2510.08377 • Published Oct 9, 2025 • 81
TTRV: Test-Time Reinforcement Learning for Vision Language Models

Paper • 2510.06783 • Published Oct 8, 2025 • 13
Generative Universal Verifier as Multimodal Meta-Reasoner

Paper • 2510.13804 • Published Oct 15, 2025 • 27
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Paper • 2510.20579 • Published Oct 23, 2025 • 56
Directional Reasoning Injection for Fine-Tuning MLLMs

Paper • 2510.15050 • Published Oct 16, 2025 • 12
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Paper • 2510.23473 • Published Oct 27, 2025 • 86
SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

Paper • 2510.25092 • Published Oct 29, 2025 • 8
Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

Paper • 2510.23451 • Published Oct 27, 2025 • 28
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Paper • 2511.02779 • Published Nov 4, 2025 • 60
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Paper • 2511.04570 • Published Nov 6, 2025 • 242
V-Thinker: Interactive Thinking with Images

Paper • 2511.04460 • Published Nov 6, 2025 • 98
MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning

Paper • 2511.06805 • Published Nov 10, 2025 • 13
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Paper • 2511.13026 • Published Nov 17, 2025 • 26
VisPlay: Self-Evolving Vision-Language Models from Images

Paper • 2511.15661 • Published Nov 19, 2025 • 44
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Paper • 2511.16671 • Published Nov 20, 2025 • 16
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Paper • 2511.18373 • Published Nov 23, 2025 • 7
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Paper • 2511.16334 • Published Nov 20, 2025 • 96
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Paper • 2511.15705 • Published Nov 19, 2025 • 98
SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Paper • 2511.20814 • Published Nov 25, 2025 • 2
Think Visually, Reason Textually: Vision-Language Synergy in ARC

Paper • 2511.15703 • Published Nov 19, 2025 • 9
MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Paper • 2511.21087 • Published Nov 26, 2025 • 10
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

Paper • 2511.22625 • Published Nov 27, 2025 • 48
Geometrically-Constrained Agent for Spatial Reasoning

Paper • 2511.22659 • Published Nov 27, 2025 • 41
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

Paper • 2511.22134 • Published Nov 27, 2025 • 22
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Paper • 2512.02395 • Published Dec 2, 2025 • 51
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

Paper • 2511.22586 • Published Nov 27, 2025 • 7
Artemis: Structured Visual Reasoning for Perception Policy Learning

Paper • 2512.01988 • Published Dec 1, 2025 • 2
CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Paper • 2511.19661 • Published Nov 24, 2025 • 3
OneThinker: All-in-one Reasoning Model for Image and Video

Paper • 2512.03043 • Published Dec 2, 2025 • 34
Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Paper • 2512.03746 • Published Dec 3, 2025 • 17
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Paper • 2512.05111 • Published Dec 4, 2025 • 50
Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning

Paper • 2512.03667 • Published Dec 3, 2025 • 6
Rethinking Chain-of-Thought Reasoning for Videos

Paper • 2512.09616 • Published Dec 10, 2025 • 19
VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

Paper • 2512.06373 • Published Dec 6, 2025 • 9
Thinking with Images via Self-Calling Agent

Paper • 2512.08511 • Published Dec 9, 2025 • 23
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

Paper • 2512.17532 • Published Dec 19, 2025 • 68
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Paper • 2512.12623 • Published Dec 14, 2025 • 4
MMGR: Multi-Modal Generative Reasoning

Paper • 2512.14691 • Published Dec 16, 2025 • 121
Latent Implicit Visual Reasoning

Paper • 2512.21218 • Published Dec 24, 2025 • 70
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Paper • 2512.22120 • Published Dec 26, 2025 • 15
InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Paper • 2512.18745 • Published Dec 21, 2025 • 12
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Paper • 2601.05175 • Published Jan 8 • 36
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

Paper • 2601.06803 • Published Jan 11 • 10
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

Paper • 2601.09536 • Published Jan 14 • 5
Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Paper • 2601.10477 • Published Jan 15 • 156
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Paper • 2601.10129 • Published Jan 15 • 13
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Paper • 2601.13976 • Published Jan 20 • 22
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Paper • 2601.14750 • Published Jan 21 • 17
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Paper • 2601.15224 • Published Jan 21 • 12
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Paper • 2601.21821 • Published Jan 29 • 62
VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Paper • 2601.22069 • Published Jan 29 • 7
Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Paper • 2602.02453 • Published Feb 2 • 36
Training Data Efficiency in Multimodal Process Reward Models

Paper • 2602.04145 • Published Feb 4 • 79
SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Paper • 2602.06040 • Published Feb 5 • 10
