Papers
arxiv:2411.10440

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Published on Jul 21, 2025
· Submitted by
Guowei Xu
on Nov 17, 2024
#1 Paper of the day
Authors:
,
,

Abstract

LLaVA-CoT is a vision-language model that achieves improved reasoning performance through structured multistage processing and test-time scaling, outperforming larger models with a smaller training dataset.

AI-generated summary

Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.

Community

Paper author Paper submitter

In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning like GPT-o1. Our 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The key is training on structured data and a novel inference time scaling method—stage-level beam search

·

congrats, would be great to upload the model, here is the guide: https://huggingface.co/docs/hub/models-uploading

This comment has been hidden

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2411.10440
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2

Datasets citing this paper 5

Browse 5 datasets citing this paper

Spaces citing this paper 4

Collections including this paper 41