Abstract
LLaVA-CoT is a vision-language model that achieves improved reasoning performance through structured multistage processing and test-time scaling, outperforming larger models with a smaller training dataset.
Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question-answering sources and providing structured reasoning annotations. In addition, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.
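The four sequential stages named in the abstract can be pictured as a loop in which each stage sees the outputs of the stages before it. The following is a minimal sketch under stated assumptions, not the authors' implementation: the stage identifiers, `run_stages`, and `stub_generate` are hypothetical names introduced for illustration, and the stub stands in for a real VLM call.

```python
from typing import Callable, Dict, List

# Stage order follows the abstract; the identifier strings are illustrative.
STAGES: List[str] = ["summary", "visual_interpretation", "reasoning", "conclusion"]

# Hypothetical signature: (stage name, question, earlier stage outputs) -> text.
StageGenerator = Callable[[str, str, Dict[str, str]], str]


def run_stages(question: str, generate_stage: StageGenerator) -> Dict[str, str]:
    """Run the stages in order, passing each one the outputs produced so far."""
    outputs: Dict[str, str] = {}
    for stage in STAGES:
        # Pass a copy so the generator cannot mutate the accumulated outputs.
        outputs[stage] = generate_stage(stage, question, dict(outputs))
    return outputs


def stub_generate(stage: str, question: str, previous: Dict[str, str]) -> str:
    """Placeholder for a model call; it only reports how much context it saw."""
    return f"[{stage}] based on {len(previous)} earlier stage(s)"


result = run_stages("How many apples are left?", stub_generate)
for stage, text in result.items():
    print(stage, "->", text)
```

Keeping the stages as separate, ordered outputs is what makes stage-wise search at test time possible, since each stage can be regenerated or compared without redoing the whole response.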
Community
In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning like GPT-o1. Our 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The key is training on structured data and a novel inference-time scaling method: stage-level beam search.
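For readers unfamiliar with the idea, stage-level beam search can be sketched roughly as follows: at every stage, several candidate outputs are proposed for each kept partial response, and only the best partial responses move on to the next stage. This is a generic illustration under assumptions, not the paper's exact procedure: `propose`, `score`, `num_candidates`, and `beam_width` are hypothetical, and the toy functions replace real model sampling and model-based comparison.

```python
from typing import Callable, List

Path = List[str]  # one chosen output per completed stage


def stage_level_beam_search(
    stages: List[str],
    propose: Callable[[str, Path, int], str],  # hypothetical: sample one candidate
    score: Callable[[Path], float],            # hypothetical: rate a partial path
    num_candidates: int = 4,
    beam_width: int = 2,
) -> Path:
    """Expand every kept path with candidates for the next stage, keep the best."""
    beams: List[Path] = [[]]
    for stage in stages:
        expanded: List[Path] = []
        for prefix in beams:
            for i in range(num_candidates):
                expanded.append(prefix + [propose(stage, prefix, i)])
        # Keep only the highest-scoring partial responses for the next stage.
        expanded.sort(key=score, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


def toy_propose(stage: str, prefix: Path, i: int) -> str:
    """Toy stand-in for sampling: the candidate index is encoded in the text."""
    return f"{stage}-{i}"


def toy_score(path: Path) -> float:
    """Toy stand-in for a quality judgment: higher candidate indices score higher."""
    return float(sum(int(step.rsplit("-", 1)[1]) for step in path))


best = stage_level_beam_search(["summary", "visual_interpretation"], toy_propose, toy_score)
print(best)
```

Pruning at stage boundaries rather than per token or per full response is the design choice the comment highlights; the scoring rule used here is only a placeholder.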
congrats, would be great to upload the model, here is the guide: https://huggingface.co/docs/hub/models-uploading
Paper summary is here: https://www.aimodels.fyi/papers/arxiv/llava-o1-let-vision-language-models-reason
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (2024)
- Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning (2024)
- Vision-Language Models Can Self-Improve Reasoning via Reflection (2024)
- Large Language Models Can Self-Improve in Long-context Reasoning (2024)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark (2024)
Get this paper in your agent:
hf papers read 2411.10440
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash