Abstract
GenLIP is a minimalist generative pretraining framework for Vision Transformers that directly predicts language tokens from visual tokens using language modeling, offering simplicity, scalability, and competitive performance in multimodal tasks.
In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
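The training objective described in the abstract — a single transformer that consumes visual tokens and is supervised only by next-token prediction on the caption — can be sketched compactly. The following PyTorch snippet is an illustrative reconstruction, not the authors' released code: the module names, model sizes, and the exact attention-mask layout (image tokens bidirectional, text tokens causal, as discussed in the community comment below) are assumptions.

```python
# Minimal sketch of a GenLIP-style objective (illustrative assumptions throughout):
# patchified image tokens are concatenated with caption embeddings and a single
# transformer is trained with a standard language modeling loss on text positions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenLIPSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4, n_heads=8,
                 patch=16, img_size=224):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.pos_img = nn.Parameter(torch.zeros(1, self.n_patches, d_model))
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_ids):
        # images: (B, 3, H, W); text_ids: (B, T) caption tokens
        v = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_img
        t = self.tok_embed(text_ids)
        x = torch.cat([v, t], dim=1)                      # (B, N + T, d_model)

        N, T = self.n_patches, text_ids.size(1)
        L = N + T
        # Hybrid mask: image tokens attend bidirectionally among themselves;
        # text tokens attend to all image tokens and causally to earlier text.
        mask = torch.full((L, L), float("-inf"), device=x.device)
        mask[:N, :N] = 0.0                                # image <-> image
        mask[N:, :N] = 0.0                                # text  -> image
        mask[N:, N:] = torch.triu(
            torch.full((T, T), float("-inf"), device=x.device), diagonal=1)

        h = self.blocks(x, mask=mask)
        # Position N-1 (last image token) predicts the first text token,
        # and each text position predicts the next one.
        logits = self.lm_head(h[:, N - 1:-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               text_ids.reshape(-1))
```

Under this reading, a batch of image-caption pairs is trained with this loss alone: no contrastive batch construction, negatives, or separate text decoder are involved.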
Community
that gated attention trick to curb attention sink in a single, concatenated vision+text transformer is the most interesting nugget here. by modulating attention outputs per token, it lets image tokens attend bidirectionally while text tokens generate causally, which feels like the right compatibility bridge to LLMs. i'd love to see an ablation on gating strength across data scales to confirm it's the main driver rather than a stability hack. the arXivLens breakdown helped me parse the method details; if you want a quick walkthrough, check https://arxivlens.com/PaperView/Details/let-vit-speak-generative-language-image-pre-training-4995-e72acd39. overall this supports the claim that a minimalist, data-efficient ViT can stand in for more complex multi-tower setups.
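For readers trying to picture the mechanism the comment refers to, here is one reader-level sketch of per-token output gating combined with the hybrid attention mask. The gate parameterization (a sigmoid computed from each token's input, applied to that token's attention output before the residual) and its placement are assumptions, not the paper's implementation.

```python
# Rough sketch of gated attention with an image-bidirectional / text-causal mask.
# The sigmoid gate lets each token scale down its own attention update, which is
# one way to damp "attention sink" behavior; exact details here are assumptions.
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)   # per-token, per-channel gate

    def forward(self, x, attn_mask):
        # x: (B, L, d_model); attn_mask: (L, L) additive float mask
        out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        g = torch.sigmoid(self.gate(x))            # values in (0, 1) per token
        return x + g * out                         # gated residual update

def hybrid_mask(n_img, n_txt, device=None):
    """Image tokens attend bidirectionally; text tokens attend to all image
    tokens and causally to earlier text tokens."""
    L = n_img + n_txt
    m = torch.full((L, L), float("-inf"), device=device)
    m[:n_img, :n_img] = 0.0
    m[n_img:, :n_img] = 0.0
    m[n_img:, n_img:] = torch.triu(
        torch.full((n_txt, n_txt), float("-inf"), device=device), diagonal=1)
    return m

# Example: one gated block over 196 image tokens followed by 32 text tokens.
x = torch.randn(2, 196 + 32, 512)
block = GatedSelfAttention()
y = block(x, hybrid_mask(196, 32))
print(y.shape)  # torch.Size([2, 228, 512])
```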
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings (2026)
- Hierarchical Pre-Training of Vision Encoders with Large Language Models (2026)
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning (2026)
- CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models (2026)
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation (2026)
- Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders (2026)
- GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend