Abstract
GenLIP is a minimalist generative pretraining framework for Vision Transformers that directly predicts language tokens from visual tokens using language modeling, offering simplicity, scalability, and competitive performance in multimodal tasks.
In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
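The training objective described in the abstract — a single transformer that consumes visual tokens and is supervised only by next-token prediction on the caption — can be sketched compactly. The following PyTorch snippet is an illustrative reconstruction, not the authors' released code: the module names, model sizes, and the exact attention-mask layout (image tokens bidirectional, text tokens causal, as discussed in the community comment below) are assumptions.

```python
# Minimal sketch of a GenLIP-style objective (illustrative assumptions throughout):
# patchified image tokens are concatenated with caption embeddings and a single
# transformer is trained with a standard language modeling loss on text positions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenLIPSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4, n_heads=8,
                 patch=16, img_size=224):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.pos_img = nn.Parameter(torch.zeros(1, self.n_patches, d_model))
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_ids):
        # images: (B, 3, H, W); text_ids: (B, T) caption tokens
        v = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_img
        t = self.tok_embed(text_ids)
        x = torch.cat([v, t], dim=1)                      # (B, N + T, d_model)

        N, T = self.n_patches, text_ids.size(1)
        L = N + T
        # Hybrid mask: image tokens attend bidirectionally among themselves;
        # text tokens attend to all image tokens and causally to earlier text.
        mask = torch.full((L, L), float("-inf"), device=x.device)
        mask[:N, :N] = 0.0                                # image <-> image
        mask[N:, :N] = 0.0                                # text  -> image
        mask[N:, N:] = torch.triu(
            torch.full((T, T), float("-inf"), device=x.device), diagonal=1)

        h = self.blocks(x, mask=mask)
        # Position N-1 (last image token) predicts the first text token,
        # and each text position predicts the next one.
        logits = self.lm_head(h[:, N - 1:-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               text_ids.reshape(-1))
```

Under this reading, a batch of image-caption pairs is trained with this loss alone: no contrastive batch construction, negatives, or separate text decoder are involved.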
Community
that gated attention trick to curb attention sink in a single, concatenated vision+text transformer is the most interesting nugget here. by modulating attention outputs per token, it lets image tokens attend bidirectionally while text tokens generate causally, which feels like the right compatibility bridge to LLMs. i'd love to see an ablation on gating strength across data scales to confirm it's the main driver rather than a stability hack. the arXivLens breakdown helped me parse the method details; if you want a quick walkthrough, check https://arxivlens.com/PaperView/Details/let-vit-speak-generative-language-image-pre-training-4995-e72acd39. overall this supports the claim that a minimalist, data-efficient ViT can stand in for more complex multi-tower setups.
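For readers trying to picture the mechanism the comment refers to, here is one reader-level sketch of per-token output gating combined with the hybrid attention mask. The gate parameterization (a sigmoid computed from each token's input, applied to that token's attention output before the residual) and its placement are assumptions, not the paper's implementation.

```python
# Rough sketch of gated attention with an image-bidirectional / text-causal mask.
# The sigmoid gate lets each token scale down its own attention update, which is
# one way to damp "attention sink" behavior; exact details here are assumptions.
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)   # per-token, per-channel gate

    def forward(self, x, attn_mask):
        # x: (B, L, d_model); attn_mask: (L, L) additive float mask
        out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        g = torch.sigmoid(self.gate(x))            # values in (0, 1) per token
        return x + g * out                         # gated residual update

def hybrid_mask(n_img, n_txt, device=None):
    """Image tokens attend bidirectionally; text tokens attend to all image
    tokens and causally to earlier text tokens."""
    L = n_img + n_txt
    m = torch.full((L, L), float("-inf"), device=device)
    m[:n_img, :n_img] = 0.0
    m[n_img:, :n_img] = 0.0
    m[n_img:, n_img:] = torch.triu(
        torch.full((n_txt, n_txt), float("-inf"), device=device), diagonal=1)
    return m

# Example: one gated block over 196 image tokens followed by 32 text tokens.
x = torch.randn(2, 196 + 32, 512)
block = GatedSelfAttention()
y = block(x, hybrid_mask(196, 32))
print(y.shape)  # torch.Size([2, 228, 512])
```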
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings (2026)
- Hierarchical Pre-Training of Vision Encoders with Large Language Models (2026)
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning (2026)
- CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models (2026)
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation (2026)
- Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders (2026)
- GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend