arxiv:2605.21573

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Published on May 20

Submitted by Jinjing Zhao on May 25

#3 Paper of the day

Microsoft

Authors:

Fangyun Wei, Yang Yue, Xiuyu Wu, Yunuo Chen

Abstract

Lens is a compact 3.8B-parameter text-to-image model achieving superior performance with reduced training compute through dense caption datasets, multi-resolution batching, efficient architecture, and optimization techniques.

AI-generated summary

We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440×1440, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024×1024 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
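To make the mixed-resolution idea concrete, here is a minimal sketch of how per-image target shapes could be derived from an aspect ratio and a pixel budget. This is not the authors' procedure: the abstract does not say how shapes are chosen, and the function name `bucket_shape`, the snapping of each side to a multiple of 16, and the example pixel budgets are all assumptions made for illustration.

```python
import math


def bucket_shape(aspect_ratio: float, area: int = 1024 * 1024,
                 multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) with width/height close to aspect_ratio
    and width*height close to area.

    Hypothetical helper: snapping to `multiple` is an assumption, not a
    detail taken from the Lens paper.
    """
    # The abstract states a supported range of 1:2 to 2:1.
    if not 0.5 <= aspect_ratio <= 2.0:
        raise ValueError("aspect ratio outside the 1:2 to 2:1 range")
    width = math.sqrt(area * aspect_ratio)
    height = width / aspect_ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


# One illustrative batch mixing aspect ratios and pixel budgets.
batch_specs = [(0.5, 512 * 512), (1.0, 1024 * 1024), (2.0, 1440 * 1440)]
batch_shapes = [bucket_shape(ratio, area) for ratio, area in batch_specs]
print(batch_shapes)
```

Because each side is rounded independently, the resulting ratio and area are only approximately the requested ones. How images of different shapes are then combined into a single batch (padding, packing, or something else) is not specified in the abstract and is left out of this sketch.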

Community

Jinjing713

Paper submitter, about 18 hours ago (edited about 17 hours ago)

Lens is a 3.8B-parameter foundational text-to-image model designed for efficient training and fast high-resolution generation. It combines dense-caption pre-training, mixed-resolution learning, GPT-OSS multi-layer text features, and the FLUX.2 semantic VAE to reach competitive quality with substantially less training compute than larger T2I models.

Highlights

Fully open-sourced model suite: Lens-Base, Lens-RL, and Lens-Turbo are released, covering both a 20-step high-quality version and a 4-step fast-inference version.
Fully transparent technical details: We disclose the complete pipeline, including data construction, pre-training, RL post-training, Reasoner design, distillation acceleration, inference configuration, and ablation analysis.
High training efficiency: Lens is trained with 128 A100 GPUs, requiring only about 19.3% of the training cost of Z-Image.
State-of-the-art performance: Lens achieves leading results on multiple benchmarks, including OneIG, GenEval, LongText, and CVTG.
Fast inference speed: Lens generates a 1024-resolution image in only 3.15s on a single H100 GPU, while Lens-Turbo reduces this to 0.84s.
Flexible generation capability: Lens supports up to 1440-resolution generation, arbitrary aspect ratios from 1:2 to 2:1, multilingual prompts, and automatic prompt enhancement through the Reasoner.
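As a quick sanity check on the efficiency figures quoted above, the following sketch converts them into ratios. It uses only numbers stated on this page; the variable names are illustrative, and nothing here reflects how the authors measured compute or latency.

```python
# Figures quoted on this page.
LENS_COMPUTE_FRACTION = 0.193  # Lens training compute relative to Z-Image
LENS_SECONDS = 3.15            # 1024-resolution image on a single H100
TURBO_SECONDS = 0.84           # Lens-Turbo, 4-step generation

# How many times less training compute Lens uses than Z-Image.
compute_reduction = 1 / LENS_COMPUTE_FRACTION

# How many times faster Lens-Turbo is than Lens at inference.
turbo_speedup = LENS_SECONDS / TURBO_SECONDS

print(f"Training compute reduction vs Z-Image: {compute_reduction:.2f}x")
print(f"Lens-Turbo speedup over Lens: {turbo_speedup:.2f}x")
```

In other words, "about 19.3%" corresponds to roughly 5.2 times less training compute, and the distilled model is about 3.75 times faster per image.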