Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Abstract
Transformers with frozen embeddings enable efficient scaling through modular composition and layer-wise growth, improving performance on reasoning tasks without catastrophic forgetting.
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
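The merge rule described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes two experts that share the same frozen embedding layer and vocabulary, so their raw output logits are directly comparable and can be averaged token-wise.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for turning merged logits into probabilities.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def merge_logits(expert_logits):
    # Post-training MoE composition: average the raw output logits of
    # experts that share one frozen embedding substrate and vocabulary.
    # No architectural change and no further training is required.
    return np.mean(np.stack(expert_logits, axis=0), axis=0)

# Hypothetical example: two specialist experts over a 5-token vocabulary.
ru_logits = np.array([2.0, 0.5, -1.0, 0.0, 1.0])
zh_logits = np.array([1.0, 1.5, -0.5, 0.2, 0.8])

merged = merge_logits([ru_logits, zh_logits])  # element-wise mean
probs = softmax(merged)                        # merged next-token distribution
```

Because both experts index the same fixed vocabulary through the same frozen embeddings, position *i* of each logit vector refers to the same token, which is what makes naive averaging meaningful here.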
Community
How does an LLM understand the meaning of 'wRiTe' when its building blocks—the individual character tokens 'w', 'R', 'i'—have no semantic content? This simple question challenges the very foundation of modern AI.
Our paper argues that high-level meaning is not contained in embeddings, but is constructed by the Transformer architecture. We prove this by replacing standard trainable embeddings with a completely frozen layer derived from the raw visual structure of Unicode glyphs. These non-semantic vectors are fixed before training even begins.
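To make "fixed before training even begins" concrete, here is a toy sketch of a deterministic, non-trainable character embedding. The paper derives vectors from rasterized Unicode glyphs; as a stand-in (an assumption, not the paper's method), this sketch uses the binary expansion of the codepoint so the example stays dependency-free. The key property is the same: the mapping is a pure function of the character, identical across models and training runs.

```python
import numpy as np

def frozen_char_embedding(ch, dim=32):
    # Deterministic, non-semantic embedding: a fixed function of the
    # character alone. (Placeholder for the paper's glyph-bitmap features;
    # the codepoint bit pattern is used here purely for illustration.)
    bits = [(ord(ch) >> i) & 1 for i in range(dim)]
    return np.array(bits, dtype=np.float32)

# The same character always maps to the same frozen vector, so any two
# models built on this substrate share a common "docking port".
v1 = frozen_char_embedding('w')
v2 = frozen_char_embedding('w')
```

Since the embedding layer carries no learned semantics, any meaning the model exhibits must be constructed by the Transformer layers above it, which is the paper's central claim.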
Building on our foundational paper (arXiv:2507.04886, https://huggingface.co/papers/2507.04886), we introduce "Constructive Learning." Our frozen, non-semantic embeddings act as a universal substrate, allowing us to grow models layer-by-layer.
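The layer-by-layer growth loop can be sketched as follows. This is a schematic skeleton under stated assumptions, not the released training code: each "layer" is reduced to a toy ReLU block, and the actual gradient step is elided. What it shows is the constructive schedule itself: previously trained layers (like the embedding substrate) are frozen, and only the newest layer is trainable.

```python
import numpy as np

rng = np.random.default_rng(0)

class Layer:
    """Toy stand-in for a Transformer block (a single ReLU-activated map)."""
    def __init__(self, dim):
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.trainable = True
    def forward(self, x):
        return np.maximum(x @ self.w, 0.0)

def grow_model(depth, dim=16):
    # Constructive training schedule: the model is "grown" one layer at a
    # time. Before each new layer is stacked, everything trained so far is
    # frozen, so optimization only ever touches the newest layer.
    layers = []
    for _ in range(depth):
        for old in layers:
            old.trainable = False   # freeze the already-trained stack
        layers.append(Layer(dim))   # the new top layer is the only trainable one
        # ... train layers[-1] against the task loss here ...
    return layers

model = grow_model(depth=4)
```

One design consequence worth noting: because the input embeddings are frozen and earlier layers are never revisited, each growth step optimizes a strictly smaller parameter set than end-to-end training of the full stack, which is where the resource-efficiency claim comes from.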