Papers
arxiv:2507.07129

Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

Published on Jul 8, 2025
· Submitted by Andrey on Jul 11, 2025
Abstract

Transformers with frozen embeddings enable efficient scaling through modular composition and layer-wise growth, improving performance on reasoning tasks without catastrophic forgetting.

AI-generated summary

The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
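The modular-composition step described in the abstract (merging specialists post-training by averaging their output logits over a shared vocabulary) can be sketched as follows. This is a minimal illustration, not the paper's released code: the toy "experts" are fixed linear maps, and the vocabulary size is an arbitrary assumption.

```python
import numpy as np

def merge_by_logit_averaging(expert_logit_fns, token_ids):
    """Combine specialist models post-training by uniformly averaging
    their output logits. Each element of `expert_logit_fns` maps an
    array of token ids to per-token logits over the shared vocabulary
    (shared because all experts sit on the same frozen embedding substrate)."""
    logits = [f(token_ids) for f in expert_logit_fns]
    return np.mean(logits, axis=0)  # simple uniform average, no gating

# Toy stand-in "experts": two fixed lookup tables over a shared vocab of 5.
rng = np.random.default_rng(0)
W_ru, W_zh = rng.normal(size=(2, 5, 5))
expert_ru = lambda ids: W_ru[ids]  # (seq_len, vocab) logits
expert_zh = lambda ids: W_zh[ids]

ids = np.array([1, 3])
merged = merge_by_logit_averaging([expert_ru, expert_zh], ids)
assert merged.shape == (2, 5)
assert np.allclose(merged, (expert_ru(ids) + expert_zh(ids)) / 2)
```

Because the averaging happens purely at the logit level, no weights are modified and no retraining occurs, which is why the merge cannot introduce catastrophic forgetting in either constituent expert.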

Community

Paper author Paper submitter

How does an LLM understand the meaning of 'wRiTe' when its building blocks—the individual character tokens 'w', 'R', 'i'—have no semantic content? This simple question challenges the very foundation of modern AI.
Our paper argues that high-level meaning is not contained in embeddings, but is constructed by the Transformer architecture. We prove this by replacing standard trainable embeddings with a completely frozen layer derived from the raw visual structure of Unicode glyphs. These non-semantic vectors are fixed before training even begins.
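The frozen-embedding idea above can be sketched as a deterministic function from character to vector. The paper derives its vectors from rendered Unicode glyph bitmaps; since font rendering is environment-dependent, this toy stand-in expands the codepoint's bit pattern instead. The only property it is meant to illustrate is the one the comment asserts: the vectors are fixed before training, carry no learned semantics, and distinguish case variants like 'w' and 'W'.

```python
import numpy as np

def glyph_embedding(ch, dim=16):
    """Deterministic, non-trainable embedding for a single character.
    NOTE: a stand-in for the paper's actual glyph-bitmap features --
    here the codepoint's low bits play the role of a pseudo-bitmap."""
    bits = [(ord(ch) >> i) & 1 for i in range(dim)]
    v = np.array(bits, dtype=np.float64)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v  # unit-normalized, never updated by training

# Fixed before training: the same character always maps to the same vector,
# and case variants get distinct vectors with no built-in semantic relation.
assert np.allclose(glyph_embedding('w'), glyph_embedding('w'))
assert not np.allclose(glyph_embedding('w'), glyph_embedding('W'))
```

Any similarity the model later exhibits between 'w' and 'W' must therefore be constructed by the Transformer layers, not read out of the embedding table.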

Paper author Paper submitter

Building on our foundational paper (arXiv:2507.04886 https://huggingface.co/papers/2507.04886 ), we introduce "Constructive Learning." Our frozen, non-semantic embeddings act as a universal substrate, allowing us to grow models layer-by-layer.
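The layer-by-layer growth loop can be sketched as: freeze everything trained so far, stack a fresh layer on top, and train only that newest layer before growing again. The class and method names below are illustrative, and the toy ReLU blocks stand in for real Transformer layers.

```python
import numpy as np

class Layer:
    """Toy stand-in for one Transformer block."""
    def __init__(self, dim, rng):
        self.W = rng.normal(scale=0.02, size=(dim, dim))
        self.trainable = True  # only the newest layer receives gradients

    def forward(self, x):
        return np.maximum(x @ self.W, 0.0)  # toy ReLU block

class GrowingModel:
    """Grow depth one layer at a time on a frozen embedding substrate."""
    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.layers = []

    def grow(self):
        for layer in self.layers:      # freeze all previously trained layers
            layer.trainable = False
        self.layers.append(Layer(self.dim, self.rng))

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

model = GrowingModel()
for _ in range(3):
    model.grow()
    # ... train only the newest (trainable) layer here, then grow again ...
assert [layer.trainable for layer in model.layers] == [False, False, True]
```

Because each growth step optimizes only one layer's parameters, the memory and compute cost per step stays roughly constant as the model deepens, which is the resource-efficiency argument the comment makes.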


Get this paper in your agent:

hf papers read 2507.07129
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 16


Datasets citing this paper 0

No datasets linking this paper

Cite arxiv.org/abs/2507.07129 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Spaces linking this paper

Cite arxiv.org/abs/2507.07129 in a Space README.md to link it from this page.

Collections including this paper 3