This paper was formerly known as:
- Inheritune: Training Smaller Yet More Attentive Language Models (v2)
- Pre-training Small Base LMs with Fewer Tokens (v1)
Inheritune is an efficient training method for developing smaller, high-performing language models by inheriting knowledge from larger pre-trained models. Our approach addresses a critical inefficiency in standard transformer architectures: attention matrices in deeper layers often degenerate to single-column patterns, creating "lazy layers" that contribute minimally to model performance.
This paper has been published in Transactions on Machine Learning Research (TMLR); the reviews are available on OpenReview.
- 📌 Used in the wild: This paper is used by 21 Hugging Face models (HuggingFace page).
- 🌍 Notable adoption: Gumini multilingual (English and Korean) models are trained following the Inheritune recipe (Gumini report).
- 📰 Media feature: Covered by Marktechpost (article).
- 🎥 Video tutorial: A YouTube tutorial by Prince Canuma is available (link).
├── GPT2-experiments/ # GPT-2 training/analysis experiments for the paper
├── lit-gpt/ # Code adapted from lit-gpt / small-LM training utilities
├── analysis/ # Attention rank computation, softmax analysis, plotting scripts
├── images/ # Figures used in the README / paper artifacts
├── README.md # Main project documentation
├── poster.pdf # Project poster (PDF)
└── attention-collapse-demo/ # Toy examples of attention collapse in a simple setting
The rank of many attention matrices collapses to near rank-1 during training; see the demo notebook.
Figure 1: Toy example of rank collapse in a pure self-attention network.
Figure 2: Rank analysis of GPT-2 Medium (355M).
Figure 3: Rank analysis of LLaMA-3 8B reveals that nearly half of the attention heads (in red) exhibit rank collapse.
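The single-column pattern behind this collapse is easy to reproduce. The sketch below is not from this repo; the dominance construction and tolerance are illustrative. It builds a toy attention map in which one key dominates every query's scores, and compares its numerical rank with that of a healthy random head:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def effective_rank(m, tol=1e-3):
    # Count singular values above tol * (largest singular value).
    s = np.linalg.svd(m, compute_uv=False)
    return int((s > tol * s[0]).sum())

rng = np.random.default_rng(0)
T = 16  # toy sequence length

# Healthy head: random attention logits give a high-rank map.
logits = rng.standard_normal((T, T))
A_healthy = softmax(logits)

# Collapsed head: one key dominates every query's scores, so every
# softmax row concentrates on the same column -> near rank-1 map.
logits_dom = logits.copy()
logits_dom[:, 0] += 50.0  # exaggerated dominance, for illustration only
A_collapsed = softmax(logits_dom)

print(effective_rank(A_healthy), effective_rank(A_collapsed))  # collapsed rank is 1
```

In trained models the dominance emerges gradually rather than by construction, but the diagnostic is the same: the singular-value spectrum of the post-softmax attention matrix flattens to a single dominant direction.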
Inheritune follows a simple three-step process:
- Inherit: Copy the potent early transformer layers from a larger pre-trained model.
- Train: Continually train the inherited layers on the pre-training dataset.
- Expand: Progressively grow the model until it reaches the desired performance.
This approach leverages the knowledge already captured in early layers while eliminating lazy deeper layers.
Figure 4: Overview of the Inheritune training recipe, using a 24-layer GPT-2 Medium model as an example.
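The Inherit step amounts to initializing a shallower student with the first k blocks of the teacher. A minimal PyTorch sketch, using generic `nn.TransformerEncoder` blocks as a stand-in for GPT-2 blocks (the sizes and helper names here are illustrative, not the repo's code):

```python
import torch
from torch import nn

def make_model(n_layers, d_model=64, n_head=4):
    # Toy stand-in for a GPT-style stack of transformer blocks.
    block = nn.TransformerEncoderLayer(d_model, n_head,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
    return nn.TransformerEncoder(block, num_layers=n_layers)

def inherit(teacher, n_keep):
    # Step 1 of the recipe: copy the teacher's first n_keep layers
    # into a shallower student. Steps 2-3 (continual training and
    # progressive expansion) are omitted here.
    student = make_model(n_keep)
    for i in range(n_keep):
        student.layers[i].load_state_dict(teacher.layers[i].state_dict())
    return student

teacher = make_model(24)        # e.g. GPT-2 Medium has 24 layers
student = inherit(teacher, 12)  # half-depth student

# The inherited layers match the teacher exactly at initialization.
assert torch.equal(student.layers[5].linear1.weight,
                   teacher.layers[5].linear1.weight)
```

After inheriting, the Train step continues pre-training the student on the same data, and the Expand step adds layers when performance plateaus.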
Our 24-layer variant converges faster and matches the validation loss of the full 48-layer model despite having half the depth.
Figure 5: GPT-2 XLarge variants derived using Inheritune converge faster and match the final validation loss of the full-sized model, despite having far fewer layers.
| Task | Full model (48 layers) | Ours (24 layers) |
|---|---|---|
| **Accuracy-based tasks (↑)** | | |
| ARC-E | 50.38 | 51.22 |
| PIQA | 66.70 | 66.87 |
| SciQ | 77.00 | 79.20 |
| HellaSwag | 33.65 | 34.20 |
| LAMBADA | 39.90 | 43.30 |
| WinoGrande | 51.93 | 53.28 |
| BoolQ | 57.86 | 60.40 |
| Average | 53.92 | 55.50 |
| **Perplexity-based tasks (↓)** | | |
| Wikitext | 25.46 | 25.52 |
| LAMBADA | 20.24 | 16.51 |
| Average | 22.85 | 21.01 |
Table 1. Downstream evaluation of GPT-2 XLarge (1.5B) trained from scratch vs. a 24-layer model trained with Inheritune (Ours).
Figure 6: GPT-2 Large† variants derived using Inheritune converge faster and match the final validation loss of the full-sized model, despite having far fewer layers.
| Method | Layers | ARC-E | PIQA | SciQ | HellaSwag | LAMBADA | Avg |
|---|---|---|---|---|---|---|---|
| Random Init | 32 | 52.48 | 64.58 | 75.30 | 32.65 | 22.20 | 49.44 |
| Random Init | 16 | 50.34 | 63.11 | 75.00 | 30.86 | 21.56 | 48.17 |
| Inheritune | 16 | 52.90 | 63.55 | 76.10 | 32.14 | 24.06 | 49.75 |
Table 2. Downstream evaluation of GPT-2 Large† (680M) trained from scratch vs. a 16-layer model trained with Inheritune (Ours).
If you find this work helpful, please consider citing us:
@article{sanyal2026when,
  title={When Attention Collapses: How Degenerate Layers in {LLM}s Enable Smaller, Stronger Models},
  author={Sunny Sanyal and Ravid Shwartz-Ziv and Alex Dimakis and Sujay Sanghavi},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=2zQn0bUoPf}
}
The training code for the 1B–2B small language models is mainly adapted from lit-gpt. The code for the GPT-2 experiments is mainly adapted from Sophia and nanoGPT. The llama image was created using DALL·E.
