
When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models

License: MIT · arXiv



Overview

This paper was formerly titled:

  • Inheritune: Training Smaller Yet More Attentive Language Models (v2)
  • Pre-training Small Base LMs with Fewer Tokens (v1)

Inheritune is an efficient training method for developing smaller, high-performing language models by inheriting knowledge from larger pre-trained models. Our approach addresses a critical inefficiency in standard transformer architectures: attention matrices in deeper layers often degenerate to single-column patterns, creating "lazy layers" that contribute minimally to model performance.

This paper has been published as a journal paper in Transactions on Machine Learning Research (TMLR) and was reviewed on OpenReview.

Selected Impacts

  • 📌 Used in the wild: This paper is used by 21 Hugging Face models (HuggingFace page).
  • 🌍 Notable adoption: The Gumini multilingual (English and Korean) models were trained following the Inheritune recipe (Gumini report).
  • 📰 Media feature: Covered by Marktechpost (article).
  • 🎥 Video tutorial: A YouTube tutorial by Prince Canuma is available (link).

Repository Structure

.
├── GPT2-experiments/        # GPT-2 training/analysis experiments for the paper
├── lit-gpt/                 # Code adapted from lit-gpt / small-LM training utilities
├── analysis/                # Attention rank computation, softmax analysis, plotting scripts
├── images/                  # Figures used in the README / paper artifacts
├── README.md                # Main project documentation
├── poster.pdf               # Project poster (PDF)
└── attention-collapse-demo/ # Toy examples of attention collapse in a simple setting


The Problem: Attention Collapse

During training, many attention matrices collapse to near rank-1. Refer to the demo notebook.

Figure 1: Toy example of rank collapse in a pure self-attention network.
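The collapse can be checked numerically. A minimal sketch (NumPy; the tolerance and the toy matrices are illustrative, not the repository's analysis code) that counts singular values above a relative threshold:

```python
import numpy as np

def effective_rank(attn: np.ndarray, tol: float = 1e-3) -> int:
    """Count singular values above tol times the largest one."""
    s = np.linalg.svd(attn, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

T = 8  # sequence length
# Collapsed pattern: every query puts all of its mass on key 0,
# so all rows are identical and the matrix is rank-1.
collapsed = np.tile(np.eye(T)[0], (T, 1))
# A full-rank orthogonal matrix, for contrast.
healthy = np.linalg.qr(np.random.default_rng(0).standard_normal((T, T)))[0]

print(effective_rank(collapsed))  # 1
print(effective_rank(healthy))    # 8
```

Applied per head to an averaged attention map, this kind of check is what flags the "lazy" heads highlighted in red in the figures below.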

Attention collapse in GPT models

Figure 2: Rank analysis of GPT-2 Medium (355M).

Attention collapse in LLaMA-3 8B

Figure 3: Rank analysis of LLaMA-3 8B reveals that nearly half of the attention heads (in red) exhibit rank collapse.


Proposed Algorithm: Inheritune

Inheritune follows a simple three-step process:

  1. Inherit: Copy potent early transformer layers from a larger pre-trained model.
  2. Train: Continually train the inherited layers on the original pre-training dataset.
  3. Expand: Progressively grow the model until reaching desired performance.

This approach leverages the knowledge already captured in early layers while eliminating lazy deeper layers.
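The "Inherit" step of the recipe above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the repository's training code; the toy blocks, dimensions, and the `n_inherit` count are assumptions:

```python
import torch.nn as nn

def make_blocks(n_layers: int, d_model: int = 64) -> nn.ModuleList:
    """Stand-in for a stack of transformer blocks."""
    return nn.ModuleList(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        for _ in range(n_layers)
    )

teacher = make_blocks(24)   # plays the role of a pre-trained 24-layer model
n_inherit = 12              # keep only the early, high-rank layers
student = make_blocks(n_inherit)

# Step 1 (Inherit): copy the weights of the first n_inherit layers.
for s_blk, t_blk in zip(student, teacher):
    s_blk.load_state_dict(t_blk.state_dict())

# Step 2 (Train): continue pre-training `student` on the original data.
# Step 3 (Expand): append freshly initialised blocks if more capacity
# is needed, then repeat Step 2 until the target quality is reached.
```

In a real run, the embedding layers and final layer norm would be copied as well; the sketch only shows the transformer-block surgery.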

Figure 4: Overview of the Inheritune training recipe, illustrated with a 24-layer GPT-2 Medium model.


📊 Selected Results

GPT-2 XLarge (1.5B) on OpenWebText-9B

Our 24-layer variant converges faster and matches the validation loss of the full 48-layer model despite being 50% smaller.

Figure 5: GPT-2 XLarge variants derived with Inheritune converge faster and match the final validation loss of the full-sized model, despite having far fewer layers.

| Task | Full model (48 layers) | Ours (24 layers) |
|---|---:|---:|
| **Accuracy-based tasks (↑)** | | |
| ARC-E | 50.38 | 51.22 |
| PIQA | 66.70 | 66.87 |
| SciQ | 77.00 | 79.20 |
| HellaSwag | 33.65 | 34.20 |
| LAMBADA | 39.90 | 43.30 |
| WinoGrande | 51.93 | 53.28 |
| BoolQ | 57.86 | 60.40 |
| Average | 53.92 | 55.50 |
| **Perplexity-based tasks (↓)** | | |
| Wikitext | 25.46 | 25.52 |
| LAMBADA | 20.24 | 16.51 |
| Average | 22.85 | 21.01 |

Table 1. Downstream evaluation of GPT-2 XLarge (1.5B) trained from scratch vs. a 24-layer model trained with Inheritune (Ours).
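For the perplexity rows, lower is better: perplexity is the exponential of the average per-token negative log-likelihood, so seemingly small loss differences translate into visible perplexity gaps. A one-line reminder of the relation (the loss values here are illustrative, not taken from the table):

```python
import math

def perplexity(nll_per_token: float) -> float:
    """Perplexity = exp(average negative log-likelihood per token)."""
    return math.exp(nll_per_token)

print(round(perplexity(3.0), 2))  # 20.09
print(round(perplexity(2.8), 2))  # 16.44
```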

GPT-2 Large (680M) on FineWeb-Edu

Figure 6: GPT-2 Large variants derived with Inheritune converge faster and match the final validation loss of the full-sized model, despite having far fewer layers.

| Method | Layers | ARC-E | PIQA | SciQ | HellaSwag | LAMBADA | Avg |
|---|---:|---:|---:|---:|---:|---:|---:|
| Random Init | 32 | 52.48 | 64.58 | 75.3 | 32.65 | 22.2 | 49.44 |
| Random Init | 16 | 50.34 | 63.11 | 75.0 | 30.86 | 21.56 | 48.17 |
| Inheritune | 16 | 52.9 | 63.55 | 76.1 | 32.14 | 24.06 | 49.75 |

Table 2. Downstream evaluation of GPT-2 Large (680M) trained from scratch vs. a 16-layer model trained with Inheritune (Ours).


Citation

If you find this work helpful, please consider citing us:

@article{
  sanyal2026when,
  title={When Attention Collapses: How Degenerate Layers in {LLM}s Enable Smaller, Stronger Models},
  author={Sunny Sanyal and Ravid Shwartz-Ziv and Alex Dimakis and Sujay Sanghavi},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=2zQn0bUoPf},
  note={}
}

 


Acknowledgement

The training code for the 1B-2B small language models is mainly adapted from litgpt. The code for the GPT-2 experiments is mainly adapted from Sophia and nanoGPT. The llama image was created using DALL·E.
