A complete transformer implementation from scratch using only NumPy — no PyTorch, no TensorFlow. Every matrix multiplication, every gradient, every optimization step is explicit and inspectable.
Built for learning and interpretability research: understand exactly how transformers work by building one from raw math.
- Full GPT-architecture transformer — token embeddings, positional encoding, multi-head attention, feed-forward networks, layer norm, residual connections
- Complete backpropagation — hand-derived gradients for every component
- Adam optimizer — from scratch with weight decay, warmup, cosine scheduling
- Interpretability probes — attention pattern analysis, head classification, induction head detection, logit attribution, activation caching
- Training pipeline — cross-entropy loss, gradient clipping, perplexity tracking
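As an illustration of the optimizer piece, here is a minimal sketch of an AdamW-style update with linear warmup and cosine decay. The function names, default hyperparameters, and schedule shape are assumptions for illustration, not the repo's exact API:

```python
import numpy as np

def lr_schedule(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Linear warmup, then cosine decay to zero (illustrative names/values)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + np.cos(np.pi * progress))

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01):
    """One decoupled-weight-decay Adam update; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```

The decoupled weight-decay term is applied directly to the parameter rather than folded into the gradient, which keeps the decay strength independent of the adaptive scaling.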
Token IDs → Embedding → Positional Encoding
→ [TransformerBlock × N]
→ LayerNorm → MultiHeadAttention → Residual
→ LayerNorm → FeedForward (GELU) → Residual
→ Final LayerNorm → Output Projection → Logits
Every component has .forward(), .backward(), and .parameters — fully differentiable, fully inspectable.
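To show what a hand-derived `.forward()`/`.backward()` pair looks like, here is a minimal LayerNorm in that style. This is a sketch of the pattern, not the repo's exact class; the backward pass uses the standard closed-form layer-norm gradient:

```python
import numpy as np

class LayerNorm:
    """Layer norm over the last axis, with an explicit hand-derived backward."""
    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.eps = eps

    def forward(self, x):
        self.var = x.var(-1, keepdims=True)
        self.xhat = (x - x.mean(-1, keepdims=True)) / np.sqrt(self.var + self.eps)
        return self.gamma * self.xhat + self.beta

    def backward(self, dy):
        # dx = (dxhat - mean(dxhat) - xhat * mean(dxhat * xhat)) / std
        dxhat = dy * self.gamma
        self.dgamma = (dy * self.xhat).sum(0)
        self.dbeta = dy.sum(0)
        std = np.sqrt(self.var + self.eps)
        return (dxhat - dxhat.mean(-1, keepdims=True)
                - self.xhat * (dxhat * self.xhat).mean(-1, keepdims=True)) / std
```

Because every intermediate (`xhat`, `var`) is cached on the instance, the gradient can be checked against finite differences, which is exactly the kind of inspection this repo is built for.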
Classify attention heads by behavior:
- Positional heads — attend to fixed offsets (previous token, etc.)
- Content heads — attend based on token meaning
- Induction heads — implement in-context learning by copying patterns
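A simple classifier in this spirit can be built from two statistics of a head's attention matrix: the mean row entropy (how diffuse the head is) and the attended offset pattern. The thresholds and return labels below are illustrative guesses, not the repo's exact rules:

```python
import numpy as np

def classify_head(attn, entropy_threshold=1.0, offset_std_threshold=1.0):
    """Classify one head from its (seq, seq) row-stochastic attention matrix."""
    seq = attn.shape[0]
    p = np.clip(attn, 1e-12, 1.0)
    entropy = -(p * np.log2(p)).sum(-1).mean()       # mean row entropy in bits
    q = np.arange(seq)[:, None]
    k = np.arange(seq)[None, :]
    avg_dist = (attn * np.abs(q - k)).sum(-1).mean() # mean query→key distance
    if entropy < entropy_threshold:
        offsets = (attn * (q - k)).sum(-1)           # expected offset per query
        if offsets.std() < offset_std_threshold:     # same offset everywhere
            return "positional", entropy, avg_dist
        return "mixed", entropy, avg_dist
    return "content", entropy, avg_dist
```

A previous-token head scores near-zero entropy with a constant offset of 1; a head attending broadly over content scores high entropy regardless of position.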
Decompose a prediction into per-layer contributions. Which transformer block is responsible for predicting the next token?
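The core trick is that the residual stream is additive: each block's contribution to a logit is the dot product of what that block wrote to the stream with the target token's unembedding column. A hedged sketch (the function name and the simplification of ignoring the final LayerNorm are mine, not the repo's):

```python
import numpy as np

def logit_attribution(block_outputs, W_out, token_id):
    """Attribute one target logit to per-block residual-stream writes.

    block_outputs: residual-stream states after each block, each shape (d_model,),
    taken at the position of interest. W_out: (d_model, vocab) unembedding.
    The final LayerNorm is ignored here for simplicity, so the parts sum exactly.
    """
    contribs = []
    prev = np.zeros_like(block_outputs[0])
    for out in block_outputs:
        delta = out - prev                          # what this block added
        contribs.append(float(delta @ W_out[:, token_id]))
        prev = out
    return contribs
```

Because the deltas telescope, the per-block contributions sum to the final logit, which is what lets numbers like the +4.80 / -8.95 breakdown below be read as a complete decomposition.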
Record intermediate activations at every layer. Track residual stream norms, detect dead neurons, measure saturation.
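Given a cache of recorded activations, these diagnostics reduce to a few array reductions. The dict layout, field names, and thresholds below are assumptions for illustration:

```python
import numpy as np

def activation_stats(cache):
    """cache: dict of layer name -> (batch, d) activations (e.g. post-GELU).

    Returns per-layer norm, dead-neuron, and saturation diagnostics.
    The 1e-6 dead threshold and 3.0 saturation threshold are illustrative.
    """
    stats = {}
    for name, act in cache.items():
        stats[name] = {
            "mean_norm": float(np.linalg.norm(act, axis=-1).mean()),
            "dead": np.all(np.abs(act) < 1e-6, axis=0),      # never fires
            "saturation": float((np.abs(act) > 3.0).mean()), # fraction extreme
        }
    return stats
```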
git clone https://github.com/BabyChrist666/transformer-lab.git
cd transformer-lab
pip install -r requirements.txt
# Run tests (61 passing)
pytest tests/ -v
# Run the full experiment: train + generate + analyze
python -m experiments.train_and_analyze

Training a 152K-parameter, 3-layer, 4-head transformer on Shakespeare:
| Metric | Value |
|---|---|
| Parameters | 152,320 |
| Final Loss | 3.24 |
| Final Perplexity | 25.5 |
| Training Time | 10.4s |
| Layer | Head | Type | Entropy | Avg Distance |
|---|---|---|---|---|
| 0 | 0 | mixed | 0.511 | 18.0 |
| 0 | 1 | mixed | 0.906 | 9.0 |
| 1 | 0 | content | 4.006 | 9.5 |
| 1 | 1 | content | 3.976 | 9.0 |
| 2 | 0 | content | 3.999 | 9.4 |
| 2 | 1 | content | 4.011 | 9.8 |
Layer 0 heads are focused (low entropy, mixed behavior), while layers 1-2 develop broad content-based attention.
For predicting 'b' at position 6 in "To be, or not to be":
- Block 1 contributes +4.80 (promotes correct prediction)
- Block 0 contributes -8.95 (suppresses)
- Block 2 contributes -3.99 (fine-tunes)
transformer_lab/
├── attention.py # Multi-head attention + causal masking
├── embeddings.py # Token embeddings + sinusoidal positional encoding
├── model.py # Full transformer + LayerNorm + FeedForward
├── trainer.py # Training loop + Adam optimizer + loss
└── probe.py # Interpretability: attention probes, activation cache, logit attribution
experiments/
└── train_and_analyze.py # Full training + generation + analysis pipeline
tests/ # 61 tests
├── test_attention.py
├── test_model.py
├── test_probe.py
└── test_trainer.py
Using PyTorch or TensorFlow hides the mechanics behind abstractions. This implementation makes every operation visible:
- See exactly how `Q @ K.T / sqrt(d_k)` computes attention scores
- Trace gradients through softmax, layer norm, and GELU by hand
- Understand why residual connections and layer norm matter for training stability
- Inspect how each attention head develops different behaviors
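The attention-score computation above fits in a few lines of plain NumPy. This sketch mirrors the repo's explicit style but is not its exact code:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarities
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)            # block attention to future
    scores -= scores.max(-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(-1, keepdims=True)        # softmax over keys
    return weights @ V, weights
```

Each query row of `weights` is a probability distribution over keys at or before it, so position 0 can only attend to itself.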
- Python 3.10+
- NumPy — all matrix operations
- pytest — 61 tests
MIT