I traced one sentence through a Transformer and showed every number.
You've seen nanoGPT. You've seen minGPT. You've even seen picoGPT.
But have you seen every number in a Transformer connected to the actual words?
TraceGPT is a Transformer where every matrix has word labels, every attention weight shows which word cares about which, and every prediction traces back to meaning. Pure Python + NumPy, zero PyTorch.
"the cat sat" β embeddings β attention β prediction β "on"
β β β β
words "cat"=0.9 "cat"β"sat" "on"=0.45
animal attention=0.40
What if every number in a Transformer had a name?
Interactive Demo: trace a sentence in your browser, no install needed.

    python -m levels.level6_showcase

Key Output:

             animal  action  location  size   emotion  grammar  time   concrete
    "the"  │  0.100   0.000     0.000  0.000    0.000    0.900  0.000     0.000 │
    "cat"  │  0.900   0.100     0.000  0.300    0.200    0.000  0.000     0.800 │
    "sat"  │  0.100   0.900     0.100  0.000    0.000    0.000  0.300     0.100 │

Reading this (see the code sketch after this list):
- "cat" = high animal (0.9) + concrete (0.8) → it's an animal!
- "sat" = high action (0.9) → it's a verb!
- "the" = high grammar (0.9) → it's a grammar word!
Attention Weights: "the cat sat"

           the        cat        sat
         ─────────────────────────────────
    the  ██████████                         │ 1.00  0.00  0.00 │  "the" sees only itself
    cat  ████        ██████                 │ 0.38  0.62  0.00 │  "cat" looks at itself most
    sat  ███         ████       ███         │ 0.29  0.40  0.31 │  "sat" looks at "cat"!

The model "knows" a cat is sitting: "sat" pays most attention to "cat" (40%).
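
If you want to reproduce a map like this by hand, here is a minimal NumPy sketch of causal self-attention over three toy word vectors. The embeddings are made up, and Q = K = X (identity projections) to keep the arithmetic calculator-friendly; a real Transformer uses separate learned Q/K/V projection matrices.

```python
import numpy as np

words = ["the", "cat", "sat"]
# Toy embeddings (illustrative values, one row per word)
X = np.array([
    [0.1, 0.0, 0.9, 0.0],   # the
    [0.9, 0.1, 0.0, 0.8],   # cat
    [0.1, 0.9, 0.0, 0.1],   # sat
])

d_k = X.shape[1]
# Identity projections (Q = K = X) keep the numbers hand-checkable
scores = X @ X.T / np.sqrt(d_k)

# Causal mask: each word may attend only to itself and earlier words
future = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores = np.where(future, -1e9, scores)

# Softmax over the key axis (last axis) turns scores into attention weights
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

for i, q in enumerate(words):
    row = "  ".join(f'"{k}"={weights[i, j]:.2f}' for j, k in enumerate(words))
    print(f'"{q}" attends to: {row}')
```
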
After "the cat sat" β What comes next?
dog ββββββββββββββββββββββββββββββ 0.2585 β predicted
cat ββββββββββββββββββββββββββββββ 0.1516
park ββββββββββββββββββββββββββββββ 0.1173
mat ββββββββββββββββββββββββββββββ 0.1027
The last word's output is most similar to "dog" (1.7305) β both are animals!
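
The prediction step can be sketched the same way: take the output vector at the last position, compare it to each vocabulary embedding with a dot product, and softmax the similarities. All vectors below are made up for illustration; they are not the showcase's actual values.

```python
import numpy as np

vocab = ["dog", "cat", "park", "mat"]
# Toy output embeddings for the vocabulary (illustrative values)
W_out = np.array([
    [0.9, 0.1, 0.0, 0.9],   # dog
    [0.9, 0.1, 0.0, 0.8],   # cat
    [0.0, 0.0, 0.2, 0.9],   # park
    [0.0, 0.0, 0.1, 0.8],   # mat
])

# Output vector at the last position ("sat"), again made up
h_last = np.array([1.0, 0.3, 0.0, 0.9])

logits = W_out @ h_last                  # similarity of h_last to each vocab word
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, logit, p in sorted(zip(vocab, logits, probs), key=lambda t: -t[2]):
    print(f"{word:>5}  similarity={logit:.3f}  p={p:.3f}")
```
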
Full trace: reports/level6_showcase.md
nanoGPT gives you working code. picoGPT gives you tiny code. TraceGPT gives you understanding:
- Words, not just numbers. Every matrix is labeled with actual tokens. You see "cat" = high animal, not just [0.9, 0.1, 0, ...].
- Attention heatmaps with words. See which word attends to which: "sat" looks at "cat" the most.
- No PyTorch. Pure NumPy. Every operation is transparent.
- Interactive web demo. Try it in your browser, zero install.
- Hand-verifiable. Tiny matrices you can check with a calculator.
- Bug library. 7 common Transformer bugs with wrong/correct implementations and tests.
- Full GPT model. Complete TinyGPT with multi-head attention and autoregressive generation.
TraceGPT is for anyone who has stared at a matrix multiplication and thought: "but what do these numbers actually MEAN?"

    git clone https://github.com/YuanyuanMa03/TraceGPT.git
    cd TraceGPT
    pip install -e .

    # Run Level 6 (showcase)
    python -m levels.level6_showcase

    # Run all tests
    pytest tests/ bugs/ -v

    TraceGPT/
    ├── tracegpt/               # Core library (pure NumPy)
    │   ├── tracer.py           # Traces every operation
    │   ├── ops.py              # Core ops with explanations
    │   └── report.py           # Markdown report generation
    ├── levels/                 # 6 progressive learning levels
    │   ├── level0...6          # From embedding → full GPT
    │   └── level6_showcase.py  # ← The killer demo
    ├── bugs/                   # 7 common Transformer bugs
    └── tests/                  # 87 tests, all passing

- Level 0: Embedding → prediction
- Level 1: Causal self-attention
- Level 2: Transformer block
- Level 3: Positional encoding
- Level 4: Multi-head attention
- Level 5: Full GPT + generation
- Level 6: Showcase (word-level traces)
| Feature | nanoGPT | picoGPT | TraceGPT |
|---|---|---|---|
| Pure NumPy | ❌ PyTorch | ✅ | ✅ |
| Word-labeled matrices | ❌ | ❌ | ✅ |
| Attention heatmaps with words | ❌ | ❌ | ✅ |
| Interactive demo | ❌ | ❌ | ✅ |
| Bug library | ❌ | ❌ | ✅ |
| Hand-verifiable | ❌ | ❌ | ✅ |
7 common Transformer bugs with wrong/correct code + tests:
- Softmax on wrong axis (sketched after this list)
- Causal mask reversed
- Missing √d_k scaling
- Wrong QΒ·K transpose
- Label shift bug
- Weight tying transpose error
- Generation loop truncation
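
As a taste of the wrong-vs-correct format, here is a minimal sketch of the first bug. It is a stand-alone illustration, not the repository's actual bugs/ code.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy attention scores (rows = queries, columns = keys)
scores = np.array([[2.0, 0.5, 0.1],
                   [0.3, 1.5, 0.2],
                   [0.4, 1.0, 0.9]])

wrong = softmax(scores, axis=0)    # BUG: normalizes each key column over queries
right = softmax(scores, axis=-1)   # correct: each query's weights sum to 1

print("row sums, wrong axis:  ", wrong.sum(axis=-1))   # generally not 1.0
print("row sums, correct axis:", right.sum(axis=-1))   # [1. 1. 1.]
```
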
- Readability over performance. No optimizations that obscure understanding.
- Traced, not hidden. Every operation is recorded and explainable.
- Hand-verifiable. Tiny matrices you can check with a calculator.
- Bugs are lessons. Common mistakes are first-class citizens.
- No magic. No framework abstractions between you and the math.
- Python + NumPy only. No PyTorch, TensorFlow, or JAX.
- No performance optimization. Clarity is the only metric.
- Every op exposes: formula, inputs, output, shape, explanation (a hypothetical record is sketched below).
- All examples use tiny matrices (typically 3Γ4 or smaller).
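
To illustrate the last two points, this is roughly what a traced-op record could look like. The field names and values here are hypothetical, chosen for this sketch; see tracegpt/tracer.py for the real format.

```python
import numpy as np

Q = np.array([[0.9, 0.1], [0.1, 0.9]])
K = np.array([[0.9, 0.1], [0.1, 0.9]])

# Hypothetical record layout -- illustrative only, not the library's actual API
trace_record = {
    "op": "scaled_dot_product",
    "formula": "scores = Q @ K.T / sqrt(d_k)",
    "inputs": {"Q": Q, "K": K},
    "output": Q @ K.T / np.sqrt(Q.shape[1]),   # [[0.58, 0.13], [0.13, 0.58]]
    "shape": (2, 2),
    "explanation": "How strongly each query word matches each key word.",
}
print(trace_record["output"].round(2))
```
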
- v0.4: Word-level traces, attention heatmaps, interactive demo
- v0.5: Training loop with backprop (educational)
- v1.0: Paper, documentation website
MIT License. See LICENSE.
If you've ever stared at a Transformer and thought "but what do these numbers actually MEAN?", TraceGPT is for you.