Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)#456
Christopher-Lee-McClendon wants to merge 2 commits into openai:main.
- 10-layer GPT with depth recurrence, BigramHash, SmearGate, XSA, U-Net skips
- Mixed int5/int6 quantization + zstd-22 compression (15.9 MB artifact)
- Competition-legal score-first TTT: scores each chunk before training on it
- val_bpb: 1.1532 (pre-TTT: 1.1600)
- Trained on 4xA100-40GB, 5200 steps, 2283 s training + 458 s eval
Legal Score-First TTT (10L, 1.1532 BPB)
10-layer GPT with competition-legal score-first full-model test-time training,
mixed int5/int6 quantization, and community-standard architecture components.
What's novel
The main contribution is competition-legal full-model TTT integrated into
sliding-window evaluation. Prior legal TTT work (PR #77) used per-document
LoRA adapters with resets. This submission replaces that with a chunked
score-first loop over all 25.5 M parameters — no LoRA, no adapter resets
between documents — giving the model persistent memory across the entire
validation set.
`eval_val_sliding_ttt()` divides validation into 32 k-token chunks, scores each chunk first (satisfying the "already graded" rule), then trains with one
AdamW step per chunk. Cosine LR decay across chunks prevents catastrophic
forgetting. Improvement: 1.1600 → 1.1532 BPB (−0.0068).
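The chunked score-first loop described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `score_fn`, `train_step_fn`, and the pre-split `chunks` are hypothetical stand-ins for the internals of `eval_val_sliding_ttt()`, and the base LR value is made up.

```python
import math

def cosine_lr(chunk_idx, n_chunks, base_lr):
    """Cosine decay of the TTT learning rate across chunks, easing later
    updates so earlier adaptation is not catastrophically overwritten."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * chunk_idx / n_chunks))

def eval_val_sliding_ttt(chunks, score_fn, train_step_fn, base_lr=1e-4):
    """Score each chunk BEFORE training on it, so no chunk is ever graded
    by a model that has already trained on it (the 'already graded' rule).
    Returns token-weighted average bits-per-byte over all chunks."""
    total_bpb, total_tokens = 0.0, 0
    for i, chunk in enumerate(chunks):
        bpb = score_fn(chunk)                  # grade first: competition-legal
        total_bpb += bpb * len(chunk)
        total_tokens += len(chunk)
        lr = cosine_lr(i, len(chunks), base_lr)
        train_step_fn(chunk, lr)               # one optimizer step, full model
    return total_bpb / total_tokens
```

Because updates persist across chunks with no resets, later chunks are scored by a model that has adapted to everything before them, which is where the 1.1600 → 1.1532 improvement comes from.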
Architecture summary
10 layers, d_model=512, 8 heads / 4 KV heads (GQA 2:1), 3× relu² MLP,
BigramHash(10240), SmearGate, XSA on the last 3 layers, U-Net skip connections.
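For reference, the 2:1 GQA grouping maps query heads onto shared KV heads like this (a small sketch; the function name is illustrative, not from the submission's code):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """GQA: each KV head serves n_heads // n_kv_heads consecutive query
    heads. With 8 query heads and 4 KV heads, the group size is 2, so
    query heads (0,1) share KV head 0, (2,3) share KV head 1, and so on."""
    group_size = n_heads // n_kv_heads
    return q_head // group_size
```

Halving the KV heads halves the K/V projection parameters and KV-cache size relative to full multi-head attention, which matters in a parameter-golf setting.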
Depth recurrence infrastructure exists in the code but is not active
(`unique_layers = num_layers = 10`).
Training recipe
Muon + AdamW, lr 0.025/0.035/0.025 (matrices/embeddings/scalars),
786,432 tokens/step, 20 warmup → 3,000 warmdown steps, SWA from step 4,650,
late QAT, GPTQ-lite on 75% of layers, zstd-22 compression.
TTT details
Chunked score-first loop over all 25.5 M parameters: each 32 k-token chunk is
scored, then trained on with a single AdamW step, with cosine LR decay across
chunks and no resets between documents.
Credits
This submission builds on work from many contributors to the parameter-golf competition:
- Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.