
Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)#456

Open
Christopher-Lee-McClendon wants to merge 2 commits into openai:main from Christopher-Lee-McClendon:submission/depth-recurrence-legal-ttt-10L
Conversation


@Christopher-Lee-McClendon commented Mar 22, 2026

Legal Score-First TTT (10L, 1.1532 BPB)

10-layer GPT with competition-legal score-first full-model test-time training,
mixed int5/int6 quantization, and community-standard architecture components.

| Metric | Value |
| --- | --- |
| val_bpb | 1.15321496 |
| Pre-TTT val_bpb | 1.1600 |
| Training | 5,200 steps, 2,283 s on 4×A100-40GB |
| Eval + TTT | 458 s |
| Artifact | 15,980,085 / 16,000,000 bytes |

What's novel

The main contribution is competition-legal full-model TTT integrated into
sliding-window evaluation. Prior legal TTT work (PR #77) used per-document
LoRA adapters with resets. This submission replaces that with a chunked
score-first loop over all 25.5 M parameters — no LoRA, no adapter resets
between documents — giving the model persistent memory across the entire
validation set.

eval_val_sliding_ttt() divides the validation set into 32,768-token chunks, scores
each chunk first (satisfying the "already graded" rule), then trains with one
AdamW step per chunk. Cosine LR decay across chunks prevents catastrophic
forgetting. Improvement: 1.1600 → 1.1532 BPB (−0.0068).
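The control flow of that loop can be sketched in a few lines (a minimal sketch: `score` and `train_step` are hypothetical callables standing in for the real model evaluation and AdamW update, which are not shown here):

```python
import math

def eval_val_sliding_ttt(score, train_step, chunks, lr_max=5e-4):
    """Score-first TTT sketch: each chunk contributes to the reported BPB
    *before* the single optimizer step on it, and nothing is reset between
    chunks, so adaptation persists across the whole validation set."""
    n = len(chunks)
    total_bits, total_bytes = 0.0, 0
    for i, chunk in enumerate(chunks):
        # 1) Score first: this chunk is "already graded" at this point.
        bits, nbytes = score(chunk)
        total_bits += bits
        total_bytes += nbytes
        # 2) Then adapt: one full-model step, cosine LR decay across chunks.
        lr = lr_max * 0.5 * (1.0 + math.cos(math.pi * i / max(n - 1, 1)))
        train_step(chunk, lr)
    return total_bits / total_bytes  # val_bpb
```

The key legality property is the ordering: by the time `train_step` sees a chunk, that chunk's score is already banked, so training can only help *future* chunks.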

Architecture summary

10 layers, d_model=512, 8 heads / 4 KV heads (GQA 2:1), 3× relu² MLP,
BigramHash(10,240), SmearGate, XSA on the last 3 layers, U-Net skip connections.
Depth-recurrence infrastructure exists in the code but is not active
(unique_layers = num_layers = 10).
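As a quick sanity check on the GQA 2:1 choice, halving the KV heads halves the K/V projection parameters relative to full MHA. Back-of-the-envelope arithmetic (variable names are illustrative, not from the submission's code):

```python
# GQA 2:1 parameter arithmetic for the config above.
d_model, n_head, n_kv_head = 512, 8, 4
head_dim = d_model // n_head                    # 64

q_params  = d_model * n_head * head_dim         # 262,144 query-projection weights
kv_params = 2 * d_model * n_kv_head * head_dim  # 262,144 K+V weights under GQA
mha_kv    = 2 * d_model * n_head * head_dim     # 524,288 K+V weights under full MHA

saved_per_layer = mha_kv - kv_params            # 262,144 params saved per layer
total_saved = saved_per_layer * 10              # 2,621,440 across 10 layers
```

On a ~25.5M-parameter budget, that ~2.6M saving is a meaningful fraction, which is presumably why GQA earns its place here.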

Training recipe

Muon + AdamW, lr 0.025/0.035/0.025 (matrices/embeddings/scalars),
786,432 tokens/step, 20 warmup → 3,000 warmdown, SWA from step 4,650,
Late QAT, GPTQ-lite on 75% of layers, zstd-22 compression.
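For intuition on the artifact budget: 25.5M weights at roughly 5 bits each is about 15.9 MB before zstd, which lines up with the 15,980,085-byte artifact. Sub-byte codes like int5/int6 need explicit bit packing; a minimal sketch (illustrative only — the submission's actual quantizer and zstd stage are not shown):

```python
def pack_bits(values, bits):
    """Pack unsigned integers of width `bits` (e.g. 5 or 6) into bytes, LSB-first."""
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        acc |= (v & ((1 << bits) - 1)) << nbits
        nbits += bits
        while nbits >= 8:           # flush full bytes as they accumulate
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                       # flush any trailing partial byte
        out.append(acc & 0xFF)
    return bytes(out)

def unpack_bits(data, bits, count):
    """Inverse of pack_bits: recover `count` values of width `bits`."""
    acc, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < bits:
            acc |= next(it) << nbits
            nbits += 8
        out.append(acc & ((1 << bits) - 1))
        acc >>= bits
        nbits -= bits
    return out
```

The same `bits` parameter covers both the int5 and int6 groups of the mixed scheme; the packed stream is then what a zstd-22 pass would compress.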

TTT details

  • Score-first chunked loop (32,768 tokens/chunk, 1 epoch each)
  • AdamW lr=0.0005, full model unfrozen, cosine decay across chunks
  • Persistent adaptation (no resets between documents)

Credits

This submission builds on work from many contributors to the parameter-golf competition:

Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.

…10L)

- 10-layer GPT with depth recurrence, BigramHash, SmearGate, XSA, U-Net skips
- Mixed int5/int6 quantization + zstd-22 compression (15.9MB artifact)
- Competition-legal score-first TTT: scores each chunk before training on it
- val_bpb: 1.1532 (pre-TTT: 1.1600)
- Trained on 4xA100-40GB, 5200 steps, 2283s training + 458s eval
@Christopher-Lee-McClendon force-pushed the submission/depth-recurrence-legal-ttt-10L branch from ec44c24 to f5e802b on March 23, 2026 at 15:39