Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)#456
Christopher-Lee-McClendon wants to merge 2 commits into openai:main.
- 10-layer GPT with depth recurrence, BigramHash, SmearGate, XSA, U-Net skips
- Mixed int5/int6 quantization + zstd-22 compression (15.9 MB artifact)
- Competition-legal score-first TTT: scores each chunk before training on it
- val_bpb: 1.1532 (pre-TTT: 1.1600)
- Trained on 4xA100-40GB, 5200 steps, 2283 s training + 458 s eval
Legal Score-First TTT (10L, 1.1532 BPB)
10-layer GPT with competition-legal score-first full-model test-time training,
mixed int5/int6 quantization, and community-standard architecture components.
What's novel
The main contribution is competition-legal full-model TTT integrated into
sliding-window evaluation. Prior legal TTT work (PR #77) used per-document
LoRA adapters with resets. This submission replaces that with a chunked
score-first loop over all 25.5 M parameters — no LoRA, no adapter resets
between documents — giving the model persistent memory across the entire
validation set.
`eval_val_sliding_ttt()` divides validation into 32 k-token chunks, scores each chunk first (satisfying the "already graded" rule), then trains with one
AdamW step per chunk. Cosine LR decay across chunks prevents catastrophic
forgetting. Improvement: 1.1600 → 1.1532 BPB (−0.0068).
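The chunked score-first loop described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `score_fn`, `train_step_fn`, and the pre-split `chunks` are hypothetical stand-ins for the internals of `eval_val_sliding_ttt()`, and the base LR value is made up.

```python
import math

def cosine_lr(chunk_idx, n_chunks, base_lr):
    """Cosine decay of the TTT learning rate across chunks, easing later
    updates so earlier adaptation is not catastrophically overwritten."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * chunk_idx / n_chunks))

def eval_val_sliding_ttt(chunks, score_fn, train_step_fn, base_lr=1e-4):
    """Score each chunk BEFORE training on it, so no chunk is ever graded
    by a model that has already trained on it (the 'already graded' rule).
    Returns token-weighted average bits-per-byte over all chunks."""
    total_bpb, total_tokens = 0.0, 0
    for i, chunk in enumerate(chunks):
        bpb = score_fn(chunk)                  # grade first: competition-legal
        total_bpb += bpb * len(chunk)
        total_tokens += len(chunk)
        lr = cosine_lr(i, len(chunks), base_lr)
        train_step_fn(chunk, lr)               # one optimizer step, full model
    return total_bpb / total_tokens
```

Because updates persist across chunks with no resets, later chunks are scored by a model that has adapted to everything before them, which is where the 1.1600 → 1.1532 improvement comes from.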
Architecture summary
10 layers, d_model=512, 8 heads / 4 KV heads (GQA 2:1), 3× relu² MLP,
BigramHash(10240), SmearGate, XSA on the last 3 layers, U-Net skip connections.
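For reference, the 2:1 GQA grouping maps query heads onto shared KV heads like this (a small sketch; the function name is illustrative, not from the submission's code):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """GQA: each KV head serves n_heads // n_kv_heads consecutive query
    heads. With 8 query heads and 4 KV heads, the group size is 2, so
    query heads (0,1) share KV head 0, (2,3) share KV head 1, and so on."""
    group_size = n_heads // n_kv_heads
    return q_head // group_size
```

Halving the KV heads halves the K/V projection parameters and KV-cache size relative to full multi-head attention, which matters in a parameter-golf setting.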
Depth recurrence infrastructure exists in the code but is not active
(`unique_layers = num_layers = 10`).
Training recipe
Muon + AdamW, lr 0.025/0.035/0.025 (matrices/embeddings/scalars),
786,432 tokens/step, 20 warmup → 3,000 warmdown steps, SWA from step 4,650,
late QAT, GPTQ-lite on 75% of layers, zstd-22 compression.
TTT details
Chunked score-first loop over all 25.5 M parameters: each 32 k-token chunk is
scored, then trained on with a single AdamW step, with cosine LR decay across
chunks and no resets between documents.
Credits
This submission builds on work from many contributors to the parameter-golf competition:
- Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.