Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean) #175
Open
anthony-maio wants to merge 2 commits into openai:main from
Conversation
…toneInit) Combines the two strongest orthogonal improvements that haven't been stacked:

- SOTA training (1.1748): 10L, Muon WD, FP16 tied embed, spectral init, sliding window
- TTT eval (1.1928 on baseline): per-document LoRA adaptation, document isolation

Modifies CausalSelfAttention, Block, and GPT forward methods to accept optional LoRA delta parameters while preserving default training behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Pull request overview
Adds LoRA test-time training (TTT) evaluation on top of the existing SOTA training recipe in the records/track_10min_16mb track, aiming to report both standard sliding-window validation and per-document LoRA-adapted validation results.
Changes:
- Extends the model forward path to accept optional batched LoRA deltas (Q/V + LM head) used only during evaluation.
- Adds batched per-document LoRA adaptation and a document-isolated, chunked TTT evaluation routine.
- Runs sliding-window eval after int8+zlib roundtrip, then runs TTT LoRA eval and logs both metrics.
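The optional batched LoRA deltas described above could look like the following minimal sketch — the function name, delta layout, and shapes are illustrative assumptions, not the PR's exact code; the idea is that passing no delta reproduces the default (training-time) behavior exactly.

```python
import torch
import torch.nn.functional as F

def lora_linear(x, weight, delta=None):
    """Linear layer with an optional low-rank (LoRA) perturbation.

    `delta` is an (A, B) pair of low-rank factors, so the effective weight
    is W + B @ A. With delta=None the output is the unmodified projection,
    preserving default training behavior.
    """
    y = F.linear(x, weight)
    if delta is not None:
        A, B = delta  # A: (r, d_in), B: (d_out, r)
        y = y + F.linear(F.linear(x, A), B)
    return y
```

During per-document adaptation only A and B would receive gradients, so each document's adapter stays tiny and the base weights are shared across the batch of documents.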
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/train_gpt.py | Implements TTT LoRA adapters + evaluation and integrates it after the existing quantized sliding-window eval. |
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/submission.json | Adds submission metadata for the new record entry. |
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/README.md | Documents the combined recipe, expected results, and reproduction command. |
Replaces original TTT LoRA submission with current validated stack.

3-seed: 1.1234 / 1.1225 / 1.1228 = mean 1.1229 (std 0.0005)
All artifacts under 16MB. All logs attached.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
val_bpb = 1.1229 (3-seed mean, std 0.0005) | ~15.89 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All 3 artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovations
LeakyReLU(0.5)²: One-line activation swap that preserves negative gradient flow through the MLP. ~−0.002 BPB vs standard ReLU². Credit: PR #493 @parinzee, PR #518 @sofiabod.
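One plausible reading of the swap (the literal "square of LeakyReLU with slope 0.5"; the record's exact MLP placement may differ) is:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x, negative_slope=0.5):
    """Squared LeakyReLU activation.

    Standard ReLU**2 has zero gradient for all x < 0; here the negative
    branch is (0.5*x)**2, whose derivative 2*0.25*x = 0.5*x is nonzero,
    so negative pre-activations keep flowing gradient through the MLP.
    """
    return F.leaky_relu(x, negative_slope).square()
```

It is a genuine one-line change from `F.relu(x).square()`.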
Value Residual Learning (VRL): Layer 0's V output blended into all subsequent attention layers via learned sigmoid gates. Combats attention concentration (ResFormer, arXiv:2410.17897). +10 scalar params. Credit: PR #569 @gowtham0992.
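A minimal sketch of the gated blend, assuming one learned scalar gate per layer (consistent with the "+10 scalar params" for layers after layer 0); the class and parameter names are illustrative, not the record's identifiers:

```python
import torch
import torch.nn as nn

class ValueResidualGate(nn.Module):
    """ResFormer-style value residual (arXiv:2410.17897).

    Each attention layer after layer 0 blends its own V with layer 0's V
    through a learned sigmoid gate -- a single scalar parameter per layer.
    """
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 blend

    def forward(self, v, v0):
        lam = torch.sigmoid(self.gate)
        return lam * v + (1.0 - lam) * v0  # mix current V with layer-0 V
```

Because layer 0's values are re-injected everywhere, later layers are less pressured to concentrate attention on a few tokens just to recover early-layer information.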
lzma compression: Stdlib replacement for zstd-22, compresses 2-5% tighter on quantized weights. Recovers ~300-500KB headroom, enabling full MLP 3× + BigramHash 2048 under 16MB without capacity cuts. No external dependencies.
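The artifact roundtrip reduces to two stdlib calls; the preset below is an assumption about the record's settings, but no third-party package is needed either way:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress quantized weight bytes with stdlib LZMA.

    preset 9 + PRESET_EXTREME trades compression time for ratio, which is
    the right trade when the binding constraint is the 16MB artifact cap.
    """
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def decompress_artifact(blob: bytes) -> bytes:
    """Lossless inverse; the quantized weights come back bit-exact."""
    return lzma.decompress(blob)
```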
Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
Credits
Test plan