Record: 11L VR + GA + LeakyReLU² + Legal Score-First TTT (val_bpb=pending)#490
Draft
amaljithkuttamath wants to merge 4 commits into openai:main from
Conversation
Value Residual + Gated Attention on PR openai#442 stack. Single seed (1337), 8xH100 SXM, 14.2 MB artifact.
sahiee-dev added a commit to sahiee-dev/parameter-golf that referenced this pull request on Mar 23, 2026
Full stack on thwu1 base (1.1428):
- Value Residual: lambda_v * v0 shortcut to every block, init=0
- Gated Attention: learned scalar gate on attn output, init=1
- XSA: orthogonal self-value removal, last 4 layers
- EMA: decay=0.9999 shadow model used at final eval
- AdamW TTT: lr=0.001, 3 epochs on val tokens before eval
- BigramHash(10240): restored to full size after ablation

Techniques consistent with PR openai#490 (1.0891) and PR openai#486 (1.0887). Expected range: 1.08-1.10 on 8xH100. Trigram ablation confirmed negative at small scale; removed.
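The Value Residual and Gated Attention terms above are each a single learned scalar per block. A minimal numpy sketch of the per-block math (function names hypothetical; the additive `v + lambda_v * v0` form is one plausible reading of the "shortcut" wording):

```python
import numpy as np

def value_residual(v, v0, lambda_v=0.0):
    # Shortcut from the first block's value projection v0 into this
    # block's values. lambda_v is learned per block and initialized to 0,
    # so training starts from the unmodified baseline.
    return v + lambda_v * v0

def gated_attention(attn_out, gate=1.0):
    # Learned scalar gate on the attention output; init=1 makes the
    # gate an identity at the start of training.
    return gate * attn_out

v0 = np.ones(4)       # block-0 value projection
v = np.full(4, 3.0)   # current block's value projection
print(value_residual(v, v0, lambda_v=0.5))    # -> [3.5 3.5 3.5 3.5]
print(gated_attention(np.full(4, 2.0), 0.5))  # -> [1. 1. 1. 1.]
```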
Pre-eval TTT was non-compliant per issue openai#402. This run now uses score-first TTT: each validation chunk is scored before being trained on. Also added LeakyReLU(0.5)² in place of relu² (proven by openai#569 and openai#535). Score pending rerun with compute credits.
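The compliance point is purely one of ordering: each chunk's loss must be recorded before the optimizer has seen that chunk. A minimal sketch, where `score_fn`/`train_fn` are hypothetical stand-ins for the model's eval and AdamW update steps:

```python
def score_first_ttt(chunks, score_fn, train_fn):
    # Legal test-time training: score a validation chunk first, then
    # train on it. No chunk's score ever benefits from an update that
    # has already seen that chunk.
    scores = []
    for chunk in chunks:
        scores.append(score_fn(chunk))  # score BEFORE training
        train_fn(chunk)                 # then update on the same chunk
    return sum(scores) / len(scores)

# Toy demo: "training" halves the loss after each chunk.
state = {"loss": 1.0}
avg = score_first_ttt(
    [1, 2, 3],
    score_fn=lambda c: state["loss"],
    train_fn=lambda c: state.__setitem__("loss", state["loss"] * 0.5),
)
print(avg)  # -> 0.5833... (mean of 1.0, 0.5, 0.25)
```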
XSA + Partial RoPE + LN Scale + Late QAT + GPTQ + score-first TTT with temperature calibration. Untested; needs 1xH100 validation.
- Deep-clone the state dict before the bf16 calibration cast (it was silently casting the fp32 weights to bf16 in place before GPTQ, causing a 0.328 bpb gap)
- Keep tok_emb.weight as fp16 passthrough instead of int8 quantization
- Fix TTT eval: keep CastedLinear weights in fp32 for stable AdamW
- Remove torch.compile from the TTT chunk loop (re-compilation plus weight mutation caused a crash)
- Add quant diagnostics: GPTQ key matching and per-layer error stats
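The first bullet is easy to reproduce outside torch: a state dict holds shared references, so an in-place low-precision cast on a shallow copy corrupts the original. A minimal numpy sketch of the bug and the fix (names hypothetical):

```python
import copy
import numpy as np

# Stand-in for a torch state dict: values are shared array references.
state = {"w": np.full(3, 1.0001, dtype=np.float32)}

# BUG: dict() is a shallow copy, so this in-place half-precision
# calibration cast silently clobbers the original fp32 weights.
shallow = dict(state)
shallow["w"][:] = shallow["w"].astype(np.float16)
print(state["w"][0])  # 1.0 -- the fp32 original lost precision in place

# FIX: deep-clone before the cast; the original weights survive.
state = {"w": np.full(3, 1.0001, dtype=np.float32)}
safe = copy.deepcopy(state)
safe["w"][:] = safe["w"].astype(np.float16)
print(state["w"][0])  # 1.0001 -- still intact
```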
Summary
val_bpb = pending rerun | 8xH100 SXM | Legal score-first TTT
Architecture improvements on the standard 11L stack: Value Residual + Gated Attention + LeakyReLU(0.5)², with legal score-first TTT.
Update (Mar 23)
Initial submission used pre-eval TTT (non-compliant per #402). This update switches to legal score-first TTT: score each validation chunk before training on it. Score pending rerun with compute credits.
Architecture
VR+GA ablated in PR #413 (-0.017 bpb combined). LeakyReLU² proven by PR #569 and #535.
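LeakyReLU(0.5)² reads as LeakyReLU with negative_slope=0.5 followed by squaring: it matches relu² on positive inputs but keeps gradient signal (0.25·x² below zero) instead of a dead region. A numpy sketch of that reading (hedged; the referenced PRs may use a sign-preserving variant instead):

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU(negative_slope) followed by squaring: identical to
    # relu^2 for x >= 0, while the negative branch contributes
    # (negative_slope * x)^2 instead of a hard zero.
    y = np.where(x >= 0.0, x, negative_slope * x)
    return y * y

print(leaky_relu_squared(np.array([-2.0, 0.0, 2.0])))  # -> [1. 0. 4.]
```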
Run command
All hyperparameters set as defaults.
Credits