Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds) #450
Open
zachgoldfine44 wants to merge 1 commit into openai:main
Conversation
…T (val_bpb=1.1466, mean 3 seeds) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuaswarren added a commit to joshuaswarren/parameter-golf that referenced this pull request on Mar 22, 2026
…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

First submission combining 6 independently-proven architecture improvements:
- Catalytic Residuals (PR openai#450, -0.024 bpb)
- Value Residual/ResFormer (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048)
- 12 Layers (-0.023 bpb vs 11L)
- 3x MLP

8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)
Catalytic residuals are lowkey genius: learning a per-dimension scale on the residual for basically free params. 12 layers is interesting too; most people stopped at 11. Did you have any issues with artifact size going that deep?
12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT
val_bpb: 1.14662 (mean of 3 seeds, sliding window stride=64, post int6+zstd quantization roundtrip)
3-Seed Results
Key Innovations
- Catalytic Residual Connections (novel): replace `x + f(x)` with `x + c * f(x)`, where `c` is a learned per-dimension vector. -0.024 bpb at zero computational overhead (~11K extra params).
- 12 Layers: the standard stack uses 10-11 layers, leaving significant budget headroom. 12L is the depth sweet spot (-0.023 bpb vs 11L).
- BigramHash(10240): larger bigram vocabulary (-0.070 bpb vs BigramHash(2048)).
- Late QAT (threshold=0.25): STE int6 quantization in the final portion of training.
- SWA: stochastic weight averaging over the last 20% of warmdown.
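The catalytic residual idea can be sketched in a few lines; names and shapes below are illustrative, not the PR's actual code:

```python
import numpy as np

def catalytic_residual(x, f_out, c):
    # Standard residual: x + f(x). Catalytic variant: x + c * f(x),
    # where c is a learned vector with one scale per model dimension.
    # Initializing c to ones recovers the plain residual at step 0.
    return x + c * f_out

d_model = 8
x = np.ones(d_model)
f_out = np.full(d_model, 0.5)   # stand-in for a block's output f(x)
c = np.ones(d_model)            # learned parameter: ~d_model extra params per connection
y = catalytic_residual(x, f_out, c)
```

The parameter cost is just one `d_model`-sized vector per residual connection, which across 12 layers with a couple of connections each is on the order of the ~11K extra params the PR reports.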
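A minimal sketch of a hashed bigram embedding of the kind BigramHash(10240) implies; only the bucket count comes from the PR, while the mixing constant and lookup details are assumptions:

```python
import numpy as np

N_BINS = 10240  # bigram hash buckets, per the PR

def bigram_hash(prev_tok, tok, n_bins=N_BINS):
    # Hypothetical mixing: any fixed large prime multiplier works as a sketch.
    return (prev_tok * 1000003 + tok) % n_bins

def bigram_embed(tokens, table):
    # Position t (t >= 1) looks up the embedding for the token pair (t-1, t),
    # so collisions are possible but the table stays small.
    idx = [bigram_hash(p, t, table.shape[0]) for p, t in zip(tokens, tokens[1:])]
    return table[idx]

table = np.zeros((N_BINS, 16))        # learned table; 16-dim embeddings are illustrative
emb = bigram_embed([5, 7, 9], table)  # one row per bigram: (5,7) and (7,9)
```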
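Straight-through int6 fake quantization of the kind "Late QAT" describes can be sketched as below; only int6 and threshold=0.25 come from the PR, while the scale choice and schedule are assumptions:

```python
import numpy as np

def fake_quant_int6(w, scale):
    # Round-trip through int6 ([-32, 31]) on the forward pass. In training,
    # the straight-through estimator (STE) treats the rounding as identity
    # for gradients; only the forward computation is shown here.
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale

def maybe_quantize(w, step, total_steps, threshold=0.25):
    # Hypothetical schedule: enable fake quantization once training enters
    # its final `threshold` fraction, so most of training runs full precision
    # and the network only has to adapt to int6 near the end.
    if step >= (1.0 - threshold) * total_steps:
        scale = np.abs(w).max() / 31.0 + 1e-12
        return fake_quant_int6(w, scale)
    return w
```

Training this way means the int6+zstd artifact round-trip mentioned in the header costs little at eval time, since the weights were already living on the quantization grid.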
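The SWA step is a running mean over late checkpoints. Which checkpoints get averaged (the last 20% of warmdown) is stated above; the incremental update below is a standard formulation, not taken from the PR:

```python
import numpy as np

def swa_update(avg, w, n_averaged):
    # Incremental mean: after this call, avg equals the mean of the
    # n_averaged + 1 checkpoint weight vectors seen so far.
    return avg + (w - avg) / (n_averaged + 1)

avg = np.zeros(4)
checkpoints = [np.full(4, v) for v in (1.0, 2.0, 3.0)]  # stand-in weight snapshots
for n, w in enumerate(checkpoints):
    avg = swa_update(avg, w, n)
# avg now holds the mean of the three checkpoints
```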
Run Command
All parameters set as defaults. No env vars needed.
Built on PR #180 standard stack by @thwu1.
Full logs for all 3 seeds included.