Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090 #1125
Open
jainpranjal97 wants to merge 1 commit into openai:main from
Conversation
45 systematic experiments on consumer GPU. Key findings:
- XSA on ALL layers beats XSA on last 4 (-0.0018 BPB)
- qk_gain_init=4.0 significantly better than default 1.5 (-0.0039)
- Warmdown calibration critical for wallclock-capped training (-0.0078)
- 4 novel approaches tested and documented (PLG, depth recurrence, XSA gating, cosine warmdown)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
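The warmdown finding above refers to shrinking the learning rate over the final stretch of a wallclock-capped run. A minimal sketch of such a schedule (the function name, the 30% warmdown fraction, and the cosine shape are assumptions for illustration; the PR's exact calibration is not shown here):

```python
import math

def lr_with_warmdown(step, total_steps, base_lr=0.02, warmdown_frac=0.3):
    """Hypothetical warmdown schedule: hold base_lr constant, then
    cosine-decay to zero over the final warmdown_frac of training.
    Constants are illustrative, not the PR's actual values."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    progress = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Calibrating `warmdown_frac` so the decay finishes exactly when the time budget runs out is what "warmdown calibration" would mean under this sketch.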
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request on Apr 1, 2026:
Architectural innovations from PR openai#1204 (1.1063 BPB record):
- QK_GAIN_INIT=4.0 (from PR openai#1125 sweep, -0.006 BPB)
- Parallel Residuals: dual-lane from physical layer 7+
  - Attn reads lane0, MLP reads lane1, learned cross-lane writes
  - parallel_post_lambdas [N,2,2], parallel_resid_lambdas [N,2]
- Mini Depth Recurrence: repeat layers 4,5 between encoder/decoder
  - Delayed activation at step 3000 (avoids disrupting early training)
  - Tied MLP weights (no extra params, keeps model within 16MB)
- Bigram dim reduced 128->112 for budget headroom
- Refactored forward into _run_backbone() for DRY encoder/decoder/parallel
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 1, 2026:
3-seed mean 0.9300 BPB (std 0.0006), beats merged SOTA 1.1194 by 0.189. Novel mechanisms:
- scored-position SLOT mask
- per-sample delta [bsz,1,dim]
- logit bias [bsz,1,vocab]
- training-data GPTQ calibration
- cosine LR schedule

Base: PR openai#1019. SLOT based on arXiv:2505.12392v2. Adapted sigmoid-gated skips and Brotli from PR openai#1172, QK-Gain from PR openai#1125.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Non-record submission: 1.1946 BPB on 1×RTX 5090 (60-min, 3699 steps). 45 systematic experiments exploring hyperparameter space and novel architectures.
Key findings for the community:
- XSA on all layers beats XSA on the last 4 layers (-0.0018 BPB)
- qk_gain_init=4.0 significantly better than the default 1.5 (-0.0039 BPB)
- Warmdown calibration is critical for wallclock-capped training (-0.0078 BPB)

Novel approaches tested (all documented with negative results): PLG, depth recurrence, XSA gating, cosine warmdown.
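The qk_gain finding is consistent with a QK-norm style attention, where queries and keys are unit-normalized and rescaled by a gain before the dot product. A minimal sketch under that assumption (the function name is hypothetical, and the gain is shown as a constant rather than the learned parameter it would be in training):

```python
import numpy as np

def qk_norm_attention_logits(q, k, qk_gain=4.0):
    """Sketch of QK-normed attention logits: L2-normalize queries and
    keys along the head dimension, then scale the dot product by a gain.
    In the PR, qk_gain is presumably learned and initialized at 4.0;
    here it is a fixed constant for illustration."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return qk_gain * (q @ k.swapaxes(-1, -2))
```

With unit-norm q and k, each logit lies in [-qk_gain, qk_gain], so the init value directly sets the sharpness of the softmax at the start of training.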
Stack
11L, MLP 3×, Partial RoPE 16/64, LN Scale 1/√(layer+1), XSA all layers, LeakyReLU(0.5)², Muon WD 0.06, seq 2048, grad_clip 0.3, qk_gain 4.0, logit_softcap 20.
Full experiment log with 45 runs in the README.
Test plan
🤖 Generated with Claude Code