
Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090#1125

Open
jainpranjal97 wants to merge 1 commit into openai:main from jainpranjal97:submission/xsa-all-qkgain4-lnscale

Conversation

@jainpranjal97

Summary

Non-record submission: 1.1946 BPB on 1×RTX 5090 (60-min, 3699 steps). 45 systematic experiments exploring hyperparameter space and novel architectures.

Key findings for the community:

  • XSA on ALL layers beats XSA on only the last 4 (-0.0018 BPB). Every top entry restricts XSA to the deepest 3-4 layers, but in these experiments applying it to all layers was consistently better.
  • qk_gain_init = 4.0 (-0.0039 BPB vs default 1.5). Sharper attention patterns help small models significantly. Swept 1.5 → 2.0 → 3.0 → 4.0 with monotonic gains.
  • Warmdown calibration for wallclock-capped training (-0.0078 BPB). Default warmdown_iters=1200 means the LR never reaches full strength when wallclock-capped at 10 min.
  • Pre-quant vs post-quant divergence: XSA Gating (learned per-head gate) achieved 1.1932 pre-quant (better than best) but 1.1961 post int8+zlib (worse). Architectural choices that improve FP loss can degrade quantized loss.
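The warmdown-calibration finding can be sketched as a trapezoidal LR schedule. The shape and the parameter names (`warmup_iters`, `warmdown_iters`) are assumptions modeled on common speedrun-style schedules, not this PR's exact code:

```python
# Hedged sketch: linear warmup -> flat -> linear warmdown to zero.
# Shows why an uncalibrated warmdown_iters=1200 prevents the LR from
# ever reaching full strength in a short wallclock-capped run.

def lr_multiplier(step, total_steps, warmup_iters=100, warmdown_iters=1200):
    warmup = (step + 1) / warmup_iters
    warmdown = (total_steps - step) / warmdown_iters
    return max(0.0, min(warmup, 1.0, warmdown))

short_run = 1000  # illustrative step budget for a wallclock-capped run

# Default warmdown: warmdown overlaps warmup, so the peak multiplier < 1.0.
peak_default = max(lr_multiplier(s, short_run) for s in range(short_run))

# Warmdown calibrated to a fraction of the actual step budget restores
# a flat full-LR region.
calibrated = int(0.3 * short_run)
peak_calibrated = max(lr_multiplier(s, short_run, warmdown_iters=calibrated)
                      for s in range(short_run))
```

With the default, warmup (rising) and warmdown (falling) intersect below 1.0, so the schedule never plateaus at full LR; calibrating warmdown to the real step count fixes this.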

Novel approaches tested (all documented with negative results):

| Approach | ΔBPB | Why it failed |
| --- | --- | --- |
| Progressive Layer Growing (5→11L at 60%) | +0.0057 | 5-layer capacity ceiling before growth |
| Depth Recurrence 4×3 + LoRA16 | +0.0753 | torch.compile bypass + optimization conflicts |
| XSA Gating (learned per-head gate) | +0.0015 | Quantizes worse despite better FP loss |
| Cosine warmdown | +0.0039 | Linear warmdown already optimal |

Stack

11L, MLP 3×, Partial RoPE 16/64, LN Scale 1/√(layer+1), XSA all layers, LeakyReLU(0.5)², Muon WD 0.06, seq 2048, grad_clip 0.3, qk_gain 4.0, logit_softcap 20.

Full experiment log with 45 runs in the README.
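A toy comparison illustrating the qk_gain finding: a larger gain on the pre-softmax query-key scores concentrates attention on the best-matching key. The scores below and the scalar-gain placement are illustrative assumptions, not the PR's implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

scores = [0.9, 0.5, 0.1, -0.3]                     # toy QK scores
attn_default = softmax([1.5 * s for s in scores])  # default gain 1.5
attn_sharp   = softmax([4.0 * s for s in scores])  # swept optimum 4.0
# attn_sharp puts noticeably more probability mass on the top score.
```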

Test plan

  • Verified val_bpb 1.1946 on 1×RTX 5090 (60-min run)
  • All 45 experiments logged with reproducible configurations
  • train_gpt.py included and runnable

🤖 Generated with Claude Code

45 systematic experiments on consumer GPU. Key findings:
- XSA on ALL layers beats XSA on last 4 (-0.0018 BPB)
- qk_gain_init=4.0 significantly better than default 1.5 (-0.0039)
- Warmdown calibration critical for wallclock-capped training (-0.0078)
- 4 novel approaches tested and documented (PLG, depth recurrence, XSA gating, cosine warmdown)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 1, 2026
Architectural innovations from PR openai#1204 (1.1063 BPB record):
- QK_GAIN_INIT=4.0 (from PR openai#1125 sweep, -0.006 BPB)
- Parallel Residuals: dual-lane from physical layer 7+
  - Attn reads lane0, MLP reads lane1, learned cross-lane writes
  - parallel_post_lambdas [N,2,2], parallel_resid_lambdas [N,2]
- Mini Depth Recurrence: repeat layers 4,5 between encoder/decoder
  - Delayed activation at step 3000 (avoids disrupting early training)
  - Tied MLP weights (no extra params, keeps model within 16MB)
- Bigram dim reduced 128->112 for budget headroom
- Refactored forward into _run_backbone() for DRY encoder/decoder/parallel
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 1, 2026
3-seed mean 0.9300 BPB (std 0.0006), beats merged SOTA 1.1194 by 0.189.

Novel mechanisms: scored-position SLOT mask, per-sample delta [bsz,1,dim],
logit bias [bsz,1,vocab], training-data GPTQ calibration, cosine LR schedule.

Base: PR openai#1019. SLOT based on arXiv:2505.12392v2.
Adapted sigmoid-gated skips and Brotli from PR openai#1172, QK-Gain from PR openai#1125.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
