
Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090#1125

Open
jainpranjal97 wants to merge 1 commit into openai:main from jainpranjal97:submission/xsa-all-qkgain4-lnscale

Conversation

@jainpranjal97

Summary

Non-record submission: 1.1946 BPB on 1×RTX 5090 (60-min, 3699 steps). 45 systematic experiments exploring hyperparameter space and novel architectures.

Key findings for the community:

  • XSA on ALL layers beats XSA on only the last 4 (-0.0018 BPB). Every top entry restricts XSA to the deepest 3-4 layers, but in these experiments applying it to all layers was consistently better.
  • qk_gain_init = 4.0 (-0.0039 BPB vs default 1.5). Sharper attention patterns help small models significantly. Swept 1.5 → 2.0 → 3.0 → 4.0 with monotonic gains.
  • Warmdown calibration for wallclock-capped training (-0.0078 BPB). Default warmdown_iters=1200 means the LR never reaches full strength when wallclock-capped at 10 min.
  • Pre-quant vs post-quant divergence: XSA Gating (learned per-head gate) achieved 1.1932 pre-quant (better than best) but 1.1961 post int8+zlib (worse). Architectural choices that improve FP loss can degrade quantized loss.
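The warmdown-calibration finding can be sketched as a trapezoidal LR schedule. The shape and the parameter names (`warmup_iters`, `warmdown_iters`) are assumptions modeled on common speedrun-style schedules, not this PR's exact code:

```python
# Hedged sketch: linear warmup -> flat -> linear warmdown to zero.
# Shows why an uncalibrated warmdown_iters=1200 prevents the LR from
# ever reaching full strength in a short wallclock-capped run.

def lr_multiplier(step, total_steps, warmup_iters=100, warmdown_iters=1200):
    warmup = (step + 1) / warmup_iters
    warmdown = (total_steps - step) / warmdown_iters
    return max(0.0, min(warmup, 1.0, warmdown))

short_run = 1000  # illustrative step budget for a wallclock-capped run

# Default warmdown: warmdown overlaps warmup, so the peak multiplier < 1.0.
peak_default = max(lr_multiplier(s, short_run) for s in range(short_run))

# Warmdown calibrated to a fraction of the actual step budget restores
# a flat full-LR region.
calibrated = int(0.3 * short_run)
peak_calibrated = max(lr_multiplier(s, short_run, warmdown_iters=calibrated)
                      for s in range(short_run))
```

With the default, warmup (rising) and warmdown (falling) intersect below 1.0, so the schedule never plateaus at full LR; calibrating warmdown to the real step count fixes this.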

Novel approaches tested (all documented with negative results):

| Approach | ΔBPB | Why it failed |
| --- | --- | --- |
| Progressive Layer Growing (5→11L at 60%) | +0.0057 | 5-layer capacity ceiling before growth |
| Depth Recurrence 4×3 + LoRA16 | +0.0753 | torch.compile bypass + optimization conflicts |
| XSA Gating (learned per-head gate) | +0.0015 | Quantizes worse despite better FP loss |
| Cosine warmdown | +0.0039 | Linear warmdown already optimal |

Stack

11L, MLP 3×, Partial RoPE 16/64, LN Scale 1/√(layer+1), XSA all layers, LeakyReLU(0.5)², Muon WD 0.06, seq 2048, grad_clip 0.3, qk_gain 4.0, logit_softcap 20.

Full experiment log with 45 runs in the README.
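A toy comparison illustrating the qk_gain finding: a larger gain on the pre-softmax query-key scores concentrates attention on the best-matching key. The scores below and the scalar-gain placement are illustrative assumptions, not the PR's implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

scores = [0.9, 0.5, 0.1, -0.3]                     # toy QK scores
attn_default = softmax([1.5 * s for s in scores])  # default gain 1.5
attn_sharp   = softmax([4.0 * s for s in scores])  # swept optimum 4.0
# attn_sharp puts noticeably more probability mass on the top score.
```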

Test plan

  • Verified val_bpb 1.1946 on 1×RTX 5090 (60-min run)
  • All 45 experiments logged with reproducible configurations
  • train_gpt.py included and runnable

🤖 Generated with Claude Code

45 systematic experiments on consumer GPU. Key findings:
- XSA on ALL layers beats XSA on last 4 (-0.0018 BPB)
- qk_gain_init=4.0 significantly better than default 1.5 (-0.0039)
- Warmdown calibration critical for wallclock-capped training (-0.0078)
- 4 novel approaches tested and documented (PLG, depth recurrence, XSA gating, cosine warmdown)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 1, 2026
Architectural innovations from PR openai#1204 (1.1063 BPB record):
- QK_GAIN_INIT=4.0 (from PR openai#1125 sweep, -0.006 BPB)
- Parallel Residuals: dual-lane from physical layer 7+
  - Attn reads lane0, MLP reads lane1, learned cross-lane writes
  - parallel_post_lambdas [N,2,2], parallel_resid_lambdas [N,2]
- Mini Depth Recurrence: repeat layers 4,5 between encoder/decoder
  - Delayed activation at step 3000 (avoids disrupting early training)
  - Tied MLP weights (no extra params, keeps model within 16MB)
- Bigram dim reduced 128->112 for budget headroom
- Refactored forward into _run_backbone() for DRY encoder/decoder/parallel
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 1, 2026
3-seed mean 0.9300 BPB (std 0.0006), beats merged SOTA 1.1194 by 0.189.

Novel mechanisms: scored-position SLOT mask, per-sample delta [bsz,1,dim],
logit bias [bsz,1,vocab], training-data GPTQ calibration, cosine LR schedule.

Base: PR openai#1019. SLOT based on arXiv:2505.12392v2.
Adapted sigmoid-gated skips and Brotli from PR openai#1172, QK-Gain from PR openai#1125.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
