
11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds) #333

Open
mahsumaktas wants to merge 2 commits into openai:main from mahsumaktas:submission/v2-11L-xsa-swa-1.1538

Conversation

@mahsumaktas

Summary

Mean val_bpb: 1.1565 (3 seeds) | Best: 1.1538 (seed 1337) | Artifact: ~15.9 MB

23 GPU runs on 8xH100 SXM5, systematically exploring XSA, EMA vs. SWA, depth recurrence, sequence-length curriculum, an LR/WD sweep, and MLP scaling.

Techniques

  • 11 transformer layers + XSA on last 4 layers
  • SmearGate + BigramHash(2048) + OrthoInit
  • INT6 per-row quantization + zstd-22 + FP16 tied embedding + Late-K FP16
  • SWA every 50 steps (fp32 accumulation) — bf16 causes catastrophic loss
  • Muon WD=0.04 + grad clip 0.3 + RoPE base 50K
  • Overtone SVD init + Phase-transition residual mixing
  • MLP 2.75x — sweet spot (3x exceeds 16MB with SmearGate at 11L)
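The SWA step above can be sketched as an incremental mean kept in an fp32 accumulator. This is a hypothetical sketch with numpy standing in for the actual tensors; `SWAAverager` and its names are illustrative, not from the repo. Accumulating in bf16 instead discards the low-order bits of the running mean, consistent with the catastrophic degradation noted in the bullet.

```python
import numpy as np

class SWAAverager:
    """Equal-weight stochastic weight averaging with an fp32 accumulator.

    Illustrative sketch: class and method names are hypothetical;
    the cadence mirrors SWA_EVERY=50 from the run command."""

    def __init__(self, every=50):
        self.every = every
        self.n = 0      # checkpoints averaged so far
        self.avg = {}   # param name -> fp32 running mean

    def maybe_update(self, step, params):
        if step % self.every != 0:
            return
        self.n += 1
        for name, p in params.items():
            # accumulate in fp32, never bf16
            p32 = np.asarray(p, dtype=np.float32)
            if name not in self.avg:
                self.avg[name] = p32.copy()
            else:
                # incremental mean: avg += (p - avg) / n
                self.avg[name] += (p32 - self.avg[name]) / self.n
```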

Results (3 seeds)

| Seed | Sliding BPB | Post-quant BPB | Artifact |
|------|-------------|----------------|----------|
| 1337 | 1.1538 | 1.1766 | 15.99 MB |
| 42   | 1.1565 | 1.1790 | 15.87 MB |
| 7    | 1.1593 | 1.1820 | 15.93 MB |
| Mean | 1.1565 | — | — |
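The post-quant BPB column reflects the INT6 export path listed under Techniques. A minimal sketch of symmetric per-row INT6 with one fp16 scale per row (function names are hypothetical; bit-packing and the zstd-22 stage are omitted, and the repo's actual scheme may differ):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Map each row of a 2-D weight matrix to integers in [-31, 31]
    with a per-row fp16 scale (6-bit symmetric)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int6_per_row(q, scale):
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale.astype(np.float32)
```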

Key Findings from 23 Runs

  • EMA(0.997) causes 0.14 BPB quant gap — SWA far better for our stack
  • 11L MLP 3x exceeds 16MB with SmearGate+BigramHash
  • SmearGate removal loses more than MLP 3x gains — bigram context matters
  • XSA needs GQA-compatible v expansion (repeat_interleave, bug found and fixed)
  • Seq curriculum doesn't work — SWA checkpoint incompatibility across seq lengths
  • Depth recurrence works but dim=640 too narrow; dim=768+ exceeds 16MB
  • Higher LR (0.03) improves BPB but worsens compression (larger weights)
  • Late QAT (from 75% of training) reduces the quant gap (0.023 → 0.006) but leaves fewer full-precision steps
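The GQA fix above comes down to expanding the grouped value heads so every query head indexes its own copy; `torch.repeat_interleave(v, n_heads // n_kv_heads, dim=1)` corresponds to `np.repeat` along the head axis. A hypothetical sketch (the real XSA module lives in the submission branch):

```python
import numpy as np

def expand_kv_for_gqa(v, n_heads):
    """Expand grouped KV heads for GQA-style attention.

    v: (batch, n_kv_heads, seq, head_dim). Each KV head is repeated
    n_heads // n_kv_heads times, so output head i reads KV head
    i // rep — equivalent to torch.repeat_interleave on dim=1."""
    b, n_kv, t, d = v.shape
    rep = n_heads // n_kv
    return np.repeat(v, rep, axis=1)  # (batch, n_heads, seq, head_dim)
```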

Run command

NUM_LAYERS=11 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=2048 \
TRAIN_SEQ_LEN=2048 TRAIN_BATCH_TOKENS=524288 MLP_MULT=2.75 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_WEIGHT_DECAY=0.04 WARMDOWN_ITERS=3000 \
SWA_ENABLED=1 SWA_EVERY=50 ROPE_BASE=50000 EVAL_STRIDE=64 \
python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py

Test plan

  • Runs reproducibly on 8xH100 SXM in under 10 minutes
  • Artifact under 16 MB (15.87-15.99 MB)
  • 3-seed validation (mean 1.1565, std 0.0028)
  • Sliding window eval completes within 10 minutes

Built with Claude Code

Mahsum and others added 2 commits March 20, 2026 10:19
…l_bpb=1.1754)

Combines 10 orthogonal improvements over the naive baseline:
- Per-row INT6 quantization + zstd-22 compression (13.98 MB artifact)
- FP16 tied embedding export (near-zero quantization gap)
- MLP 2.5x expansion
- SmearGate + BigramHash bigram-aware modules
- OrthoInit + muP scaling + phase-transition residual mixing
- Muon weight decay (0.02)
- Stochastic Weight Averaging (4 checkpoints)
- Sliding window evaluation (stride=64)
- Tuned hyperparameters (grad_clip=0.3, warmdown=3000)

8xH100 SXM, 9919 steps in 10 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major upgrade from V1 to V2 with 23 GPU runs on 8xH100:
- 11 layers (was 9) + XSA on last 4 layers
- MLP 2.75x (was 2.5x) — sweet spot for 16MB at 11L
- RoPE base 50K, LR 0.025, SmearGate + BigramHash(2048)
- SWA/50 with fp32 accumulation (bf16 catastrophic fix)
- OrthoInit + Overtone SVD + Phase-transition residual mixing
- INT6 + zstd-22 + FP16 tied embed + Late-K FP16

3-seed validation: 1.1538 / 1.1565 / 1.1593 (mean 1.1565, std 0.0028)
Artifact: 15.87-15.99 MB (all under 16MB)

23 runs tested: EMA, depth recurrence, seq curriculum, LR sweep,
WD sweep, QAT, MLP 3x — documented in README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods, validated M5 (val-CE) as most
reliable (rho=0.984). But all 80 shards have nearly identical bigram
statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard
variance is 535x larger than between-shard. Selected top 12% by bigram CE
and by 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model
perplexity. Mean delta: -0.0006, one seed regressed. 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades
diversity for match quality, and diversity wins. Corroborated by PRs openai#737,
openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
