
11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds) #333

Open
mahsumaktas wants to merge 2 commits into openai:main from mahsumaktas:submission/v2-11L-xsa-swa-1.1538

Conversation

@mahsumaktas

Summary

Mean val_bpb: 1.1565 (3 seeds) | Best: 1.1538 (seed 1337) | Artifact: ~15.9 MB

23 GPU runs on 8xH100 SXM5, systematically exploring XSA, EMA vs. SWA, depth recurrence, sequence-length curriculum, an LR/WD sweep, and MLP scaling.

Techniques

  • 11 transformer layers + XSA on last 4 layers
  • SmearGate + BigramHash(2048) + OrthoInit
  • INT6 per-row quantization + zstd-22 + FP16 tied embedding + Late-K FP16
  • SWA every 50 steps (fp32 accumulation) — bf16 causes catastrophic loss
  • Muon WD=0.04 + grad clip 0.3 + RoPE base 50K
  • Overtone SVD init + Phase-transition residual mixing
  • MLP 2.75x — sweet spot (3x exceeds 16MB with SmearGate at 11L)
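The SWA step above can be sketched as an incremental mean kept in an fp32 accumulator. This is a hypothetical sketch with numpy standing in for the actual tensors; `SWAAverager` and its names are illustrative, not from the repo. Accumulating in bf16 instead discards the low-order bits of the running mean, consistent with the catastrophic degradation noted in the bullet.

```python
import numpy as np

class SWAAverager:
    """Equal-weight stochastic weight averaging with an fp32 accumulator.

    Illustrative sketch: class and method names are hypothetical;
    the cadence mirrors SWA_EVERY=50 from the run command."""

    def __init__(self, every=50):
        self.every = every
        self.n = 0      # checkpoints averaged so far
        self.avg = {}   # param name -> fp32 running mean

    def maybe_update(self, step, params):
        if step % self.every != 0:
            return
        self.n += 1
        for name, p in params.items():
            # accumulate in fp32, never bf16
            p32 = np.asarray(p, dtype=np.float32)
            if name not in self.avg:
                self.avg[name] = p32.copy()
            else:
                # incremental mean: avg += (p - avg) / n
                self.avg[name] += (p32 - self.avg[name]) / self.n
```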

Results (3 seeds)

| Seed | Sliding BPB | Post-quant BPB | Artifact |
|------|-------------|----------------|----------|
| 1337 | 1.1538 | 1.1766 | 15.99 MB |
| 42   | 1.1565 | 1.1790 | 15.87 MB |
| 7    | 1.1593 | 1.1820 | 15.93 MB |
| Mean | 1.1565 | — | — |
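The post-quant BPB column reflects the INT6 export path listed under Techniques. A minimal sketch of symmetric per-row INT6 with one fp16 scale per row (function names are hypothetical; bit-packing and the zstd-22 stage are omitted, and the repo's actual scheme may differ):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Map each row of a 2-D weight matrix to integers in [-31, 31]
    with a per-row fp16 scale (6-bit symmetric)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int6_per_row(q, scale):
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale.astype(np.float32)
```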

Key Findings from 23 Runs

  • EMA(0.997) causes 0.14 BPB quant gap — SWA far better for our stack
  • 11L MLP 3x exceeds 16MB with SmearGate+BigramHash
  • SmearGate removal loses more than MLP 3x gains — bigram context matters
  • XSA needs GQA-compatible v expansion (repeat_interleave, bug found and fixed)
  • Seq curriculum doesn't work — SWA checkpoint incompatibility across seq lengths
  • Depth recurrence works but dim=640 too narrow; dim=768+ exceeds 16MB
  • Higher LR (0.03) improves BPB but worsens compression (larger weights)
  • Late QAT (from 75% of training) reduces the quant gap (0.023 → 0.006) but leaves fewer full-precision steps
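The GQA fix above comes down to expanding the grouped value heads so every query head indexes its own copy; `torch.repeat_interleave(v, n_heads // n_kv_heads, dim=1)` corresponds to `np.repeat` along the head axis. A hypothetical sketch (the real XSA module lives in the submission branch):

```python
import numpy as np

def expand_kv_for_gqa(v, n_heads):
    """Expand grouped KV heads for GQA-style attention.

    v: (batch, n_kv_heads, seq, head_dim). Each KV head is repeated
    n_heads // n_kv_heads times, so output head i reads KV head
    i // rep — equivalent to torch.repeat_interleave on dim=1."""
    b, n_kv, t, d = v.shape
    rep = n_heads // n_kv
    return np.repeat(v, rep, axis=1)  # (batch, n_heads, seq, head_dim)
```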

Run command

NUM_LAYERS=11 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=2048 \
TRAIN_SEQ_LEN=2048 TRAIN_BATCH_TOKENS=524288 MLP_MULT=2.75 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_WEIGHT_DECAY=0.04 WARMDOWN_ITERS=3000 \
SWA_ENABLED=1 SWA_EVERY=50 ROPE_BASE=50000 EVAL_STRIDE=64 \
python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py

Test plan

  • Runs reproducibly on 8xH100 SXM in under 10 minutes
  • Artifact under 16 MB (15.87-15.99 MB)
  • 3-seed validation (mean 1.1565, std 0.0028)
  • Sliding window eval completes within 10 minutes

Built with Claude Code

Mahsum and others added 2 commits March 20, 2026 10:19
…l_bpb=1.1754)

Combines 10 orthogonal improvements over the naive baseline:
- Per-row INT6 quantization + zstd-22 compression (13.98 MB artifact)
- FP16 tied embedding export (near-zero quantization gap)
- MLP 2.5x expansion
- SmearGate + BigramHash bigram-aware modules
- OrthoInit + muP scaling + phase-transition residual mixing
- Muon weight decay (0.02)
- Stochastic Weight Averaging (4 checkpoints)
- Sliding window evaluation (stride=64)
- Tuned hyperparameters (grad_clip=0.3, warmdown=3000)

8xH100 SXM, 9919 steps in 10 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major upgrade from V1 to V2 with 23 GPU runs on 8xH100:
- 11 layers (was 9) + XSA on last 4 layers
- MLP 2.75x (was 2.5x) — sweet spot for 16MB at 11L
- RoPE base 50K, LR 0.025, SmearGate + BigramHash(2048)
- SWA/50 with fp32 accumulation (bf16 catastrophic fix)
- OrthoInit + Overtone SVD + Phase-transition residual mixing
- INT6 + zstd-22 + FP16 tied embed + Late-K FP16

3-seed validation: 1.1538 / 1.1565 / 1.1593 (mean 1.1565, std 0.0028)
Artifact: 15.87-15.99 MB (all under 16MB)

23 runs tested: EMA, depth recurrence, seq curriculum, LR sweep,
WD sweep, QAT, MLP 3x — documented in README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods, validated M5 (val-CE) as most
reliable (rho=0.984). But all 80 shards have nearly identical bigram
statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard
variance is 535x larger than between-shard. Selected top 12% by bigram CE
and by 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model
perplexity. Mean delta: -0.0006, one seed regressed. 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades
diversity for match quality, and diversity wins. Corroborated by PRs openai#737,
openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
