Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271) by jfprincz · Pull Request #287 · openai/parameter-golf

jfprincz · 2026-03-20T23:25:59Z

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)

val_bpb: 1.1271 (sliding window, stride=64) | 15.5 MB | 8xH100 SXM, 600s

Progress from prior submissions

	PR #70	PR #164	PR #198	This	Delta vs #198
val_bpb (sliding)	1.1659 (s256)	1.1524 (s256)	1.1318 (s64)	1.1271 (s64)	-0.0047
Layers	9	9	11	11	—
Params	21.8M	22.4M	26.8M	26.8M	—
Artifact	14.9 MB	15.4 MB	15.7 MB	15.5 MB	-0.2 MB

Two new techniques on top of PR #198's 11-layer stack.

Key additions over PR #198

Change	Impact
Exclusive Self Attention (XSA) on last 4 layers	Removes self-value bias from attention output via orthogonal projection. Zero new parameters.
EMA replacing SWA (decay=0.997)	Exponential moving average every step instead of periodic SWA checkpoints. Smoother weight averaging, better generalization and compression.

Everything else from PR #198 carries forward: 11 layers, OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.

Results

Metric	Value
Pre-quant val_bpb	1.1427
Int6 roundtrip val_bpb	1.1494
Int6 sliding val_bpb (s64)	1.1271
Steps completed (600s cap)	7,103
Step time	84ms
Model params	26,829,913
Artifact size	15,534,645 bytes

Reproducibility (3 seeds)

Seed	Steps	Sliding s64	Artifact
1337	7,103	1.1271	15.53 MB
42	7,094	1.1286	15.75 MB
2025	7,107	1.1284	15.65 MB

Mean: 1.1280 | Variance: 0.0015 | Submitted: seed 1337

Run command

NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

mohosy · 2026-03-20T23:59:26Z

nice find on ema over swa, that makes alot of sense for smoother convergence. did you try different decay values or was 0.997 just the first thing that worked

Adds TTT (3-epoch SGD on val data) to jfprincz's openai#287 base (1.1271). TTT is eval-time only so artifact size stays at ~15.5MB. Projected score: ~1.122-1.124. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

himanalot · 2026-03-21T00:11:38Z

@mohosy
was def tuned based on smaller runs.

jfprincz · 2026-03-21T00:17:29Z

nice find on ema over swa, that makes alot of sense for smoother convergence. did you try different decay values or was 0.997 just the first thing that worked

tried 0.999, averaged too slowly and hurt BPB. 0.997 was better i so kept it

4 proven optimizations from merged leaderboard entries: - train_seq_len 1024→2048 (both top entries use this) - SWA every 50 steps, start_frac=0.4 (swept optimal by openai#2 author) - grad_clip_norm 0.3 (both top entries use this) - LRs 0.025/0.025/0.035 (from PR openai#287) BigramHash stays at 2048 to avoid artifact size risk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

himanalot · 2026-03-21T00:57:42Z

xsa and ema lowkey hurt mine, was getting similar numbers to urs but they boosted mine back up

himanalot · 2026-03-21T00:58:13Z

i took a lot of diff approaches from u tho

- Add leaderboard table: jfprincz 1.1271 is new target; mohosy racing same stack - Add Reptile meta-TTT finding (PR openai#296): 10x better than naive TTT with SmearGate; error-guided TTT is negative; 13L crossover point identified - Add SWA checkpoint count finding (PR openai#238): 84 checkpoints reverses quant gap; explains why our WD=1200 SWA showed no effect - Update jfprincz entry to include PR openai#287 results (1.1271) - Add meta-lessons 10 and 11

himanalot · 2026-03-21T01:20:49Z

im switching off transformers
gonna start from scratch but lowkey might be worth to make my own mechanism (maybe i'll prove myself wrong tho)

…erparams Aggressive iteration to beat SOTA 1.1428. Changes from SmearGate v5: - XSA: exclusive self-attention on last 4 layers (zero-param, from PR openai#287) - EMA decay=0.997 replaces SWA (continuous averaging, from PR openai#287) - train_seq_len 1024→2048 - grad_clip_norm 0.3 - LRs 0.025/0.025/0.035 - WD 0.04, 11 layers, BigramHash 2048 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR openai#287 (SOTA val_bpb=1.1271) with Overtone embedding init added. Overtone reshapes tok_emb singular values to power-law decay for better int6 quantization. No KURE, R2, tanh, or TTT — clean minimal delta over proven SOTA. Modal launcher with CLI flags for EMA decay, num_layers, and XSA sweep. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Comprehensive analysis of Parameter Golf techniques across 3 rounds of deep research, validated against 100+ PRs from openai/parameter-golf. Covers SOTA stack (PR openai#287, 1.1271 BPB), proven techniques (TTT, XSA, EMA), debunked approaches (LZMA, BTT, Int5), and prioritized action plan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Combines PR openai#287 (XSA + EMA + Int6 QAT) with PR openai#254 TTT adaptation. Changes: FA2 fallback import, TTT hyperparameters, ttt_adapt function, TTT call before torch.compile in eval section.

mohosy · 2026-03-21T04:19:48Z

@jfprincz bet thats good to know, 0.997 seems like the sweet spot then. gonna run with that

@himanalot bro switching off transformers is wild but honestly respect the move, if it hits it hits lol. what are you thinking like state space or something completely different

12.5MB compressed with 9 layers → room for 10th layer. Top PRs (openai#287, openai#309) use 10-11 layers for better BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

11 layers + 3x MLP — may be tight on 16MB budget. Will test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ded eval context Non-record research submission. Proposes caching K/V pairs across sliding windows to extend effective context from 2K to 50K+ tokens at eval time. Backward-looking, zero artifact cost, rule-compliant. Implementation provided but untested due to compute constraints. Base: PR openai#287 reproduction at 1.1284 BPB.

Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264), MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048), SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer. Single-seed result (seed=1337), ~8903 steps on 8xH100.

Implemented EMA (exponential moving average) weight averaging as alternative to SWA, matching SOTA PR openai#287 technique. ralph_020 with decay=0.997 regressed badly (initial weights 12% of EMA at 700 steps). Testing lower decay next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

himanalot · 2026-03-21T07:52:23Z

@mohosy
you'll see. i j wanna try something diff bc it's gotten boring ngl

Dropped QAT: 8% throughput penalty kills 600s budget (per PR openai#360). Three novel additions on thwu1 SOTA base (1.1428): - TrigramHash(20480, dim=32): trigram embedding signal, bigram 10240->4096 - XSA: orthogonal self-value removal, last 4 layers, from PR openai#287 - TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget Fixed rank bug: TTT runs on all 8 ranks independently (not rank 0 only) Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.

mohosy · 2026-03-23T00:25:28Z

@himanalot lol fair enough, everyone doing the same xsa ema stack does get kinda stale. lmk what you come up with tho im genuinley curious

himanalot · 2026-03-23T00:46:28Z

@mohosy xsa didn't work for me and ema provided little benefit. ssm didn't work either (not worth taking the hit to avg step time). tbh i was prob wrong in what i said. not sure why everyone's getting benefits with xsa and i'm seeing a 0.006 hit tho.

New addition: EMA (decay=0.9999) shadow model, eval uses EMA weights. EMA coexists with SWA. Zero artifact cost. Consistent with PR openai#338 (best open PR, 1.1254 bpb) which also uses EMA. 11th layer ruled out: needs ~0.91MB, only ~0.36MB budget available. Full stack on thwu1 base (1.1428): - TrigramHash(20480, dim=32): trigram embeddings, bigram 10240->4096 - XSA: orthogonal self-value removal, last 4 layers (PR openai#287) - EMA: decay=0.9999, shadow model used at final eval - TTT: 3-epoch SGD on val tokens, all ranks, ~47s budget Artifact: ~15.64MB. H100 validation pending.

T4 ablation (1000 steps, 4 variants): V2 bigram=10240 no trigram: 5.4379 loss WINNER V4 bigram=8192 + trigram=8192: 5.6956 loss V3 bigram=4096 + trigram=20480: 5.7924 loss (was our submission) V1 bigram=4096 no trigram: 5.8414 loss TrigramHash adds noise, bigram reduction actively hurts. Restored bigram=10240. Stack is now: XSA + EMA + TTT on thwu1 base. These are proven techniques (XSA from PR openai#287, EMA+TTT from PR openai#338 lineage) applied cleanly on the openai#1 submission.

…nt6-mlp3x-wd04-1.1271 Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)

- matrix_lr: 0.04→0.025, scalar_lr: 0.04→0.025 (from PR openai#287, openai#309) - tied_embed_lr: 0.05→0.035 - muon_momentum: 0.95→0.97, warmup_start: 0.85→0.90 - muon_momentum_warmup_steps: 500→800 - warmdown_iters: 1200→2000 (longer cooldown) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…816 (val_bpb 1.1116) 3-seed mean: 1.1116 ± 0.0005 Seeds: 1337=1.1110, 42=1.1121, 2024=1.1118 Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder+Cholesky) + coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816x112) + fullgraph=True torch.compile Built on PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)

d758328

notapplica mentioned this pull request Mar 20, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

mohosy mentioned this pull request Mar 21, 2026

Non-record: 11L SwiGLU + XSA4 + EMA + U-Net + AdamW TTT (pending compute) #291

Open

machdragon mentioned this pull request Mar 21, 2026

PR #287 base + Overtone init + 12L option machdragon/parameter-golf#5

Open

7 tasks

devin-ai-integration bot mentioned this pull request Mar 21, 2026

11L MLP3x int6: val_bpb=1.12109 (3-seed mean=1.12171, sliding window) andrewgcodes/parameter-golf#5

Open

5 tasks

sseanliu mentioned this pull request Mar 21, 2026

[Non-record] XSA + EMA + TTT: Negative interaction study (val_bpb=1.1436) #303

Open

dennisimoo mentioned this pull request Mar 21, 2026

Record: 11L XSA4 + EMA + Batch524K + zstd fallback (val_bpb: 1.1357) #307

Open

jfprincz mentioned this pull request Mar 21, 2026

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) #315

Merged

sseanliu mentioned this pull request Mar 21, 2026

Neural Cache: Cross-Window KV Caching for Extended Eval Context (research proposal) #318

Open

3 tasks

saml212 mentioned this pull request Mar 21, 2026

Record: 12L Gradient-Guided Quant + Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1320) #332

Open

sahiee-dev mentioned this pull request Mar 22, 2026

GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean) #430

Closed

abaybektursun mentioned this pull request Mar 22, 2026

Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean) #399

Open

tmustier mentioned this pull request Mar 23, 2026

Track 10min_16mb: PR #287 family rerun at 585s wallclock (mean val_bpb=1.1346) #483

Closed

This was referenced Mar 23, 2026

Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #534

Closed

Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #543

Open

cocohearts merged commit 0d44464 into openai:main Mar 23, 2026

cocohearts mentioned this pull request Mar 23, 2026

Update README leaderboard with merged record submissions #561

Merged

This was referenced Mar 24, 2026

Non-record: 11L GEPA + 20k Steps + Pure Int6 + Legal TTT (val_bpb=1.0983): unlimited compute: 4×A100-40GB, ~2.8 hours #628

Open

Non-record: 11L GEPA + 25k Steps + Pure Int6 + Legal TTT (val_bpb=1.0944) - unlimited compute category #644

Open

anthony-maio mentioned this pull request Mar 24, 2026

Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean) #657

Closed

6 tasks

Christopher-Lee-McClendon mentioned this pull request Mar 25, 2026

Non-record: 11L GEPA + 30k Steps + Pure Int6 + Legal TTT (val_bpb=1.0920) #668

Open

anthony-maio mentioned this pull request Mar 25, 2026

Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean) #175

Open

6 tasks

sunnypatneedi mentioned this pull request Mar 25, 2026

Record: AdamW TTT 30ep Cosine + Per-Layer LR (val_bpb: 1.0705) #771

Closed

Bortlesboat mentioned this pull request Mar 26, 2026

10L + Multi-Order N-gram Backoff (0.9123 BPB) #802

Open

pentxayc mentioned this pull request Mar 26, 2026

Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer #803

Open

3 tasks

haikosys mentioned this pull request Mar 27, 2026

Record: TurboQuant + Full-Rescore N-gram (val_bpb=0.1653) #918

Open

nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026

Merge pull request openai#287 from jfprincz/submission/11l-xsa4-ema-i…

d947625

…nt6-mlp3x-wd04-1.1271 Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)

Naazimsnh02 mentioned this pull request Mar 28, 2026

Record: 0.4311 BPB - Complementary Training + Backoff N-gram Mixer + TTT #1033

Open

7 tasks

vimeto mentioned this pull request Mar 29, 2026

Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed) #1072

Draft

mikeapedia mentioned this pull request Mar 29, 2026

Record Submission: 1.1091 BPB - Turbo-Muon + EngramLite + ParamBanking + XSA (11L 512d) #1089

Open

Gusanidas mentioned this pull request Mar 30, 2026

Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean) #1130

Open

barneywohl mentioned this pull request Mar 30, 2026

Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116) #1135

Open

Gusanidas mentioned this pull request Apr 1, 2026

Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean) #1212

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)#287

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)#287
cocohearts merged 1 commit intoopenai:mainfrom
jfprincz:submission/11l-xsa4-ema-int6-mlp3x-wd04-1.1271

jfprincz commented Mar 20, 2026

Uh oh!

mohosy commented Mar 20, 2026

Uh oh!

himanalot commented Mar 21, 2026 •

edited

Loading

Uh oh!

jfprincz commented Mar 21, 2026

Uh oh!

himanalot commented Mar 21, 2026

Uh oh!

himanalot commented Mar 21, 2026

Uh oh!

himanalot commented Mar 21, 2026 •

edited

Loading

Uh oh!

mohosy commented Mar 21, 2026

Uh oh!

himanalot commented Mar 21, 2026 •

edited

Loading

Uh oh!

mohosy commented Mar 23, 2026

Uh oh!

himanalot commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jfprincz commented Mar 20, 2026