
Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271) #287

Merged: cocohearts merged 1 commit into openai:main from jfprincz:submission/11l-xsa4-ema-int6-mlp3x-wd04-1.1271 on Mar 23, 2026

Conversation

@jfprincz (Contributor)

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)

val_bpb: 1.1271 (sliding window, stride=64) | 15.5 MB | 8xH100 SXM, 600s

Progress from prior submissions

|                   | PR #70        | PR #164       | PR #198      | This         | Delta vs #198 |
|-------------------|---------------|---------------|--------------|--------------|---------------|
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | 1.1271 (s64) | -0.0047       |
| Layers            | 9             | 9             | 11           | 11           |               |
| Params            | 21.8M         | 22.4M         | 26.8M        | 26.8M        |               |
| Artifact          | 14.9 MB       | 15.4 MB       | 15.7 MB      | 15.5 MB      | -0.2 MB       |

Two new techniques on top of PR #198's 11-layer stack.

Key additions over PR #198

| Change | Impact |
|--------|--------|
| Exclusive Self Attention (XSA) on last 4 layers | Removes self-value bias from attention output via orthogonal projection. Zero new parameters. |
| EMA replacing SWA (decay=0.997) | Exponential moving average every step instead of periodic SWA checkpoints. Smoother weight averaging, better generalization and compression. |

Everything else from PR #198 carries forward: 11 layers, OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
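For readers unfamiliar with XSA: a minimal zero-parameter sketch of one reading of "removes self-value bias via orthogonal projection" is below. The function and variable names are illustrative, not from train_gpt.py, and the real implementation presumably operates per head inside the attention block; this only shows the projection itself.

```python
import numpy as np

def xsa_project(attn_out, v, eps=1e-8):
    """Sketch of the XSA idea: for each position i, subtract the component
    of the attention output along that position's own value vector,
    out_i <- out_i - (<out_i, v_i> / ||v_i||^2) * v_i,
    leaving the output orthogonal to the self-value. Shapes: (B, T, D)."""
    v_norm_sq = np.maximum((v * v).sum(axis=-1, keepdims=True), eps)
    coef = (attn_out * v).sum(axis=-1, keepdims=True) / v_norm_sq
    return attn_out - coef * v
```

Since this is a fixed projection, it adds no parameters, consistent with the "zero new parameters" claim in the table.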

Results

| Metric                      | Value            |
|-----------------------------|------------------|
| Pre-quant val_bpb           | 1.1427           |
| Int6 roundtrip val_bpb      | 1.1494           |
| Int6 sliding val_bpb (s64)  | 1.1271           |
| Steps completed (600s cap)  | 7,103            |
| Step time                   | 84 ms            |
| Model params                | 26,829,913       |
| Artifact size               | 15,534,645 bytes |
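The pre-quant to roundtrip gap (1.1427 to 1.1494) is the cost of pushing weights through 6-bit quantization. A minimal sketch of a symmetric int6 fake-quant roundtrip is below; this is illustrative only, since the PR's actual scheme is mixed int6 quantization followed by zstd-22 compression.

```python
import numpy as np

def int6_roundtrip(w):
    """Symmetric per-tensor fake-quant to 6-bit signed integers in
    [-31, 31], then back to floats. The dequantized weights are what a
    roundtrip eval like the one in the table would actually run on."""
    scale = max(np.abs(w).max() / 31.0, 1e-12)  # guard against all-zero w
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q * scale
```

Per-element roundtrip error is bounded by half a quantization step (scale / 2), which is where the small bpb regression comes from.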

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact |
|------|-------|-------------|----------|
| 1337 | 7,103 | 1.1271      | 15.53 MB |
| 42   | 7,094 | 1.1286      | 15.75 MB |
| 2025 | 7,107 | 1.1284      | 15.65 MB |

Mean: 1.1280 | Range: 0.0015 | Submitted: seed 1337
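For context on the s64/s256 notation: sliding-window eval scores each token with up to a full window of preceding context, advancing the window by the stride between scoring passes. A hypothetical sketch of how stride-64 windows could be laid out (the exact eval loop in train_gpt.py may differ) is:

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Each tuple is (ctx_start, ctx_end, score_start): the model reads
    tokens [ctx_start, ctx_end) but only positions [score_start, ctx_end)
    contribute loss. After the first full window, each pass advances by
    `stride` tokens, so every scored token sees nearly `window` context."""
    spans, scored = [], 0
    while scored < n_tokens:
        if scored == 0:
            ctx_end = min(window, n_tokens)
        else:
            ctx_end = min(scored + stride, n_tokens)
        ctx_start = max(0, ctx_end - window)
        spans.append((ctx_start, ctx_end, scored))
        scored = ctx_end
    return spans
```

A smaller stride gives each scored token more context (hence the lower bpb at s64 than at s256) at the cost of many more forward passes.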

Run command

NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py


mohosy commented Mar 20, 2026

nice find on ema over swa, that makes a lot of sense for smoother convergence. did you try different decay values, or was 0.997 just the first thing that worked?

mohosy pushed a commit to mohosy/parameter-golf that referenced this pull request Mar 21, 2026
Adds TTT (3-epoch SGD on val data) to jfprincz's openai#287 base (1.1271).
TTT is eval-time only so artifact size stays at ~15.5MB.
Projected score: ~1.122-1.124.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

himanalot commented Mar 21, 2026

@mohosy was def tuned based on smaller runs.

@jfprincz (Contributor, Author)

> nice find on ema over swa, that makes a lot of sense for smoother convergence. did you try different decay values, or was 0.997 just the first thing that worked?

tried 0.999; it averaged too slowly and hurt BPB. 0.997 was better so i kept it
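For anyone sweeping decay values themselves, the update being compared here is just the standard per-step EMA (names illustrative, not from the training script):

```python
def ema_update(shadow, params, decay=0.997):
    """One EMA step, run after each optimizer update:
    shadow <- decay * shadow + (1 - decay) * current.
    Evaluation then uses the shadow weights instead of the raw ones."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow
```

The trade-off in the comment above falls out directly: decay=0.999 blends in only 0.1% of the current weights per step, so the shadow tracks training roughly 3x more slowly than at 0.997.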

kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request Mar 21, 2026
4 proven optimizations from merged leaderboard entries:
- train_seq_len 1024→2048 (both top entries use this)
- SWA every 50 steps, start_frac=0.4 (swept optimal by openai#2 author)
- grad_clip_norm 0.3 (both top entries use this)
- LRs 0.025/0.025/0.035 (from PR openai#287)

BigramHash stays at 2048 to avoid artifact size risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@himanalot

xsa and ema lowkey hurt mine, was getting similar numbers to urs but they boosted mine back up

@himanalot

i took a lot of diff approaches from u tho

mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
- Add leaderboard table: jfprincz 1.1271 is new target; mohosy racing same stack
- Add Reptile meta-TTT finding (PR openai#296): 10x better than naive TTT with SmearGate;
  error-guided TTT is negative; 13L crossover point identified
- Add SWA checkpoint count finding (PR openai#238): 84 checkpoints reverses quant gap;
  explains why our WD=1200 SWA showed no effect
- Update jfprincz entry to include PR openai#287 results (1.1271)
- Add meta-lessons 10 and 11

himanalot commented Mar 21, 2026

im switching off transformers
gonna start from scratch but lowkey might be worth to make my own mechanism (maybe i'll prove myself wrong tho)

kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request Mar 21, 2026
…erparams

Aggressive iteration to beat SOTA 1.1428. Changes from SmearGate v5:
- XSA: exclusive self-attention on last 4 layers (zero-param, from PR openai#287)
- EMA decay=0.997 replaces SWA (continuous averaging, from PR openai#287)
- train_seq_len 1024→2048
- grad_clip_norm 0.3
- LRs 0.025/0.025/0.035
- WD 0.04, 11 layers, BigramHash 2048

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 21, 2026
PR openai#287 (SOTA val_bpb=1.1271) with Overtone embedding init added.
Overtone reshapes tok_emb singular values to power-law decay for
better int6 quantization. No KURE, R2, tanh, or TTT — clean minimal
delta over proven SOTA. Modal launcher with CLI flags for EMA decay,
num_layers, and XSA sweep.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 21, 2026
Comprehensive analysis of Parameter Golf techniques across 3 rounds of
deep research, validated against 100+ PRs from openai/parameter-golf.
Covers SOTA stack (PR openai#287, 1.1271 BPB), proven techniques (TTT, XSA,
EMA), debunked approaches (LZMA, BTT, Int5), and prioritized action plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sseanliu added a commit to sseanliu/parameter-golf that referenced this pull request Mar 21, 2026
Combines PR openai#287 (XSA + EMA + Int6 QAT) with PR openai#254 TTT adaptation.
Changes: FA2 fallback import, TTT hyperparameters, ttt_adapt function,
TTT call before torch.compile in eval section.

mohosy commented Mar 21, 2026

@jfprincz bet, that's good to know, 0.997 seems like the sweet spot then. gonna run with that

@himanalot bro switching off transformers is wild but honestly respect the move, if it hits it hits lol. what are you thinking, like state space or something completely different?

ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Mar 21, 2026
12.5MB compressed with 9 layers → room for 10th layer.
Top PRs (openai#287, openai#309) use 10-11 layers for better BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Mar 21, 2026
11 layers + 3x MLP — may be tight on 16MB budget. Will test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sseanliu added a commit to sseanliu/parameter-golf that referenced this pull request Mar 21, 2026
…ded eval context

Non-record research submission. Proposes caching K/V pairs across sliding
windows to extend effective context from 2K to 50K+ tokens at eval time.
Backward-looking, zero artifact cost, rule-compliant. Implementation provided
but untested due to compute constraints. Base: PR openai#287 reproduction at 1.1284 BPB.
HyperPotatoNeo added a commit to HyperPotatoNeo/parameter-golf that referenced this pull request Mar 21, 2026
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264),
MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048),
SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer.
Single-seed result (seed=1337), ~8903 steps on 8xH100.
lolrazh added a commit to lolrazh/parameter-golf that referenced this pull request Mar 21, 2026
Implemented EMA (exponential moving average) weight averaging as
alternative to SWA, matching SOTA PR openai#287 technique. ralph_020
with decay=0.997 regressed badly (initial weights 12% of EMA at
700 steps). Testing lower decay next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

himanalot commented Mar 21, 2026

@mohosy you'll see. i j wanna try something diff bc it's gotten boring ngl

sahiee-dev added a commit to sahiee-dev/parameter-golf that referenced this pull request Mar 22, 2026
Dropped QAT: 8% throughput penalty kills 600s budget (per PR openai#360).

Three novel additions on thwu1 SOTA base (1.1428):
- TrigramHash(20480, dim=32): trigram embedding signal, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers, from PR openai#287
- TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget
  Fixed rank bug: TTT runs on all 8 ranks independently (not rank 0 only)

Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.

mohosy commented Mar 23, 2026

@himanalot lol fair enough, everyone doing the same xsa ema stack does get kinda stale. lmk what you come up with tho, im genuinely curious

@himanalot

@mohosy xsa didn't work for me and ema provided little benefit. ssm didn't work either (not worth taking the hit to avg step time). tbh i was prob wrong in what i said. not sure why everyone's getting benefits with xsa and i'm seeing a 0.006 hit tho.

sahiee-dev added a commit to sahiee-dev/parameter-golf that referenced this pull request Mar 23, 2026
New addition: EMA (decay=0.9999) shadow model, eval uses EMA weights.
EMA coexists with SWA. Zero artifact cost. Consistent with PR openai#338
(best open PR, 1.1254 bpb) which also uses EMA.

11th layer ruled out: needs ~0.91MB, only ~0.36MB budget available.

Full stack on thwu1 base (1.1428):
- TrigramHash(20480, dim=32): trigram embeddings, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers (PR openai#287)
- EMA: decay=0.9999, shadow model used at final eval
- TTT: 3-epoch SGD on val tokens, all ranks, ~47s budget

Artifact: ~15.64MB. H100 validation pending.
sahiee-dev added a commit to sahiee-dev/parameter-golf that referenced this pull request Mar 23, 2026
T4 ablation (1000 steps, 4 variants):
V2 bigram=10240 no trigram:     5.4379 loss  WINNER
V4 bigram=8192 + trigram=8192:  5.6956 loss
V3 bigram=4096 + trigram=20480: 5.7924 loss  (was our submission)
V1 bigram=4096 no trigram:      5.8414 loss
TrigramHash adds noise, bigram reduction actively hurts.
Restored bigram=10240. Stack is now: XSA + EMA + TTT on thwu1 base.
These are proven techniques (XSA from PR openai#287, EMA+TTT from PR openai#338 lineage)
applied cleanly on the openai#1 submission.
cocohearts merged commit 0d44464 into openai:main Mar 23, 2026
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
…nt6-mlp3x-wd04-1.1271

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Mar 28, 2026
- matrix_lr: 0.04→0.025, scalar_lr: 0.04→0.025 (from PR openai#287, openai#309)
- tied_embed_lr: 0.05→0.035
- muon_momentum: 0.95→0.97, warmup_start: 0.85→0.90
- muon_momentum_warmup_steps: 500→800
- warmdown_iters: 1200→2000 (longer cooldown)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
barneywohl added a commit to barneywohl/parameter-golf that referenced this pull request Mar 30, 2026
…816 (val_bpb 1.1116)

3-seed mean: 1.1116 ± 0.0005
Seeds: 1337=1.1110, 42=1.1121, 2024=1.1118

Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder+Cholesky)
+ coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816x112)
+ fullgraph=True torch.compile

Built on PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.


4 participants