Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442) by chris-buckley · Pull Request #317 · openai/parameter-golf

chris-buckley · 2026-03-21T06:16:06Z

This takes the current best public training stack and makes one bet on top of it: full-model SGD test-time training at eval.

The training recipe is the 11L SmearGate/BigramHash/XSA4/EMA/int6+zstd-22 setup that's been winning. I kept it intact and added TTT as an orthogonal eval-time pass. It costs zero training budget. The SGD pass (3 epochs, lr=0.002, momentum=0.9, first 2 blocks frozen) runs on the dequantized checkpoint before scoring and takes about 50-80s.

What's different from Compression-Funded MLP3x

XSA on the last 4 layers
EMA instead of SWA
TTT at eval time (the big one)
Small LR bumps: matrix_lr 0.02 to 0.025, scalar_lr 0.02 to 0.025, tied_embed_lr 0.03 to 0.035
eval_stride 256 to 64

Results

Seed	Post-TTT val_bpb	Steps	ms/step	Artifact bytes
1337	1.1419	5,344	109.2	15,578,775
1338	1.1464	4,559	131.6	15,661,221
Mean	1.1442

Both artifacts under 16 MB. Both seeds beat the prior best single seed of 1.1598. Mean beats the prior mean of 1.1647 by 0.0205.

8xH100 SXM on community cloud. The two seeds ran on different nodes, which is why the step times differ. SDPA fallback since neither node had FA3 installed.

Training stack

11 layers, 512 dim, 8 heads / 4 KV heads, MLP 3x, SmearGate, BigramHash(2048), OrthoInit, muP-style output scaling, Muon/AdamW with WD=0.04, XSA on last 4 layers, EMA, int6 mixed quant + zstd-22.

Compatibility

The script falls back from FA3 to PyTorch SDPA automatically. I had to add manual KV head repeat for GQA since PyTorch 2.4 doesn't have enable_gqa, and clear the RoPE cache before TTT to avoid an inference tensor error during backward. It needs the zstandard package for zstd-22 compression (zlib puts the artifact over 16 MB).

Best public training stack (11L, MLP3x, SmearGate, BigramHash, XSA4, EMA, int6+zstd-22) plus full-model SGD TTT at eval time. Seed 1337 on 8xH100 SXM.

Seed 1337: 1.1419, Seed 1338: 1.1464, Seed 1339: 1.1543. All 3 seeds beat prior SOTA (1.1598). Mean delta vs prior: -0.0172. Note: Seed 1339 artifact 222KB over 16MB limit (slower node).

Drop seed 1339 (artifact over 16MB limit). Seeds 1337+1338 both under limit with mean val_bpb=1.1442, beating prior SOTA by 0.0205.

cocohearts · 2026-03-22T01:08:30Z

ah nice try, test time training unfortunately is not in "the spirit of the challenge"

Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1419)

0c2a5ac

Best public training stack (11L, MLP3x, SmearGate, BigramHash, XSA4, EMA, int6+zstd-22) plus full-model SGD TTT at eval time. Seed 1337 on 8xH100 SXM.

notapplica mentioned this pull request Mar 21, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

chris-buckley added 3 commits March 21, 2026 16:47

Add seed 1338 results (val_bpb=1.1464), 2-seed mean=1.1442

b028e01

Add seed 1339, complete 3-seed sweep (mean val_bpb=1.1475)

72fb6c9

Seed 1337: 1.1419, Seed 1338: 1.1464, Seed 1339: 1.1543. All 3 seeds beat prior SOTA (1.1598). Mean delta vs prior: -0.0172. Note: Seed 1339 artifact 222KB over 16MB limit (slower node).

Keep only 2 valid seeds (both under 16MB), mean val_bpb=1.1442

40ffdcf

Drop seed 1339 (artifact over 16MB limit). Seeds 1337+1338 both under limit with mean val_bpb=1.1442, beating prior SOTA by 0.0205.

chris-buckley changed the title ~~Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1419)~~ Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442) Mar 21, 2026

cocohearts added the invalid This doesn't seem right label Mar 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442)#317

Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442)#317
chris-buckley wants to merge 4 commits intoopenai:mainfrom
chris-buckley:record/11L-XSA4-EMA-TTT-Int6-MLP3x

chris-buckley commented Mar 21, 2026 •

edited

Loading

Uh oh!

cocohearts commented Mar 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chris-buckley commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's different from Compression-Funded MLP3x

Results

Training stack

Compatibility

Uh oh!

cocohearts commented Mar 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chris-buckley commented Mar 21, 2026 •

edited

Loading