
Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)#450

Open
zachgoldfine44 wants to merge 1 commit into openai:main from zachgoldfine44:submission/12L-catalytic-bigbigram-swa

Conversation

@zachgoldfine44

12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT

val_bpb: 1.14662 (mean of 3 seeds, sliding window stride=64, post int6+zstd quantization roundtrip)

3-Seed Results

| Seed | val_bpb | artifact_bytes | valid |
|------|---------|----------------|-------|
| 1337 | 1.14749 | 14,014,540 | yes |
| 42   | 1.14575 | 14,104,510 | yes |
| 7    | 1.14662 | 14,385,363 | yes |
| **Mean** | 1.14662 | | |
| **Std**  | 0.00071 | | |
  • Training: 600s, ~5,370 steps, ~112 ms/step on 8xH100 SXM
  • Eval: ~120s (20s roundtrip + 98s sliding window stride=64)
  • No TTT

Key Innovations

  1. Catalytic Residual Connections (novel): Replace x + f(x) with x + c * f(x), where c is a learned per-dimension scale vector. -0.024 bpb at negligible computational overhead (~11K extra params).

  2. 12 Layers: The standard stack uses 10-11 layers, leaving significant budget headroom. 12L is the depth sweet spot (-0.023 bpb vs 11L).

  3. BigramHash(10240): Larger bigram vocabulary (-0.070 bpb vs BigramHash(2048)).

  4. Late QAT (threshold=0.25): STE int6 quantization in the final portion of training.

  5. SWA: Stochastic weight averaging from last 20% of warmdown.
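For concreteness, the catalytic residual in item 1 can be sketched as a small PyTorch wrapper. This is a minimal illustration of the x + c * f(x) idea, not the PR's actual module; the class name, `dim` argument, and ones-initialization are assumptions (initializing c at 1 makes it start as a plain residual).

```python
import torch
import torch.nn as nn

class CatalyticResidual(nn.Module):
    """Residual connection x + c * f(x) with a learned per-dimension scale c.

    Hypothetical sketch of the PR's "catalytic residual": one extra
    parameter vector per wrapped sublayer (attention or MLP block),
    costing only an elementwise multiply at runtime.
    """

    def __init__(self, sublayer: nn.Module, dim: int):
        super().__init__()
        self.sublayer = sublayer
        # Start at 1.0 so training begins from the standard residual x + f(x).
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.c * self.sublayer(x)
```

With d_model around 768-1024 and two residuals per layer across 12 layers, this lands in the ~11K extra-parameter range the PR quotes.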
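The BigramHash feature in item 3 can be illustrated as hashing each (previous, current) token-id pair into a fixed number of buckets that index an extra embedding table. The PR does not spell out its hash, so the mixing constant and function name below are hypothetical; only the 10,240 bucket count comes from the submission.

```python
# Hypothetical sketch of a bigram-hash feature: each (prev, cur) token-id
# pair maps to one of NUM_BUCKETS rows of an auxiliary embedding table.
# The odd-prime multiplier is an arbitrary choice for illustration, not
# necessarily the PR's exact scheme.
NUM_BUCKETS = 10240

def bigram_bucket(prev_id: int, cur_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a token bigram to a bucket index in [0, num_buckets)."""
    return (prev_id * 1_000_003 + cur_id) % num_buckets

# Example: bucket indices for consecutive bigrams of a short id sequence.
ids = [5, 17, 9, 17]
buckets = [bigram_bucket(p, c) for p, c in zip(ids, ids[1:])]
```

Growing the table from 2,048 to 10,240 buckets reduces collisions between distinct bigrams, which is consistent with the large -0.070 bpb gain reported.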

Run Command

```
pip install sentencepiece zstandard
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All parameters set as defaults. No env vars needed.

Built on PR #180 standard stack by @thwu1.

Full logs for all 3 seeds included.


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuaswarren added a commit to joshuaswarren/parameter-golf that referenced this pull request Mar 22, 2026
…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

First submission combining 6 independently-proven architecture improvements:
- Catalytic Residuals (PR openai#450, -0.024 bpb)
- Value Residual/ResFormer (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048)
- 12 Layers (-0.023 bpb vs 11L)
- 3x MLP

8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)
@mohosy

mohosy commented Mar 23, 2026

catalytic residuals are lowkey genius, learning a per-dim scale on the residual for basically free in params. 12 layers is interesting too, most people stopped at 11. did you have any issues with artifact size going that deep?
