
Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)#450

Open
zachgoldfine44 wants to merge 1 commit into openai:main from zachgoldfine44:submission/12L-catalytic-bigbigram-swa

Conversation

@zachgoldfine44

12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT

val_bpb: 1.14662 (mean of 3 seeds, sliding window stride=64, post int6+zstd quantization roundtrip)

3-Seed Results

| Seed | val_bpb | artifact_bytes | valid |
|------|---------|----------------|-------|
| 1337 | 1.14749 | 14,014,540 | yes |
| 42   | 1.14575 | 14,104,510 | yes |
| 7    | 1.14662 | 14,385,363 | yes |
| **Mean** | 1.14662 | | |
| **Std**  | 0.00071 | | |
  • Training: 600s, ~5,370 steps, ~112 ms/step on 8xH100 SXM
  • Eval: ~120s (20s roundtrip + 98s sliding window stride=64)
  • No TTT

Key Innovations

  1. Catalytic Residual Connections (novel): Replace x + f(x) with x + c * f(x), where c is a learned per-dimension scale vector. -0.024 bpb at negligible computational overhead (~11K extra params).

  2. 12 Layers: The standard stack uses 10-11 layers, leaving significant budget headroom. 12L is the depth sweet spot (-0.023 bpb vs 11L).

  3. BigramHash(10240): Larger bigram vocabulary (-0.070 bpb vs BigramHash(2048)).

  4. Late QAT (threshold=0.25): STE int6 quantization in the final portion of training.

  5. SWA: Stochastic weight averaging from last 20% of warmdown.
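For concreteness, the catalytic residual in item 1 can be sketched as a small PyTorch wrapper. This is a minimal illustration of the x + c * f(x) idea, not the PR's actual module; the class name, `dim` argument, and ones-initialization are assumptions (initializing c at 1 makes it start as a plain residual).

```python
import torch
import torch.nn as nn

class CatalyticResidual(nn.Module):
    """Residual connection x + c * f(x) with a learned per-dimension scale c.

    Hypothetical sketch of the PR's "catalytic residual": one extra
    parameter vector per wrapped sublayer (attention or MLP block),
    costing only an elementwise multiply at runtime.
    """

    def __init__(self, sublayer: nn.Module, dim: int):
        super().__init__()
        self.sublayer = sublayer
        # Start at 1.0 so training begins from the standard residual x + f(x).
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.c * self.sublayer(x)
```

With d_model around 768-1024 and two residuals per layer across 12 layers, this lands in the ~11K extra-parameter range the PR quotes.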
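The BigramHash feature in item 3 can be illustrated as hashing each (previous, current) token-id pair into a fixed number of buckets that index an extra embedding table. The PR does not spell out its hash, so the mixing constant and function name below are hypothetical; only the 10,240 bucket count comes from the submission.

```python
# Hypothetical sketch of a bigram-hash feature: each (prev, cur) token-id
# pair maps to one of NUM_BUCKETS rows of an auxiliary embedding table.
# The odd-prime multiplier is an arbitrary choice for illustration, not
# necessarily the PR's exact scheme.
NUM_BUCKETS = 10240

def bigram_bucket(prev_id: int, cur_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a token bigram to a bucket index in [0, num_buckets)."""
    return (prev_id * 1_000_003 + cur_id) % num_buckets

# Example: bucket indices for consecutive bigrams of a short id sequence.
ids = [5, 17, 9, 17]
buckets = [bigram_bucket(p, c) for p, c in zip(ids, ids[1:])]
```

Growing the table from 2,048 to 10,240 buckets reduces collisions between distinct bigrams, which is consistent with the large -0.070 bpb gain reported.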

Run Command

```
pip install sentencepiece zstandard
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All parameters set as defaults. No env vars needed.

Built on PR #180 standard stack by @thwu1.

Full logs for all 3 seeds included.


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuaswarren added a commit to joshuaswarren/parameter-golf that referenced this pull request Mar 22, 2026
…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

First submission combining 6 independently-proven architecture improvements:
- Catalytic Residuals (PR openai#450, -0.024 bpb)
- Value Residual/ResFormer (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048)
- 12 Layers (-0.023 bpb vs 11L)
- 3x MLP

8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)
@mohosy

mohosy commented Mar 23, 2026

catalytic residuals are lowkey genius, learning a per-dim scale on the residual for basically free in params. 12 layers is interesting too, most people stopped at 11. did you have any issues with artifact size going that deep?
