ignore: withdrew submission by fielding · Pull Request #757 · openai/parameter-golf

fielding · 2026-03-25T19:01:53Z

Aggressive SGD TTT (val_bpb: 1.1124)

3-seed mean val_bpb: 1.1124 (std=0.0008) | 15.4 MB artifact | 8xH100 SXM, 600s training + 591s eval

Results

Seed	val_bpb (sliding, s64)	Artifact (bytes)
1337	1.1129	15,405,733
42	1.1128	~15.4M
2024	1.1114	~15.4M
Mean ± Std	1.1124 ± 0.0008
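The summary row can be reproduced from the three per-seed scores. A quick check, assuming the reported std is the sample (n-1) standard deviation, which is what matches 0.0008:

```python
import statistics

# Per-seed sliding-window val_bpb values from the results table.
val_bpb = [1.1129, 1.1128, 1.1114]  # seeds 1337, 42, 2024

mean = statistics.mean(val_bpb)
# statistics.stdev is the sample standard deviation (divides by n - 1).
std = statistics.stdev(val_bpb)

print(f"mean={mean:.4f} std={std:.4f}")
```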

Approach

Standard 11L architecture, nothing exotic on the model side. The interesting part is the TTT. The base model trains for 600s, then TTT adapts all weights via SGD for 30 epochs on the validation data (score-first protocol).

The conventional wisdom is TTT at LR=0.002 for 3 epochs. I ran 20+ configurations on 4xH200 and found that cranking the LR to 1.0, running 30 epochs, and unfreezing every block turns a -0.0025 BPB technique into a -0.041 BPB technique. That's roughly a 16x improvement from the same underlying method. It's like finding out your car has a sport mode you never tried.

TTT Configuration

I swept this on 4xH200 before validating on 8xH100. The sweep told the whole story.

Parameter	Our Value	PR #549 (merged SOTA)
LR	1.0	0.002
Epochs	30	3
Freeze blocks	0 (all unfrozen)	0
Momentum	0.9	0.9
TTT gain	-0.041 BPB	-0.0025 BPB
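For reference, here is what one SGD-with-momentum update looks like with these settings. This is a minimal sketch in plain Python that follows the update convention documented for PyTorch's SGD (velocity = momentum * velocity + grad, then param -= lr * velocity). Whether the submission uses that optimizer class is an assumption, and `sgd_momentum_step` is a hypothetical helper.

```python
def sgd_momentum_step(params, grads, bufs, lr=1.0, momentum=0.9):
    """One in-place SGD-with-momentum update over flat lists of floats."""
    for i, g in enumerate(grads):
        bufs[i] = momentum * bufs[i] + g  # velocity accumulates past gradients
        params[i] -= lr * bufs[i]         # step along the velocity
    return params, bufs

# Toy usage: two scalar "weights" that see a constant gradient of 0.1 twice.
w, v = [0.5, -0.5], [0.0, 0.0]
for _ in range(2):
    sgd_momentum_step(w, [0.1, 0.1], v)
print(w, v)
```

With a constant gradient the velocity approaches grad / (1 - momentum), so at momentum=0.9 the effective step can grow to about 10x the raw LR.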

TTT LR Sweep (4xH200, 20 epochs, freeze=2)

LR	Sliding BPB
0.01	1.1489
0.02	1.1471
0.05	1.1444
0.1	1.1422
0.2	1.1400
0.5	1.1351
0.7	1.1327
0.8	1.1355
1.0	1.1585 (diverged)

BPB just keeps getting better as LR goes up... until it doesn't. Peak at 0.7 with 2 frozen blocks.
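Picking the best point out of the sweep is a one-liner. The numbers below are copied from the table above; the variable names are just for illustration.

```python
# Sliding BPB per TTT learning rate (4xH200, 20 epochs, freeze=2).
sweep = {
    0.01: 1.1489, 0.02: 1.1471, 0.05: 1.1444, 0.1: 1.1422, 0.2: 1.1400,
    0.5: 1.1351, 0.7: 1.1327, 0.8: 1.1355, 1.0: 1.1585,
}

best_lr = min(sweep, key=sweep.get)  # lower BPB is better
gain_vs_lowest_lr = sweep[0.01] - sweep[best_lr]
print(best_lr, sweep[best_lr], round(gain_vs_lowest_lr, 4))
```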

Unfreezing all blocks (4xH200, 20 epochs)

LR	freeze=2	freeze=0	Delta
0.7	1.1327	1.1255	-0.007
1.0	diverged	1.1183	—
1.5	diverged	1.1110	—

This was the breakthrough. With 2 frozen blocks, LR=1.0 diverges. Unfreeze everything and it converges fine. My read is that the extra capacity from unfreezing absorbs the aggressive learning rate. It also shifts the best LR from 0.7 up to at least 1.5, the highest value in the table.
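To make the freeze setting concrete, here is a rough sketch of how a freeze count could select which parameters TTT updates. It assumes that freeze=N means the first N transformer blocks are excluded and that block parameters are named `blocks.<index>.<rest>`. Both are assumptions for illustration, not taken from train_gpt.py.

```python
import re

def trainable_param_names(names, freeze_blocks):
    """Return the parameter names TTT would update.

    Hypothetical naming: block parameters look like "blocks.<index>.<rest>".
    Anything outside a block (embeddings, final norm) stays trainable here.
    """
    block_re = re.compile(r"^blocks\.(\d+)\.")
    keep = []
    for name in names:
        m = block_re.match(name)
        if m and int(m.group(1)) < freeze_blocks:
            continue  # frozen: left out of the TTT optimizer
        keep.append(name)
    return keep

# Hypothetical 11-block model with one weight per block plus an embedding.
names = ["embed.weight"] + [f"blocks.{i}.mlp.weight" for i in range(11)]
print(len(trainable_param_names(names, freeze_blocks=2)))
print(len(trainable_param_names(names, freeze_blocks=0)))
```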

Epoch scaling (4xH200, LR=1.0, freeze=0)

Epochs	Sliding BPB	TTT time
20	1.1183	569s
30	1.1076	854s

On 8xH100, each TTT epoch runs in ~16.6s (vs 28.5s on 4xH200), so 30 epochs fits within the 10-minute eval budget.
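The budget arithmetic can be sketched as follows. It uses the 92s sliding-window and 2s roundtrip timings reported for the 8xH100 run and assumes all three eval phases share the 600s budget. Reusing those two timings for 4xH200 is an assumption made only for comparison.

```python
import math

EVAL_BUDGET_S = 600.0     # 10-minute eval budget
SLIDING_WINDOW_S = 92.0   # sliding-window eval time reported on 8xH100
ROUNDTRIP_S = 2.0         # quantization roundtrip time reported on 8xH100

def max_ttt_epochs(seconds_per_epoch):
    """Largest whole number of TTT epochs that fits in the remaining budget."""
    remaining = EVAL_BUDGET_S - SLIDING_WINDOW_S - ROUNDTRIP_S
    return math.floor(remaining / seconds_per_epoch)

print(max_ttt_epochs(16.6))  # 8xH100 per-epoch time
print(max_ttt_epochs(28.5))  # 4xH200 per-epoch time
```

Under these assumptions 30 epochs is the most that fits on 8xH100, with very little slack.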

Architecture

Component	Detail
Layers	11
Dim	512
Heads	8 (4 KV, GQA)
MLP	3x, relu-squared
XSA	Last 4 layers
EMA	0.997
Late QAT	Int6 STE when lr_scale < 0.1
Value Embeddings	128-dim, 5 sets
BigramHash	6144 buckets
SmearGate	Learned token blending
Warmdown	1600 iterations
Seq length	2048 (train), 1024 (eval)
Sliding window	stride=64
Quantization	Int6 per-row + zstd-22
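As a rough illustration of the "Int6 per-row" entry, here is a generic symmetric per-row quantizer in plain Python. The clipping range, the scale choice, and the helper names are assumptions for this sketch. The submission's actual scheme and the zstd-22 compression step are not shown.

```python
def quantize_row_int6(row):
    """Symmetric per-row quantization to the signed range [-31, 31]."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0  # one scale per row
    q = [max(-31, min(31, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    """Map integer codes back to floats using the row's scale."""
    return [v * scale for v in q]

row = [0.31, -0.15, 0.0, 0.024]
q, scale = quantize_row_int6(row)
restored = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_err)
```

The roundtrip error per weight is bounded by half the row's scale, which is why a per-row scale helps when rows differ a lot in magnitude.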

Training

Muon optimizer (matrix_lr=0.025, momentum=0.99 with warmup from 0.85)
AdamW for embeddings/scalars (WD=0.04)
Flash Attention v3 (Hopper) where available, SDPA fallback
6039 steps in 600s on 8xH100 (~99ms/step)

Evaluation

Three phases, all within the 10-minute eval budget:

1. Int6+zstd quantization roundtrip
2. TTT: SGD(lr=1.0, momentum=0.9), 30 epochs, all blocks unfrozen, score-first
3. Sliding window eval (stride=64, seq_len=1024)

Total eval time: ~591s (TTT 497s + sliding window 92s + roundtrip 2s)
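The sliding window phase can be sketched as span bookkeeping. This is one common way to do strided evaluation: the first window scores everything it covers, and each later window advances by the stride and scores only the new tokens, so each of those sees up to seq_len - stride tokens of context. Whether train_gpt.py handles the first window this way is an assumption, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    """Yield (start, end, score_from) spans.

    The model sees tokens [start, end); only [score_from, end) is newly scored.
    """
    scored_to = 0
    while scored_to < n_tokens:
        step = seq_len if scored_to == 0 else stride
        end = min(n_tokens, scored_to + step)
        start = max(0, end - seq_len)
        yield start, end, scored_to
        scored_to = end

spans = list(sliding_windows(1024 + 3 * 64))
for span in spans:
    print(span)
```

Every token is scored exactly once. The cost is roughly seq_len / stride forward passes per token compared with non-overlapping evaluation.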

Run Command

TTT_ENABLED=1 TTT_LR=1.0 TTT_EPOCHS=30 TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 \
VE_ENABLED=1 WARMDOWN_ITERS=1600 NUM_LAYERS=11 XSA_LAST_N=4 \
EMA_ENABLED=1 LATE_QAT=1 BIGRAM_VOCAB_SIZE=6144 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
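For readers unfamiliar with this style of configuration, here is a minimal sketch of how the TTT_* variables in the command above could be read in Python. The `TTTConfig` class, the helper name, and the fallback values are placeholders for illustration, not the script's real defaults.

```python
import os
from dataclasses import dataclass

@dataclass
class TTTConfig:
    enabled: bool
    lr: float
    epochs: int
    freeze_blocks: int
    momentum: float

def ttt_config_from_env(env=os.environ):
    """Read TTT settings from a mapping of environment variables.

    Fallbacks are placeholders for this sketch, not train_gpt.py's defaults.
    """
    return TTTConfig(
        enabled=env.get("TTT_ENABLED", "0") == "1",
        lr=float(env.get("TTT_LR", "1.0")),
        epochs=int(env.get("TTT_EPOCHS", "30")),
        freeze_blocks=int(env.get("TTT_FREEZE_BLOCKS", "0")),
        momentum=float(env.get("TTT_MOMENTUM", "0.9")),
    )

cfg = ttt_config_from_env({
    "TTT_ENABLED": "1", "TTT_LR": "1.0", "TTT_EPOCHS": "30",
    "TTT_FREEZE_BLOCKS": "0", "TTT_MOMENTUM": "0.9",
})
print(cfg)
```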

How I Got Here

~20 hours on 4xH200, 54 experiments. Started from the 9L baseline and worked forward:

1. Baseline (9L, no extras): 1.1808
2. +11L, XSA, EMA, QAT: 1.1619
3. +Flash Attention v3: 1.1527
4. +Value Embeddings, warmdown tuning: 1.1521
5. +TTT (LR=0.01, 10ep, freeze=2): 1.1489
6. TTT LR sweep to 0.7: 1.1327
7. Unfreeze all blocks: 1.1255
8. LR=1.5, 20ep: 1.1110
9. 30ep, LR=1.0: 1.1076
10. 8xH100 (more training steps): 1.1124

Step 7 was where it got fun. Everything before that was incremental hill climbing. Unfreezing all blocks during TTT changed the optimization landscape enough that learning rates that previously diverged started converging, and the whole curve shifted.

Schrödinger's SOTA

This beats the merged leaderboard (1.1194) by 0.007 BPB. I haven't checked the pending PRs. Until they're merged, this is simultaneously a record and not a record, and I'm choosing to live in that superposition for a bit.

Credits

Built on the community's collective work, especially PR #414 (signalrush), PR #461 (Christopher-Lee-McClendon), and PR #549 (abaybektursun).

Inspired by PR openai#757 which found SGD LR=1.0 gives 16x better TTT gain than conventional LR=0.002.

Key changes:
- TTT_OPTIMIZER env var: "sgd" (default) or "adamw"
- Default LR: 0.0001 -> 1.0 (SGD)
- Default epochs: 4 -> 20
- Default freeze_blocks: 2 -> 0 (all unfrozen)

PR openai#757 showed: freeze=0 + high LR converges fine, extra capacity absorbs aggressive learning rate. 20ep × ~16s = ~320s on 8xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

QAT (training-time):
- LATE_QAT=1, SOFT_ROUND_QAT=1
- QAT_THRESHOLD=0.5 (Early QAT, ~50% of warmdown)
- QUANT_PERCENTILE=0.9999 (clip outliers)

TTT (eval-time, PR openai#757 inspired):
- SGD lr=1.0, 20 epochs, all blocks unfrozen
- Score-first, Issue openai#677 compliant

TTT_ENABLED still defaults to 0; it must be explicitly enabled with TTT_ENABLED=1 to activate TTT at eval time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fielding · 2026-03-25T21:20:08Z

Heads up: Sooo, the downside to not checking the PRs/issues while experimenting is that you miss important things. After seeing Issue #677, I'm reviewing my TTT implementation against the score-first requirement. The current version runs 30 epochs of TTT then re-scores with sliding window, which I believe violates the intent of the rules. Working on a legal single-pass score-first version now and will push updated logs once validated. Apologies for the oversight.

fielding · 2026-03-25T21:27:41Z

Update: Looking at PR #549's score-first implementation as the reference. Plan is to restructure TTT to score each chunk before training on it (same pattern as the merged SOTA). Currently validating on 4xH200; I'll push updated code and logs once confirmed. (I've also applied for a Runpod credit grant.)
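The "score each chunk before training on it" pattern can be sketched with a toy model. Here the "model" is a single scalar that predicts every value in a chunk and the loss is mean squared error. The chunking, the one-step update, and the function name are hypothetical stand-ins, not PR #549's code.

```python
def score_first_ttt(chunks, lr=0.1):
    """Score each chunk with the current parameter, then take one SGD step on it."""
    w = 0.0
    total_loss, total_count = 0.0, 0
    for chunk in chunks:
        # 1) Score first: this loss is recorded before w has seen the chunk.
        total_loss += sum((x - w) ** 2 for x in chunk)
        total_count += len(chunk)
        # 2) Then adapt: one gradient step on the chunk that was just scored.
        grad = sum(2 * (w - x) for x in chunk) / len(chunk)
        w -= lr * grad
    return total_loss / total_count, w

score, w = score_first_ttt([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], lr=0.25)
print(score, w)
```

The reported score only ever reflects predictions made before the model trained on the data being scored. That is the property a multi-epoch pass followed by re-scoring does not have.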

fielding · 2026-03-25T22:14:41Z

Converted to a draft while I figure out my compute situation. Sorry for the troubles.

Record: Aggressive SGD TTT (3-seed mean val_bpb=1.1124)

777ed67

fielding marked this pull request as draft March 25, 2026 22:14

fielding closed this Mar 27, 2026

fielding changed the title ~~Record: Aggressive SGD TTT (3-seed mean val_bpb=1.1124)~~ ignore: withdrew submission Mar 29, 2026

fielding wants to merge 1 commit into openai:main from fielding:submission/2026-03-25_TTT_Aggressive_SGD