Conversation
Inspired by PR openai#757, which found SGD LR=1.0 gives a 16x better TTT gain than the conventional LR=0.002. Key changes:
- TTT_OPTIMIZER env var: "sgd" (default) or "adamw"
- Default LR: 0.0001 -> 1.0 (SGD)
- Default epochs: 4 -> 20
- Default freeze_blocks: 2 -> 0 (all blocks unfrozen)

PR openai#757 showed that freeze=0 + high LR converges fine; the extra capacity absorbs the aggressive learning rate. 20 epochs × ~16s ≈ ~320s on 8xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
QAT (training-time):
- LATE_QAT=1, SOFT_ROUND_QAT=1
- QAT_THRESHOLD=0.5 (early QAT, ~50% of warmdown)
- QUANT_PERCENTILE=0.9999 (clip outliers)

TTT (eval-time, inspired by PR openai#757):
- SGD lr=1.0, 20 epochs, all blocks unfrozen
- Score-first, compliant with Issue openai#677

TTT_ENABLED still defaults to 0 — must be explicitly enabled with TTT_ENABLED=1 to activate TTT at eval time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Heads up: Sooo, the downside to not checking the PRs/issues while experimenting is that you miss important things. After seeing Issue #677, I'm reviewing my TTT implementation against the score-first requirement. The current version runs 30 epochs of TTT and then re-scores with a sliding window, which I believe violates the intent of the rules. Working on a legal single-pass score-first version now; I'll push updated logs once validated. Apologies for the oversight.
Update: Looking at PR #549's score-first implementation as the reference. The plan is to restructure TTT to score each chunk before training on it (the same pattern as the merged SOTA). Currently validating on 4xH200; I'll push updated code and logs once confirmed (I've applied for a Runpod credit grant).
Converted to a draft while I figure out my compute situation. Sorry for the trouble.
Aggressive SGD TTT (val_bpb: 1.1124)
3-seed mean val_bpb: 1.1124 (std=0.0008) | 15.4 MB artifact | 8xH100 SXM, 600s training + 591s eval
Results
Approach
Standard 11L architecture, nothing exotic on the model side. The interesting part is the TTT. The base model trains for 600s, then TTT adapts all weights via SGD for 30 epochs on the validation data (score-first protocol).
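The score-first protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the repo's actual code: `score_fn` and `train_fn` stand in for the real sliding-window scoring and SGD-update routines, and the chunking is simplified. The key invariant is that each chunk is scored with the current weights *before* the model is allowed to train on it, so no chunk's score ever benefits from having adapted on that same chunk.

```python
# Hypothetical sketch of score-first TTT (illustrative names, not the repo API).

def score_first_ttt(model, chunks, score_fn, train_fn, epochs=30):
    """Score each chunk BEFORE adapting on it; return per-chunk scores."""
    scores = []
    for chunk in chunks:
        scores.append(score_fn(model, chunk))  # score with pre-update weights
        for _ in range(epochs):
            train_fn(model, chunk)             # only now adapt; helps later chunks
    return scores
```

Earlier chunks still improve the model for later chunks, which is where the TTT gain comes from, but the ordering keeps the protocol compliant with the Issue #677 requirement.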
The conventional wisdom is TTT at LR=0.002 for 3 epochs. We ran 20+ configurations on 4xH200 and found that cranking the LR to 1.0 and unfreezing every block turns a -0.0025 BPB technique into a -0.041 BPB technique. That's a 16x improvement from the same underlying method. It's like finding out your car has a sport mode you never tried.
TTT Configuration
I swept this on 4xH200 before validating on 8xH100. The sweep told the whole story.
TTT LR Sweep (4xH200, 20 epochs, freeze=2)
BPB just keeps getting better as the LR goes up... until it doesn't. With 2 frozen blocks, the peak is at LR=0.7.
Unfreezing all blocks (4xH200, 20 epochs)
This was the breakthrough. With 2 frozen blocks, LR=1.0 diverges. Unfreeze everything and it converges fine. The extra capacity from unfreezing absorbs the aggressive learning rate. It also shifts the optimal LR from 0.7 all the way up to 1.5.
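A minimal sketch of the freeze knob described above (the function name and structure are illustrative; the real implementation presumably toggles `requires_grad` on per-block parameters): `freeze_blocks=N` keeps the first N transformer blocks fixed during TTT, and `freeze_blocks=0` leaves all 11 trainable.

```python
# Illustrative only: models the freeze_blocks knob; a block is just an index here.

def trainable_block_indices(num_blocks: int, freeze_blocks: int) -> list[int]:
    """Blocks whose weights TTT may update: everything past the frozen prefix."""
    return list(range(freeze_blocks, num_blocks))
```

With `freeze_blocks=0` the LR=1.0 SGD step is spread across every block's parameters, which is the extra capacity the text above credits for keeping the aggressive update stable.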
Epoch scaling (4xH200, LR=1.0, freeze=0)
On 8xH100, each TTT epoch runs in ~16.6s (vs 28.5s on 4xH200), so 30 epochs fits within the 10-minute eval budget.
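The budget arithmetic checks out. A quick sanity calculation using the timings quoted in this PR (the 92 s sliding-window and 2 s roundtrip figures come from the reported eval totals):

```python
# Back-of-envelope check that 30 TTT epochs fit the 10-minute eval budget.
EPOCH_SECONDS = 16.6                 # per TTT epoch on 8xH100
TTT_EPOCHS = 30
ttt = EPOCH_SECONDS * TTT_EPOCHS     # ~498 s of TTT
total = ttt + 92 + 2                 # + sliding-window scoring + roundtrip
assert total < 600                   # inside the 10-minute eval budget
```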
Architecture
Training
Evaluation
Three phases, all within the 10-minute eval budget:
Total eval time: ~591s (TTT 497s + sliding window 92s + roundtrip 2s)
Run Command
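A hedged sketch of the invocation, built from the env vars named in the commit messages above; the `torchrun --nproc_per_node=8 train.py` entry point is an assumption, not the repo's confirmed command.

```shell
# Entry point is assumed; env var names are taken from the commits in this PR.
TTT_ENABLED=1 TTT_OPTIMIZER=sgd \
LATE_QAT=1 SOFT_ROUND_QAT=1 \
QAT_THRESHOLD=0.5 QUANT_PERCENTILE=0.9999 \
torchrun --nproc_per_node=8 train.py
```

Note that TTT_ENABLED=1 must be set explicitly, since TTT defaults to off.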
How I Got Here
~20 hours on 4xH200, 54 experiments. Started from the 9L baseline and worked forward:
Step 7 was where it got fun. Everything before that was incremental hill climbing. Unfreezing all blocks during TTT changed the optimization landscape enough that learning rates that previously diverged started converging, and the whole curve shifted.
Schrödinger's SOTA
This beats the merged leaderboard (1.1194) by 0.007 BPB. I haven't checked the pending PRs. Until they're merged, this is simultaneously a record and not a record, and I'm choosing to live in that superposition for a bit.
Credits
Built on the community's collective work, especially PR #414 (signalrush), PR #461 (Christopher-Lee-McClendon), and PR #549 (abaybektursun).