Record: 1.0450 BPB — SGD TTT + HedgeMixer with Per-Layer LR Groups #967

Closed

dexhunter wants to merge 1 commit into openai:main from dexhunter:submission/2026-03-27-sgd-ttt-hedgemixer-1.0450

Conversation

@dexhunter

Summary

val_bpb: 1.0450 (3-seed mean, std 0.012) | eval <545s | artifact <15.7MB

Built on PR #720 by @agalimova. Improves on our earlier PR #953 (1.0722 → 1.0450).

Key Change: SGD Optimizer for TTT

Switched the TTT optimizer from AdamW (lr=0.0005) to SGD with momentum=0.9 (lr=0.002). This single change accounts for -0.041 BPB (at seed 1337; see the ablation below). SGD's more aggressive parameter updates better exploit the short adaptation window.
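The optimizer construction itself isn't shown in this PR. As a rough, pure-Python sketch of what SGD with momentum plus per-layer LR groups does (the submission presumably builds torch.optim.SGD param groups inside train_gpt.py; the layer names and the lower "mixer" rate below are invented for illustration):

```python
# Hedged sketch of SGD with momentum and per-layer LR groups.
# Hypothetical stand-in: the layer names and the lower mixer LR are
# invented; only TTT_LR=0.002 and momentum=0.9 come from the PR.

params = {"backbone": 1.0, "mixer": 1.0}       # toy 1-D "layers"
velocity = {name: 0.0 for name in params}

# Per-layer LR groups: backbone at TTT_LR=0.002 (from the PR),
# mixer at a hypothetical lower rate.
groups = [(["backbone"], 0.002), (["mixer"], 0.001)]

def sgd_momentum_step(grads, momentum=0.9):
    """v <- momentum * v - lr * g ; p <- p + v, with lr chosen per group."""
    for names, lr in groups:
        for n in names:
            velocity[n] = momentum * velocity[n] - lr * grads[n]
            params[n] += velocity[n]

# Drive each toy layer toward the minimum of f(p) = p**2 (gradient 2p).
for _ in range(500):
    sgd_momentum_step({n: 2.0 * p for n, p in params.items()})
```

The point of the grouping is that the short TTT window can run the backbone hot while more sensitive modules adapt at a lower rate.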

3-Seed Results

| Seed | TTT BPB | Eval Time | Artifact |
|------|---------|-----------|----------|
| 1337 | 1.0312  | 540s      | 15.57MB  |
| 42   | 1.0503  | 533s      | 15.67MB  |
| 2025 | 1.0535  | 544s      | 15.15MB  |
| Mean | 1.0450  |           |          |
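The headline mean and std can be reproduced from the per-seed values (sample std with n-1, which is the conventional choice; nothing here beyond arithmetic on the table):

```python
# Reproduce the reported 3-seed statistics from the per-seed TTT BPB values.
import statistics

bpb = {1337: 1.0312, 42: 1.0503, 2025: 1.0535}
mean = sum(bpb.values()) / len(bpb)
std = statistics.stdev(bpb.values())  # sample std (n-1)
print(round(mean, 4), round(std, 3))  # -> 1.045 0.012
```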

Ablation

| Config                     | BPB    | Notes                |
|----------------------------|--------|----------------------|
| Full (SGD TTT + mixer)     | 1.0312 | This submission      |
| AdamW TTT + mixer          | 1.0726 | Previous PR #953     |
| No mixer (AdamW TTT only)  | 1.1559 | Mixer is essential   |
| No TTT (sliding only)      | 1.1210 | Pure neural baseline |
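For the record, the deltas implied by these seed-1337 rows (simple arithmetic on the table above, nothing new):

```python
# Deltas implied by the ablation rows (all at seed 1337).
full     = 1.0312  # SGD TTT + mixer (this submission)
adamw    = 1.0726  # AdamW TTT + mixer (PR #953 config)
no_mixer = 1.1559  # AdamW TTT only
no_ttt   = 1.1210  # sliding only

print(round(adamw - full, 3))      # SGD switch           -> 0.041
print(round(no_mixer - adamw, 3))  # mixer's contribution -> 0.083
print(round(no_ttt - adamw, 3))    # TTT's contribution   -> 0.048
```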

Run

```
export NCCL_NET=Socket SKIP_SLIDING=1 TTT_OPTIMIZER=sgd TTT_LR=0.002
SEED=1337 torchrun --nproc_per_node=8 train_gpt.py
```
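The command above launches a single seed; a sketch of looping it over the three reported seeds (commands are echoed rather than executed here, since torchrun needs the submission's 8-GPU environment):

```shell
# Hedged sketch: same env as the PR's run block, looped over all seeds.
export NCCL_NET=Socket SKIP_SLIDING=1 TTT_OPTIMIZER=sgd TTT_LR=0.002
for SEED in 1337 42 2025; do
  echo "SEED=$SEED torchrun --nproc_per_node=8 train_gpt.py"
done
```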

Compliance

- 3 seeds, all train ≤600s, eval ≤600s, artifact ≤16MB
- Score-first legal TTT + backward-looking HedgeMixer
- No external data access during eval

Test Plan

- 3-seed verification (1337, 42, 2025)
- Ablation: no-mixer, no-TTT baselines measured
- All final_int6_ttt_exact metrics emitted

@dexhunter
Author

dexhunter commented Mar 28, 2026

Closing: the LogisticContextMixer's entropy expert violates Condition 2 from the normalization criteria recommended by @valerio-oai in #677. My ablation (posted on #995) confirms the entropy expert works precisely because it produces an unnormalized distribution. I will submit a clean version using standard F.cross_entropy scoring.
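For reference, what properly normalized scoring enforces: BPB is the mean negative log2 probability of each target under a distribution that sums to 1 over the vocabulary, which is exactly what standard cross-entropy scoring (F.cross_entropy in torch) computes via a softmax. A pure-Python sketch, not the harness's actual scorer:

```python
# Hedged sketch: BPB from a normalized next-symbol distribution.
# An unnormalized "entropy expert" breaks the sum-to-1 invariant that
# the log-partition term z enforces below.
import math

def bpb(logits_per_pos, targets):
    """Mean -log2 softmax(logits)[target]; probabilities sum to 1."""
    total = 0.0
    for logits, t in zip(logits_per_pos, targets):
        z = math.log(sum(math.exp(l) for l in logits))  # log partition
        total += -(logits[t] - z) / math.log(2)         # -log2 prob
    return total / len(targets)

# Uniform logits over 4 symbols -> exactly 2 bits per symbol.
print(bpb([[0.0, 0.0, 0.0, 0.0]], [0]))  # -> 2.0
```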
