Record: 1.0450 BPB — SGD TTT + HedgeMixer with Per-Layer LR Groups #967

Closed

dexhunter wants to merge 1 commit into openai:main from dexhunter:submission/2026-03-27-sgd-ttt-hedgemixer-1.0450

Conversation

@dexhunter

Summary

val_bpb: 1.0450 (3-seed mean, std 0.012) | eval <545s | artifact <15.7MB

Built on PR #720 by @agalimova. Improves on our earlier PR #953 (1.0722 → 1.0450).

Key Change: SGD Optimizer for TTT

Switched the TTT optimizer from AdamW (lr=0.0005) to SGD with momentum=0.9 (lr=0.002). This single change accounts for -0.041 BPB (at seed 1337; see the ablation below). SGD's more aggressive parameter updates better exploit the short adaptation window.
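The optimizer construction itself isn't shown in this PR. As a rough, pure-Python sketch of what SGD with momentum plus per-layer LR groups does (the submission presumably builds torch.optim.SGD param groups inside train_gpt.py; the layer names and the lower "mixer" rate below are invented for illustration):

```python
# Hedged sketch of SGD with momentum and per-layer LR groups.
# Hypothetical stand-in: the layer names and the lower mixer LR are
# invented; only TTT_LR=0.002 and momentum=0.9 come from the PR.

params = {"backbone": 1.0, "mixer": 1.0}       # toy 1-D "layers"
velocity = {name: 0.0 for name in params}

# Per-layer LR groups: backbone at TTT_LR=0.002 (from the PR),
# mixer at a hypothetical lower rate.
groups = [(["backbone"], 0.002), (["mixer"], 0.001)]

def sgd_momentum_step(grads, momentum=0.9):
    """v <- momentum * v - lr * g ; p <- p + v, with lr chosen per group."""
    for names, lr in groups:
        for n in names:
            velocity[n] = momentum * velocity[n] - lr * grads[n]
            params[n] += velocity[n]

# Drive each toy layer toward the minimum of f(p) = p**2 (gradient 2p).
for _ in range(500):
    sgd_momentum_step({n: 2.0 * p for n, p in params.items()})
```

The point of the grouping is that the short TTT window can run the backbone hot while more sensitive modules adapt at a lower rate.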

3-Seed Results

| Seed | TTT BPB | Eval Time | Artifact |
|------|---------|-----------|----------|
| 1337 | 1.0312  | 540s      | 15.57MB  |
| 42   | 1.0503  | 533s      | 15.67MB  |
| 2025 | 1.0535  | 544s      | 15.15MB  |
| Mean | 1.0450  |           |          |
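The headline mean and std can be reproduced from the per-seed values (sample std with n-1, which is the conventional choice; nothing here beyond arithmetic on the table):

```python
# Reproduce the reported 3-seed statistics from the per-seed TTT BPB values.
import statistics

bpb = {1337: 1.0312, 42: 1.0503, 2025: 1.0535}
mean = sum(bpb.values()) / len(bpb)
std = statistics.stdev(bpb.values())  # sample std (n-1)
print(round(mean, 4), round(std, 3))  # -> 1.045 0.012
```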

Ablation

| Config                     | BPB    | Notes                |
|----------------------------|--------|----------------------|
| Full (SGD TTT + mixer)     | 1.0312 | This submission      |
| AdamW TTT + mixer          | 1.0726 | Previous PR #953     |
| No mixer (AdamW TTT only)  | 1.1559 | Mixer is essential   |
| No TTT (sliding only)      | 1.1210 | Pure neural baseline |
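For the record, the deltas implied by these seed-1337 rows (simple arithmetic on the table above, nothing new):

```python
# Deltas implied by the ablation rows (all at seed 1337).
full     = 1.0312  # SGD TTT + mixer (this submission)
adamw    = 1.0726  # AdamW TTT + mixer (PR #953 config)
no_mixer = 1.1559  # AdamW TTT only
no_ttt   = 1.1210  # sliding only

print(round(adamw - full, 3))      # SGD switch           -> 0.041
print(round(no_mixer - adamw, 3))  # mixer's contribution -> 0.083
print(round(no_ttt - adamw, 3))    # TTT's contribution   -> 0.048
```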

Run

```
export NCCL_NET=Socket SKIP_SLIDING=1 TTT_OPTIMIZER=sgd TTT_LR=0.002
SEED=1337 torchrun --nproc_per_node=8 train_gpt.py
```
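The command above launches a single seed; a sketch of looping it over the three reported seeds (commands are echoed rather than executed here, since torchrun needs the submission's 8-GPU environment):

```shell
# Hedged sketch: same env as the PR's run block, looped over all seeds.
export NCCL_NET=Socket SKIP_SLIDING=1 TTT_OPTIMIZER=sgd TTT_LR=0.002
for SEED in 1337 42 2025; do
  echo "SEED=$SEED torchrun --nproc_per_node=8 train_gpt.py"
done
```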

Compliance

- 3 seeds, all train ≤600s, eval ≤600s, artifact ≤16MB
- Score-first legal TTT + backward-looking HedgeMixer
- No external data access during eval

Test Plan

- 3-seed verification (1337, 42, 2025)
- Ablation: no-mixer, no-TTT baselines measured
- All final_int6_ttt_exact metrics emitted

@dexhunter
Author

dexhunter commented Mar 28, 2026

Closing: the LogisticContextMixer's entropy expert violates Condition 2 from the normalization criteria recommended by @valerio-oai in #677. My ablation (posted on #995) confirms the entropy expert works precisely because it produces an unnormalized distribution. I will submit a clean version using standard F.cross_entropy scoring.
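For reference, what properly normalized scoring enforces: BPB is the mean negative log2 probability of each target under a distribution that sums to 1 over the vocabulary, which is exactly what standard cross-entropy scoring (F.cross_entropy in torch) computes via a softmax. A pure-Python sketch, not the harness's actual scorer:

```python
# Hedged sketch: BPB from a normalized next-symbol distribution.
# An unnormalized "entropy expert" breaks the sum-to-1 invariant that
# the log-partition term z enforces below.
import math

def bpb(logits_per_pos, targets):
    """Mean -log2 softmax(logits)[target]; probabilities sum to 1."""
    total = 0.0
    for logits, t in zip(logits_per_pos, targets):
        z = math.log(sum(math.exp(l) for l in logits))  # log partition
        total += -(logits[t] - z) / math.log(2)         # -log2 prob
    return total / len(targets)

# Uniform logits over 4 symbols -> exactly 2 bits per symbol.
print(bpb([[0.0, 0.0, 0.0, 0.0]], [0]))  # -> 2.0
```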
