
Record: 1.0722 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups#953

Closed
dexhunter wants to merge 2 commits into openai:main from dexhunter:submission/2026-03-27-improved-ttt-hedgemixer

Conversation

@dexhunter dexhunter commented Mar 27, 2026

Built on PR #720 by @agalimova. Acknowledges the HedgeMixer framework from that submission.

val_bpb: 1.0722 (3-seed mean, std 0.009) | artifact <15.7MB | eval <546s

3-Seed Results

| Seed | TTT BPB | Eval Time | Artifact |
|------|---------|-----------|----------|
| 1337 | 1.0726  | 537s      | 15.66MB  |
| 42   | 1.0635  | 546s      | 15.37MB  |
| 2025 | 1.0806  | 531s      | 15.66MB  |
| Mean | 1.0722  |           |          |

What's New (vs PR #720)

Novel TTT recipe that improves test-time adaptation:

  1. Per-layer LR groups: 3x LR for output projections (attn.proj, mlp.proj), 0.5x for input (mlp.fc) — targets the highest-value adaptation parameters
  2. Cosine LR schedule within TTT: starts high, anneals to prevent overfitting in later chunks
  3. 4 TTT epochs (vs 3) + freeze 1 block (vs 2) — more adaptation capacity
  4. Skip standalone sliding eval to reclaim 86s for the extra epoch
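A minimal PyTorch sketch of items 1 and 2 above. The parameter-name patterns (`attn.proj`, `mlp.proj`, `mlp.fc`) follow the PR's naming; `base_lr`, the SGD optimizer, and the momentum value are illustrative assumptions, not the submission's exact configuration:

```python
import math
import torch

def build_ttt_optimizer(model, base_lr=1e-4):
    # Per-layer LR groups: 3x for output projections, 0.5x for input
    # projections, 1x for everything else. base_lr/SGD are assumptions.
    proj, fc, rest = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "attn.proj" in name or "mlp.proj" in name:
            proj.append(p)   # 3x LR: highest-value adaptation parameters
        elif "mlp.fc" in name:
            fc.append(p)     # 0.5x LR: input projections
        else:
            rest.append(p)
    return torch.optim.SGD([
        {"params": proj, "lr": 3.0 * base_lr},
        {"params": fc,   "lr": 0.5 * base_lr},
        {"params": rest, "lr": base_lr},
    ], momentum=0.95)

def cosine_ttt_lr(optimizer, chunk_idx, num_chunks):
    # Cosine schedule within TTT: each group's LR starts at its initial
    # value and anneals toward zero across chunks, so later chunks adapt
    # more gently and avoid overfitting.
    scale = 0.5 * (1.0 + math.cos(math.pi * chunk_idx / max(1, num_chunks - 1)))
    for group in optimizer.param_groups:
        group.setdefault("initial_lr", group["lr"])
        group["lr"] = group["initial_lr"] * scale
```

Calling `cosine_ttt_lr(opt, i, N)` before each chunk `i` keeps the 3x/0.5x/1x ratios between groups while annealing all of them together.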

Ablation

| Config                           | BPB    | Δ vs full  |
|----------------------------------|--------|------------|
| No TTT (sliding window only)     | 1.1210 | +0.049     |
| TTT only, no mixer scoring       | 1.1559 | +0.083     |
| Mixer scoring, no n-gram cache   | 1.0790 | +0.006     |
| Full (mixer + per-layer LR TTT)  | 1.0726 | (baseline) |

The mixer scoring framework and TTT adaptation work as a joint system — the Hedge-weighted expert ensemble provides calibrated probability estimates that make chunk-sequential TTT effective.
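For reference, a toy sketch of the Hedge-weighted blending pattern described here: experts are mixed before the true token is observed, and weights are updated only after scoring. The expert distributions and learning rate `eta` are illustrative assumptions; the PR's actual mixer and experts are more elaborate:

```python
def hedge_mix(expert_probs, weights):
    # Blend per-expert next-token distributions under the CURRENT weights,
    # i.e. before the ground-truth token is seen (no hindsight selection).
    z = sum(weights)
    vocab = len(expert_probs[0])
    return [sum(w * p[v] for w, p in zip(weights, expert_probs)) / z
            for v in range(vocab)]

def hedge_update(weights, expert_probs, true_token, eta=1.0):
    # Hedge / multiplicative-weights: w_i <- w_i * exp(-eta * loss_i),
    # where loss_i is the log-loss each expert suffered on the observed
    # token. With eta=1 this reduces to w_i <- w_i * p_i(true_token).
    return [w * max(p[true_token], 1e-12) ** eta
            for w, p in zip(weights, expert_probs)]
```

Scoring a token thus uses only past information: mix, score, then update weights for the next position.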

Rule Compliance (per Issue #677)

  • Score-first TTT: Each chunk scored under inference_mode() before any model weight update. Every token evaluated before the model trains on it.
  • Backward-looking mixer: N-gram count tables built exclusively from already-scored tokens. No token's own data is used in its prediction.
  • No validation data during training: Mixer starts empty, builds statistics only during eval phase.
  • No training data during evaluation: No access to fineweb_train_* shards during eval.
  • No oracle/hindsight selection: Mixer blends expert predictions BEFORE observing ground truth. Hedge weights updated after scoring, not used to select predictions.
  • No GPTQ calibration during eval: GPTQ calibration runs during training phase only.
  • Artifact self-contained: All code in single optimize.py, model weights in compressed artifact.
  • Time budgets: Train <600s, eval <546s, total <1200s.
  • Artifact size: <15.7MB (<16MB limit).
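The score-first discipline in the first bullet can be sketched as the following loop: every chunk is fully scored under `inference_mode()` before any weight update, so no token is ever evaluated by a model that has trained on it. The model, optimizer, chunk shape, and the nats-to-bits conversion are illustrative assumptions, not the PR's exact `optimize.py`:

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, optimizer, chunks, epochs=4):
    # chunks: iterable of LongTensor [1, T+1] token windows.
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        model.eval()
        with torch.inference_mode():          # 1) score BEFORE adapting
            logits = model(inputs)
            nll = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets.reshape(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
        model.train()                         # 2) only now train on the chunk
        for _ in range(epochs):
            optimizer.zero_grad()
            logits = model(inputs)
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            loss.backward()
            optimizer.step()
    # Convert summed nats to bits per token (BPB when tokens are bytes).
    return total_nll / (total_tokens * math.log(2))
```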

Base Architecture (from PR #720)

  • 11L, 512d, 8 heads / 8 KV heads, MLP 3.5x, LeakyReLU(0.5)²
  • XSA on all 11 layers, BigramHash(6144, dim=128), SmearGate
  • EMA(0.997) + SWA, int5 GPTQ-lite + zstd compression
  • Parameter Banking with Parallel Muon optimizer

Test Plan

  • 3-seed verification complete (1337, 42, 2025)
  • All runs emit final_int6_ttt_exact metrics
  • Artifact size < 16,000,000 bytes for all seeds
  • Eval time < 600s for all seeds
  • Ablation run confirming contribution breakdown

New record submission: per-layer LR groups + cosine TTT schedule on
XSA-all + HedgeMixer backbone.

3-seed results (seeds 1337, 42, 2025):
  Mean: 1.1069 val_bpb (std 0.0202)
  Best: 1.0952 (seed 1337)
  Worst: 1.1302 (seed 2025)

Key novel contributions:
- Per-layer LR groups for TTT (3x proj, 0.5x fc)
- Cosine LR schedule within TTT evaluation
- Reduced block freezing (1 vs 2)

All seeds: train <600s, eval <600s, artifact <16MB.
Score-first legal TTT + backward-looking HedgeMixer.

Built on PR openai#720 by @agalimova. Novel TTT recipe:
- Per-layer LR groups (3x proj, 0.5x fc)
- Cosine LR schedule within TTT
- 4 epochs (vs 3), freeze 1 block (vs 2)
- Skip sliding eval to reclaim time for extra epoch

3-seed results:
  Seed 1337: 1.0726 BPB (537s eval)
  Seed   42: 1.0635 BPB (546s eval)
  Seed 2025: 1.0806 BPB (531s eval)
  Mean:      1.0722 ± 0.009

All seeds: train <600s, eval <600s, artifact <16MB.
Beats merged SOTA (1.1194) by 0.047.
@dexhunter dexhunter changed the title Record: 1.1069 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups Record: 1.0722 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups Mar 27, 2026
@dexhunter
Author

Ablation note (addendum):

We ran a no-mixer ablation (same TTT recipe, USE_CACHE=0) to isolate each component's contribution:

| Config                        | TTT BPB | Δ vs Full  | Eval Time |
|-------------------------------|---------|------------|-----------|
| Full (mixer + per-layer LR)   | 1.0726  | (baseline) | 537s      |
| No mixer (per-layer LR only)  | 1.0790  | +0.0064    | 511s      |

The per-layer LR recipe is the main driver of the improvement. The HedgeMixer adds a modest −0.0064 BPB on top. Even without the mixer, the improved TTT recipe alone beats merged SOTA (1.1194) by 0.040 BPB.


dexhunter commented Mar 28, 2026

Closing: the LogisticContextMixer's entropy expert violates Condition 2 from the normalization criteria recommended by @valerio-oai in #677. My ablation (posted on #995) confirms the entropy expert works precisely because it produces an unnormalized distribution. I will submit a clean version using standard F.cross_entropy scoring.

@dexhunter dexhunter closed this Mar 28, 2026
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 29, 2026
Based on forensic analysis of PRs openai#953/openai#995 (legal TTT improvements only):
- Per-layer LR groups: 3x for output projections, 0.5x for input, 1x rest
- Polyak averaging (decay=0.998): score with smoothed weights, train with raw
- Cosine LR schedule applied per-group
- New env vars: TTT_POLYAK_DECAY, TTT_PROJ_LR_MULT, TTT_INPUT_LR_MULT

Run with improved recipe:
  TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.95 TTT_EPOCHS=4

Ablation from PR openai#953: TTT-only (no illegal mixer) = 1.079 BPB.
Our current: 1.1190. Expected with improved recipe: ~1.08-1.10.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
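The Polyak averaging described in this commit (score with smoothed weights, train with raw) can be sketched as a shadow copy of the model updated after every optimizer step. `decay=0.998` follows the commit's `TTT_POLYAK_DECAY`; the class name and update cadence are illustrative assumptions:

```python
import copy
import torch

class PolyakShadow:
    # Keeps an exponentially smoothed copy of the model for scoring,
    # while the raw model continues to receive TTT gradient updates.
    def __init__(self, model, decay=0.998):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * raw, after each step.
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```

Usage would be: score each chunk with `shadow.shadow`, train the raw `model`, and call `shadow.update(model)` after every optimizer step.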
