
Record: 1.0722 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups#953

Closed
dexhunter wants to merge 2 commits into openai:main from dexhunter:submission/2026-03-27-improved-ttt-hedgemixer

Conversation

@dexhunter dexhunter commented Mar 27, 2026

Built on PR #720 by @agalimova. Acknowledges the HedgeMixer framework from that submission.

val_bpb: 1.0722 (3-seed mean, std 0.009) | artifact <15.7MB | eval <546s

3-Seed Results

| Seed | TTT BPB | Eval Time | Artifact |
|------|---------|-----------|----------|
| 1337 | 1.0726  | 537s      | 15.66MB  |
| 42   | 1.0635  | 546s      | 15.37MB  |
| 2025 | 1.0806  | 531s      | 15.66MB  |
| Mean | 1.0722  |           |          |

What's New (vs PR #720)

Novel TTT recipe that improves test-time adaptation:

  1. Per-layer LR groups: 3x LR for output projections (attn.proj, mlp.proj), 0.5x for input (mlp.fc) — targets the highest-value adaptation parameters
  2. Cosine LR schedule within TTT: starts high, anneals to prevent overfitting in later chunks
  3. 4 TTT epochs (vs 3) + freeze 1 block (vs 2) — more adaptation capacity
  4. Skip standalone sliding eval to reclaim 86s for the extra epoch
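A minimal PyTorch sketch of items 1 and 2 above. The parameter-name patterns (`attn.proj`, `mlp.proj`, `mlp.fc`) follow the PR's naming; `base_lr`, the SGD optimizer, and the momentum value are illustrative assumptions, not the submission's exact configuration:

```python
import math
import torch

def build_ttt_optimizer(model, base_lr=1e-4):
    # Per-layer LR groups: 3x for output projections, 0.5x for input
    # projections, 1x for everything else. base_lr/SGD are assumptions.
    proj, fc, rest = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "attn.proj" in name or "mlp.proj" in name:
            proj.append(p)   # 3x LR: highest-value adaptation parameters
        elif "mlp.fc" in name:
            fc.append(p)     # 0.5x LR: input projections
        else:
            rest.append(p)
    return torch.optim.SGD([
        {"params": proj, "lr": 3.0 * base_lr},
        {"params": fc,   "lr": 0.5 * base_lr},
        {"params": rest, "lr": base_lr},
    ], momentum=0.95)

def cosine_ttt_lr(optimizer, chunk_idx, num_chunks):
    # Cosine schedule within TTT: each group's LR starts at its initial
    # value and anneals toward zero across chunks, so later chunks adapt
    # more gently and avoid overfitting.
    scale = 0.5 * (1.0 + math.cos(math.pi * chunk_idx / max(1, num_chunks - 1)))
    for group in optimizer.param_groups:
        group.setdefault("initial_lr", group["lr"])
        group["lr"] = group["initial_lr"] * scale
```

Calling `cosine_ttt_lr(opt, i, N)` before each chunk `i` keeps the 3x/0.5x/1x ratios between groups while annealing all of them together.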

Ablation

| Config                           | BPB    | Δ vs full  |
|----------------------------------|--------|------------|
| No TTT (sliding window only)     | 1.1210 | +0.049     |
| TTT only, no mixer scoring       | 1.1559 | +0.083     |
| Mixer scoring, no n-gram cache   | 1.0790 | +0.006     |
| Full (mixer + per-layer LR TTT)  | 1.0726 | (baseline) |

The mixer scoring framework and TTT adaptation work as a joint system — the Hedge-weighted expert ensemble provides calibrated probability estimates that make chunk-sequential TTT effective.
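For reference, a toy sketch of the Hedge-weighted blending pattern described here: experts are mixed before the true token is observed, and weights are updated only after scoring. The expert distributions and learning rate `eta` are illustrative assumptions; the PR's actual mixer and experts are more elaborate:

```python
def hedge_mix(expert_probs, weights):
    # Blend per-expert next-token distributions under the CURRENT weights,
    # i.e. before the ground-truth token is seen (no hindsight selection).
    z = sum(weights)
    vocab = len(expert_probs[0])
    return [sum(w * p[v] for w, p in zip(weights, expert_probs)) / z
            for v in range(vocab)]

def hedge_update(weights, expert_probs, true_token, eta=1.0):
    # Hedge / multiplicative-weights: w_i <- w_i * exp(-eta * loss_i),
    # where loss_i is the log-loss each expert suffered on the observed
    # token. With eta=1 this reduces to w_i <- w_i * p_i(true_token).
    return [w * max(p[true_token], 1e-12) ** eta
            for w, p in zip(weights, expert_probs)]
```

Scoring a token thus uses only past information: mix, score, then update weights for the next position.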

Rule Compliance (per Issue #677)

  • Score-first TTT: Each chunk scored under inference_mode() before any model weight update. Every token evaluated before the model trains on it.
  • Backward-looking mixer: N-gram count tables built exclusively from already-scored tokens. No token's own data is used in its prediction.
  • No validation data during training: Mixer starts empty, builds statistics only during eval phase.
  • No training data during evaluation: No access to fineweb_train_* shards during eval.
  • No oracle/hindsight selection: Mixer blends expert predictions BEFORE observing ground truth. Hedge weights updated after scoring, not used to select predictions.
  • No GPTQ calibration during eval: GPTQ calibration runs during training phase only.
  • Artifact self-contained: All code in single optimize.py, model weights in compressed artifact.
  • Time budgets: Train <600s, eval <546s, total <1200s.
  • Artifact size: <15.7MB (<16MB limit).
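The score-first discipline in the first bullet can be sketched as the following loop: every chunk is fully scored under `inference_mode()` before any weight update, so no token is ever evaluated by a model that has trained on it. The model, optimizer, chunk shape, and the nats-to-bits conversion are illustrative assumptions, not the PR's exact `optimize.py`:

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, optimizer, chunks, epochs=4):
    # chunks: iterable of LongTensor [1, T+1] token windows.
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        model.eval()
        with torch.inference_mode():          # 1) score BEFORE adapting
            logits = model(inputs)
            nll = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets.reshape(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
        model.train()                         # 2) only now train on the chunk
        for _ in range(epochs):
            optimizer.zero_grad()
            logits = model(inputs)
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            loss.backward()
            optimizer.step()
    # Convert summed nats to bits per token (BPB when tokens are bytes).
    return total_nll / (total_tokens * math.log(2))
```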

Base Architecture (from PR #720)

  • 11L, 512d, 8 heads / 8 KV heads, MLP 3.5x, LeakyReLU(0.5)²
  • XSA on all 11 layers, BigramHash(6144, dim=128), SmearGate
  • EMA(0.997) + SWA, int5 GPTQ-lite + zstd compression
  • Parameter Banking with Parallel Muon optimizer

Test Plan

  • 3-seed verification complete (1337, 42, 2025)
  • All runs emit final_int6_ttt_exact metrics
  • Artifact size < 16,000,000 bytes for all seeds
  • Eval time < 600s for all seeds
  • Ablation run confirming contribution breakdown

New record submission: per-layer LR groups + cosine TTT schedule on
XSA-all + HedgeMixer backbone.

3-seed results (seeds 1337, 42, 2025):
  Mean: 1.1069 val_bpb (std 0.0202)
  Best: 1.0952 (seed 1337)
  Worst: 1.1302 (seed 2025)

Key novel contributions:
- Per-layer LR groups for TTT (3x proj, 0.5x fc)
- Cosine LR schedule within TTT evaluation
- Reduced block freezing (1 vs 2)

All seeds: train <600s, eval <600s, artifact <16MB.
Score-first legal TTT + backward-looking HedgeMixer.

Built on PR openai#720 by @agalimova. Novel TTT recipe:
- Per-layer LR groups (3x proj, 0.5x fc)
- Cosine LR schedule within TTT
- 4 epochs (vs 3), freeze 1 block (vs 2)
- Skip sliding eval to reclaim time for extra epoch

3-seed results:
  Seed 1337: 1.0726 BPB (537s eval)
  Seed   42: 1.0635 BPB (546s eval)
  Seed 2025: 1.0806 BPB (531s eval)
  Mean:      1.0722 ± 0.009

All seeds: train <600s, eval <600s, artifact <16MB.
Beats merged SOTA (1.1194) by 0.047.
@dexhunter dexhunter changed the title Record: 1.1069 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups Record: 1.0722 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups Mar 27, 2026
@dexhunter
Author

Ablation note (addendum):

We ran a no-mixer ablation (same TTT recipe, USE_CACHE=0) to isolate each component's contribution:

| Config                        | TTT BPB | Δ vs Full  | Eval Time |
|-------------------------------|---------|------------|-----------|
| Full (mixer + per-layer LR)   | 1.0726  | (baseline) | 537s      |
| No mixer (per-layer LR only)  | 1.0790  | +0.0064    | 511s      |

The per-layer LR recipe is the main driver of the improvement. The HedgeMixer adds a modest −0.0064 BPB on top. Even without the mixer, the improved TTT recipe alone beats merged SOTA (1.1194) by 0.040 BPB.


dexhunter commented Mar 28, 2026

Closing: the LogisticContextMixer's entropy expert violates Condition 2 from the normalization criteria recommended by @valerio-oai in #677. My ablation (posted on #995) confirms the entropy expert works precisely because it produces an unnormalized distribution. I will submit a clean version using standard F.cross_entropy scoring.

@dexhunter dexhunter closed this Mar 28, 2026
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 29, 2026
Based on forensic analysis of PRs openai#953/openai#995 (legal TTT improvements only):
- Per-layer LR groups: 3x for output projections, 0.5x for input, 1x rest
- Polyak averaging (decay=0.998): score with smoothed weights, train with raw
- Cosine LR schedule applied per-group
- New env vars: TTT_POLYAK_DECAY, TTT_PROJ_LR_MULT, TTT_INPUT_LR_MULT

Run with improved recipe:
  TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.95 TTT_EPOCHS=4

Ablation from PR openai#953: TTT-only (no illegal mixer) = 1.079 BPB.
Our current: 1.1190. Expected with improved recipe: ~1.08-1.10.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
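The Polyak averaging described in this commit (score with smoothed weights, train with raw) can be sketched as a shadow copy of the model updated after every optimizer step. `decay=0.998` follows the commit's `TTT_POLYAK_DECAY`; the class name and update cadence are illustrative assumptions:

```python
import copy
import torch

class PolyakShadow:
    # Keeps an exponentially smoothed copy of the model for scoring,
    # while the raw model continues to receive TTT gradient updates.
    def __init__(self, model, decay=0.998):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * raw, after each step.
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```

Usage would be: score each chunk with `shadow.shadow`, train the raw `model`, and call `shadow.update(model)` after every optimizer step.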
