Record: 1.0722 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups #953
Closed
dexhunter wants to merge 2 commits into openai:main from
Conversation
New record submission: per-layer LR groups + cosine TTT schedule on XSA-all + HedgeMixer backbone.

3-seed results (seeds 1337, 42, 2025):
- Mean: 1.1069 val_bpb (std 0.0202)
- Best: 1.0952 (seed 1337)
- Worst: 1.1302 (seed 2025)

Key novel contributions:
- Per-layer LR groups for TTT (3x proj, 0.5x fc)
- Cosine LR schedule within TTT evaluation
- Reduced block freezing (1 vs 2)

All seeds: train <600s, eval <600s, artifact <16MB. Score-first legal TTT + backward-looking HedgeMixer.
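The "score-first" constraint mentioned above can be illustrated with a minimal, framework-agnostic sketch. The scoring and adaptation functions here are toy stand-ins, not the PR's actual model; the point is only the ordering guarantee: every chunk is scored before the model is allowed to train on it.

```python
# Hedged sketch of a "score-first" chunk-sequential TTT loop: each chunk is
# scored with the current (not-yet-adapted) weights BEFORE the model trains
# on it, so no token is ever evaluated after being trained on.
def score_first_ttt(chunks, score_fn, train_fn):
    """Score each chunk, then adapt on it; returns per-chunk scores."""
    scores = []
    for chunk in chunks:
        scores.append(score_fn(chunk))  # evaluate before any weight update
        train_fn(chunk)                 # only now may "weights" change
    return scores

# Toy "model": the weights are a set of seen tokens; the score counts
# tokens the model has not yet trained on.
seen = set()
score = lambda chunk: sum(tok not in seen for tok in chunk)
train = lambda chunk: seen.update(chunk)
print(score_first_ttt([["a", "b"], ["a", "c"]], score, train))  # [2, 1]
```

Note that "a" in the second chunk still scores as seen only because the first chunk was trained on after being scored; reversing the two calls inside the loop would leak future information.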
Built on PR openai#720 by @agalimova.

Novel TTT recipe:
- Per-layer LR groups (3x proj, 0.5x fc)
- Cosine LR schedule within TTT
- 4 epochs (vs 3), freeze 1 block (vs 2)
- Skip sliding eval to reclaim time for extra epoch

3-seed results:
- Seed 1337: 1.0726 BPB (537s eval)
- Seed 42: 1.0635 BPB (546s eval)
- Seed 2025: 1.0806 BPB (531s eval)
- Mean: 1.0722 ± 0.009

All seeds: train <600s, eval <600s, artifact <16MB. Beats merged SOTA (1.1194) by 0.047.
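A minimal sketch of how per-layer LR groups and a cosine schedule could combine, assuming name-based matching on `proj`/`fc` substrings (the PR's actual parameter-matching rules and base LR are not shown, so the names and constants below are illustrative):

```python
import math

# Multipliers taken from the PR description: 3x for projections,
# 0.5x for fc layers, 1x for everything else.
PROJ_LR_MULT = 3.0
FC_LR_MULT = 0.5

def lr_multiplier(param_name):
    """Map a parameter name to its per-layer LR multiplier (assumed matching)."""
    if "proj" in param_name:
        return PROJ_LR_MULT
    if "fc" in param_name:
        return FC_LR_MULT
    return 1.0

def cosine_lr(base_lr, step, total_steps, min_lr_frac=0.0):
    """Cosine decay from base_lr at step 0 toward min_lr_frac * base_lr."""
    progress = min(step / max(total_steps, 1), 1.0)
    scale = min_lr_frac + (1 - min_lr_frac) * 0.5 * (1 + math.cos(math.pi * progress))
    return base_lr * scale

def group_lrs(param_names, base_lr, step, total_steps):
    """Per-parameter LR = per-layer multiplier x cosine-scheduled base LR."""
    sched = cosine_lr(base_lr, step, total_steps)
    return {name: lr_multiplier(name) * sched for name in param_names}

names = ["blocks.0.attn.out_proj.weight", "blocks.0.mlp.fc1.weight", "blocks.0.ln.weight"]
print(group_lrs(names, base_lr=1e-3, step=0, total_steps=100))
```

In a PyTorch setup the same idea would be expressed as optimizer parameter groups, with the scheduler applied per group; this sketch only shows the arithmetic.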
Author

Ablation note (addendum): We ran a no-mixer ablation (same TTT recipe, HedgeMixer disabled). The per-layer LR recipe is the main driver of the improvement; the HedgeMixer adds a modest −0.0064 BPB on top. Even without the mixer, the improved TTT recipe alone beats merged SOTA (1.1194) by 0.040 BPB.
This was referenced Mar 27, 2026
Author

Closing: the LogisticContextMixer's entropy expert violates Condition 2 of the normalization criteria recommended by @valerio-oai in #677. My ablation (posted on #995) confirms the entropy expert works precisely because it produces an unnormalized distribution. I will submit a clean version using standard F.cross_entropy scoring.
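For reference, a pure-Python sketch of what "standard F.cross_entropy scoring" enforces: each token's log-probability is drawn from a properly normalized distribution (log-softmax), so probabilities sum to exactly 1 and no unnormalized mass can deflate the BPB. The function names here are illustrative, not the repo's:

```python
import math

def log_softmax(logits):
    """Proper normalization: resulting probabilities sum to exactly 1."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cross_entropy_bpb(logits_per_token, targets):
    """Mean negative log-prob of the targets, converted from nats to bits.

    Matches torch's F.cross_entropy(logits, targets) / math.log(2).
    """
    nll = 0.0
    for logits, t in zip(logits_per_token, targets):
        nll -= log_softmax(logits)[t]
    return nll / len(targets) / math.log(2)

# Sanity check: uniform logits over 256 byte values give ~8.0 bits per byte.
print(cross_entropy_bpb([[0.0] * 256], [0]))
```

Any "expert" that adds probability mass outside this normalization can report a lower BPB than a legal model could, which is presumably why the entropy expert trips Condition 2.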
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request on Mar 29, 2026
Based on forensic analysis of PRs openai#953/openai#995 (legal TTT improvements only):
- Per-layer LR groups: 3x for output projections, 0.5x for input, 1x rest
- Polyak averaging (decay=0.998): score with smoothed weights, train with raw
- Cosine LR schedule applied per-group
- New env vars: TTT_POLYAK_DECAY, TTT_PROJ_LR_MULT, TTT_INPUT_LR_MULT

Run with improved recipe: TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.95 TTT_EPOCHS=4

Ablation from PR openai#953: TTT-only (no illegal mixer) = 1.079 BPB. Our current: 1.1190. Expected with improved recipe: ~1.08-1.10.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
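A minimal sketch of the Polyak-averaging idea in that commit message: keep an exponential moving average (EMA) of the weights, score with the EMA copy, and apply gradient updates only to the raw copy. Weights are modeled here as a flat list of floats; a real implementation would walk model parameters:

```python
# Hypothetical sketch; 0.998 is the TTT_POLYAK_DECAY value from the commit.
POLYAK_DECAY = 0.998

def polyak_update(ema_weights, raw_weights, decay=POLYAK_DECAY):
    """In-place EMA update: ema <- decay * ema + (1 - decay) * raw."""
    for i, w in enumerate(raw_weights):
        ema_weights[i] = decay * ema_weights[i] + (1.0 - decay) * w

raw = [1.0, -2.0]
ema = list(raw)   # initialize the smoothed copy from the initial weights
raw[0] += 0.5     # stand-in for one raw gradient step
polyak_update(ema, raw)
print(ema)        # EMA moves only 0.2% of the way toward the raw step
```

Scoring with the smoothed copy damps the noise of individual TTT steps; in PyTorch the same pattern is available via `torch.optim.swa_utils.AveragedModel` with an EMA averaging function.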
Built on PR #720 by @agalimova. Acknowledges the HedgeMixer framework from that submission.
val_bpb: 1.0722 (3-seed mean, std 0.009) | artifact <15.7MB | eval <546s
3-Seed Results

| Seed | val_bpb | Eval time |
|------|---------|-----------|
| 1337 | 1.0726  | 537s      |
| 42   | 1.0635  | 546s      |
| 2025 | 1.0806  | 531s      |

Mean: 1.0722 BPB (std 0.009)
What's New (vs PR #720)
Novel TTT recipe that improves test-time adaptation:

- Per-layer LR groups: 3x for output projections, 0.5x for fc layers, 1x elsewhere
- Cosine LR schedule within the TTT evaluation
- 4 TTT epochs instead of 3; freeze 1 block instead of 2
- Skip the sliding eval to reclaim time for the extra epoch
Ablation
The mixer scoring framework and TTT adaptation work as a joint system — the Hedge-weighted expert ensemble provides calibrated probability estimates that make chunk-sequential TTT effective.
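Assuming "Hedge-weighted" refers to the classic Hedge / exponential-weights update (the PR's actual mixer internals are not shown here, so this is a sketch of the named technique, not the implementation), the expert mixing might look like:

```python
import math

def hedge_mix(expert_probs_per_step, eta=1.0):
    """Sequentially mix expert probabilities with exponential weights.

    expert_probs_per_step[t][i] is expert i's probability for the byte
    actually observed at step t. Weights are updated multiplicatively,
    rewarding experts that assigned high probability to what occurred.
    """
    n = len(expert_probs_per_step[0])
    w = [1.0 / n] * n           # start from a uniform prior over experts
    mixed = []
    for probs in expert_probs_per_step:
        # Predict first (weight-averaged probability), then update weights.
        mixed.append(sum(wi * pi for wi, pi in zip(w, probs)))
        w = [wi * math.exp(eta * math.log(max(pi, 1e-12)))
             for wi, pi in zip(w, probs)]
        total = sum(w)
        w = [wi / total for wi in w]
    return mixed

print(hedge_mix([[0.9, 0.1], [0.8, 0.2]]))  # second step leans on expert 0
```

With `eta=1` this reduces to a Bayesian mixture over experts; the mixture probability is always a convex combination of the experts' probabilities, so it stays normalized whenever the experts are.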
Rule Compliance (per Issue #677)
Every chunk is scored under inference_mode() before any model weight update. Every token is evaluated before the model trains on it.

Base Architecture (from PR #720)
Test Plan
final_int6_ttt_exactmetrics