Late Training Replay + EMA + GPTQ-lite (val_bpb=1.1236, 2-seed, no TTT on eval)#445
Closed
newjordan wants to merge 9 commits into openai:main from
Conversation
… gravity needs more steps
Seeds 42 and 2024 in progress. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 1337: 1.12319 BPB, 15.68 MB (qualifying)
Seed 42: 1.12397 BPB, 16.37 MB (over size, validates BPB)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 1337: 1.1232 BPB, 15.68 MB
Seed 42: 1.1240 BPB, 16.37 MB (over size, validates BPB)
Seed 2024: 1.1239 BPB, 15.50 MB
2-seed mean (1337+2024): 1.1236
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
Updated per issue #402: renamed 'TTT Burst' to 'Late Training Replay' to clarify that this does NOT perform test-time training on validation/eval tokens. The replay step buffers and replays training data only (the last 100 training batches) as a late fine-tuning step before EMA finalization. No eval tokens are seen before scoring. Also closed our PR #390 (Sponge Bath), which did use TTT on eval tokens.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
…iques Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792 config produces ~20MB artifacts. Real non-TTT target is openai#445 at 1.1236. Novel technique analysis: DG Attention (differential values), BitNet b1.58 (ternary weights + depth recurrence), arithmetic coding (replaces zstd-22), LeakyReLU(0.5)^2 (-0.003 BPB, zero params).
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
LeakyReLU(0.5)^2: zero extra params, proven -0.003 BPB vs relu^2. Addresses the dead neuron problem. LEAKY_RELU=1 env var.
run_no_ttt_best.sh: run3 base + three free lunches:
- MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
- LeakyReLU(0.5)^2 (zero params, -0.003 BPB)
- QAT=1 (run5 proved negative quant gap)
Drops sigmoid gates and decoder 2x LR (run6 showed they hurt). Real target is openai#445 at 1.1236 (not openai#505, which doesn't fit in 16MB).
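The LeakyReLU(0.5)^2 activation is simple enough to sketch. This is a pure-Python illustration of the formula as described in the commit message (the function name is mine, not the fork's, and the real code presumably operates on tensors):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """Squared LeakyReLU: relu(x)^2 for x >= 0, (slope * x)^2 for x < 0.

    Unlike plain relu^2, the negative branch keeps a nonzero gradient,
    which is the dead-neuron fix the commit message refers to.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

For x >= 0 this matches the relu^2 baseline exactly, so the change only affects units that would otherwise be stuck at zero gradient.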
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Late Training Replay (PR openai#445): buffer last 100 training batches during warmdown (scale < 0.2), replay 2 epochs at 10% LR after training ends. EMA updated during replay (critical detail from openai#445). ~50 lines. run_bestshot.sh stacks everything: MATRIX_LR=0.03, fp32 attn_gate, CK_LR_MULT=1.5, Late Replay, VALUE_RESIDUAL=0
Summary
Important: No TTT on validation data
This submission does NOT perform test-time training on evaluation/validation tokens. The "late replay" step replays training data only (buffered from the training loop) as a late-stage fine-tuning step before weight averaging. No eval tokens are seen before scoring. Fully compliant with issue #402.
Results (3 seeds, 8xH100 SXM)
Seed 1337: 1.1232 BPB, 15.68 MB
Seed 42: 1.1240 BPB, 16.37 MB*
Seed 2024: 1.1239 BPB, 15.50 MB
Mean (1337+2024): 1.1236 | Std: 0.0004
*Seed 42's artifact is over the size limit due to compression variance; its BPB still validates the approach.
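The reported mean and std can be sanity-checked from the per-seed numbers. Population std is assumed here, since only the two qualifying seeds enter the aggregate:

```python
import statistics

# Qualifying seeds only (seed 42's artifact was over the size limit)
bpb = {1337: 1.1232, 2024: 1.1239}

mean = statistics.fmean(bpb.values())   # ~1.1236 as reported
std = statistics.pstdev(bpb.values())   # ~0.0004 after rounding
```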
How Late Training Replay Works
During the warmdown phase (last ~15% of training), the 100 most recent training batches are buffered. After training ends but before EMA weight averaging, these buffered training batches are replayed for 2 epochs at 10% of the base learning rate. This is strictly a training-data-only operation — equivalent to extending training by ~200 steps with a tiny learning rate. The model never sees validation tokens before scoring.
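The control flow above can be sketched in a few lines. This assumes a hypothetical `model_step(batch, lr)` closure that performs one optimizer step and also updates the EMA copy (the PR notes EMA keeps updating during replay); none of the names below come from the actual diff:

```python
from collections import deque

BUFFER_SIZE = 100      # last 100 training batches, per the description
REPLAY_EPOCHS = 2
REPLAY_LR_FRAC = 0.1   # replay at 10% of the base learning rate

def train_with_late_replay(model_step, batches, base_lr, warmdown_start):
    """Run normal training, then Late Training Replay on buffered
    training batches only; no validation tokens are ever involved."""
    buffer = deque(maxlen=BUFFER_SIZE)
    for i, batch in enumerate(batches):
        model_step(batch, base_lr)        # also updates EMA internally
        if i >= warmdown_start:           # buffer only during warmdown
            buffer.append(batch)
    # Replay phase: runs before EMA finalization, training data only.
    for _ in range(REPLAY_EPOCHS):
        for batch in buffer:
            model_step(batch, base_lr * REPLAY_LR_FRAC)
```

Because the buffer holds only batches already seen in training, this amounts to extending training by up to 2 x 100 extra steps at a tiny learning rate, matching the "~200 steps" framing above.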
Test plan
🤖 Generated with Claude Code