
Late Training Replay + EMA + GPTQ-lite (val_bpb=1.1236, 2-seed, no TTT on eval)#445

Closed
newjordan wants to merge 9 commits into openai:main from newjordan:submission/ttt-burst-gptq15

Conversation


@newjordan newjordan commented Mar 22, 2026

Summary

Important: No TTT on validation data

This submission does NOT perform test-time training on evaluation/validation tokens. The "late replay" step replays training data only (buffered from the training loop) as a late-stage fine-tuning step before weight averaging. No eval tokens are seen before scoring. Fully compliant with issue #402.

Results (3 seeds, 8xH100 SXM)

| Seed | Steps | Sliding BPB (s64) | Artifact  |
|------|-------|-------------------|-----------|
| 1337 | 6991  | 1.1232            | 15.68 MB  |
| 42   | 6994  | 1.1240            | 16.37 MB* |
| 2024 | 6987  | 1.1239            | 15.50 MB  |

Mean (1337+2024): 1.1236 | Std: 0.0004

*Seed 42 artifact over size limit due to compression variance; BPB validates the approach.
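The "Sliding BPB (s64)" column is a sliding-window bits-per-byte evaluation with stride 64: each chunk of 64 tokens is scored once, conditioned on overlapping left context. A minimal sketch, assuming a hypothetical `score(ctx, targets)` interface that returns the summed negative log-likelihood of `targets` in nats (the actual eval harness in the repo may differ):

```python
import math

def sliding_window_bpb(score, tokens, n_bytes, window=2048, stride=64):
    # score(ctx, targets) -> summed negative log-likelihood of `targets`
    # in nats, conditioned on `ctx` (hypothetical evaluation interface).
    total_nats, pos = 0.0, 0
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        ctx = tokens[max(0, end - window):pos]  # overlapping left context
        total_nats += score(ctx, tokens[pos:end])
        pos = end
    return total_nats / math.log(2) / n_bytes  # nats -> bits, per byte
```

Every token is scored exactly once, but with up to `window - stride` tokens of context, which is why sliding-window BPB is typically lower (and more stable across seeds) than disjoint-chunk evaluation.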

How Late Training Replay Works

During the warmdown phase (last ~15% of training), the 100 most recent training batches are buffered. After training ends but before EMA weight averaging, these buffered training batches are replayed for 2 epochs at 10% of the base learning rate. This is strictly a training-data-only operation — equivalent to extending training by ~200 steps with a tiny learning rate. The model never sees validation tokens before scoring.
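The schedule above can be sketched as follows. This is an illustrative reconstruction, not code from `train_gpt.py`; `train_step`, `ema_update`, and the constants are hypothetical names for the pieces the paragraph describes:

```python
from collections import deque

REPLAY_BUFFER_SIZE = 100   # keep only the most recent training batches
REPLAY_EPOCHS = 2
REPLAY_LR_SCALE = 0.10     # replay at 10% of the base learning rate
WARMDOWN_FRAC = 0.15       # buffer during the last ~15% of training

def train_with_late_replay(batches, base_lr, train_step, ema_update):
    total = len(batches)
    buffer = deque(maxlen=REPLAY_BUFFER_SIZE)  # ring buffer of newest batches
    for step, batch in enumerate(batches):
        train_step(batch, base_lr)
        ema_update()
        if step >= total * (1 - WARMDOWN_FRAC):  # warmdown phase only
            buffer.append(batch)
    # Late replay: training data only, run before EMA weights are finalized.
    for _ in range(REPLAY_EPOCHS):
        for batch in buffer:
            train_step(batch, base_lr * REPLAY_LR_SCALE)
            ema_update()  # EMA continues to track weights during replay
    return list(buffer)
```

Note that the EMA keeps updating during replay, so the averaged weights absorb the extra ~200 low-learning-rate steps before quantization.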

Test plan

  • 3 seeds run on 8xH100 SXM, 600s each
  • Seeds 1337, 2024 under 16MB (15.68 MB, 15.50 MB)
  • Post-quant int6 roundtrip verified
  • Sliding window eval (stride=64) consistent across seeds (std=0.0004)
  • train_gpt.py under 1500 lines (1443)
  • No TTT on validation data — late replay uses training data only (per issue #402, "Invalid submissions due to information leakage during TTT")
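The int6 round-trip check in the test plan can be sketched as a symmetric per-tensor quantizer whose reconstruction error is bounded by half a quantization step. This is a generic illustration, not the GPTQ-lite code from this PR:

```python
def quantize_int6(weights):
    # Symmetric per-tensor int6: codes live in [-31, 31] (6-bit signed).
    scale = max(abs(w) for w in weights) / 31 or 1.0  # guard all-zero tensors
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int6(codes, scale):
    return [c * scale for c in codes]

def roundtrip_ok(weights, tol_steps=0.5):
    codes, scale = quantize_int6(weights)
    recon = dequantize_int6(codes, scale)
    # Rounding error of symmetric quantization is at most half a step.
    return max(abs(w - r) for w, r in zip(weights, recon)) <= tol_steps * scale + 1e-12
```

A round-trip test of this shape catches scale or clipping bugs in the export path without needing the full eval run.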

🤖 Generated with Claude Code

Octavian and others added 9 commits March 18, 2026 18:06
Seeds 42 and 2024 in progress.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 1337: 1.12319 BPB, 15.68 MB (qualifying)
Seed 42: 1.12397 BPB, 16.37 MB (over size, validates BPB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 1337: 1.1232 BPB, 15.68 MB
Seed 42: 1.1240 BPB, 16.37 MB (over size, validates BPB)
Seed 2024: 1.1239 BPB, 15.50 MB
2-seed mean (1337+2024): 1.1236

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newjordan newjordan changed the title Tiny bump: 11L TTT Burst + EMA + GPTQ-lite (val_bpb=1.1232) Tiny bump electric boogaloo: 11L TTT Burst + EMA + GPTQ-lite (val_bpb=1.1236, 3-seed) Mar 22, 2026
@newjordan newjordan changed the title Tiny bump electric boogaloo: 11L TTT Burst + EMA + GPTQ-lite (val_bpb=1.1236, 3-seed) Late Training Replay + EMA + GPTQ-lite (val_bpb=1.1236, 2-seed, no TTT on eval) Mar 23, 2026
@newjordan
Author

Updated per issue #402: renamed 'TTT Burst' to 'Late Training Replay' to clarify that this does NOT perform test-time training on validation/eval tokens. The replay step buffers and replays training data only (the last 100 training batches) as a late fine-tuning step before EMA finalization. No eval tokens are seen before scoring. Also closed our PR #390 (Sponge Bath), which did use TTT on eval tokens.

mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
…iques

Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792
config produces ~20MB artifacts. Real non-TTT target is openai#445 at 1.1236.

Novel technique analysis: DG Attention (differential values), BitNet b1.58
(ternary weights + depth recurrence), arithmetic coding (replaces zstd-22),
LeakyReLU(0.5)^2 (-0.003 BPB, zero params).
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
LeakyReLU(0.5)^2: zero extra params, proven -0.003 BPB vs relu^2.
Addresses dead neuron problem. LEAKY_RELU=1 env var.

run_no_ttt_best.sh: run3 base + three free lunches:
  - MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
  - LeakyReLU(0.5)^2 (zero params, -0.003 BPB)
  - QAT=1 (run5 proved negative quant gap)

Drops sigmoid gates and decoder 2x LR (run6 showed they hurt).
Real target is openai#445 at 1.1236 (not openai#505 which doesn't fit 16MB).
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Late Training Replay (PR openai#445): buffer last 100 training batches during
warmdown (scale < 0.2), replay 2 epochs at 10% LR after training ends.
EMA updated during replay (critical detail from openai#445). ~50 lines.

run_bestshot.sh stacks everything:
  MATRIX_LR=0.03, fp32 attn_gate, CK_LR_MULT=1.5,
  Late Replay, VALUE_RESIDUAL=0
@newjordan newjordan closed this Mar 23, 2026
