
Late Training Replay + EMA + GPTQ-lite (val_bpb=1.1236, 2-seed, no TTT on eval)#445

Closed
newjordan wants to merge 9 commits into openai:main from newjordan:submission/ttt-burst-gptq15

Conversation


@newjordan newjordan commented Mar 22, 2026

Summary

Important: No TTT on validation data

This submission does NOT perform test-time training on evaluation/validation tokens. The "late replay" step replays training data only (buffered from the training loop) as a late-stage fine-tuning step before weight averaging. No eval tokens are seen before scoring. Fully compliant with issue #402.

Results (3 seeds, 8xH100 SXM)

| Seed | Steps | Sliding BPB (s64) | Artifact  |
|------|-------|-------------------|-----------|
| 1337 | 6991  | 1.1232            | 15.68 MB  |
| 42   | 6994  | 1.1240            | 16.37 MB* |
| 2024 | 6987  | 1.1239            | 15.50 MB  |

Mean (1337+2024): 1.1236 | Std: 0.0004

*Seed 42 artifact over size limit due to compression variance; BPB validates the approach.
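The "Sliding BPB (s64)" column is a sliding-window bits-per-byte evaluation with stride 64: each chunk of 64 tokens is scored once, conditioned on overlapping left context. A minimal sketch, assuming a hypothetical `score(ctx, targets)` interface that returns the summed negative log-likelihood of `targets` in nats (the actual eval harness in the repo may differ):

```python
import math

def sliding_window_bpb(score, tokens, n_bytes, window=2048, stride=64):
    # score(ctx, targets) -> summed negative log-likelihood of `targets`
    # in nats, conditioned on `ctx` (hypothetical evaluation interface).
    total_nats, pos = 0.0, 0
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        ctx = tokens[max(0, end - window):pos]  # overlapping left context
        total_nats += score(ctx, tokens[pos:end])
        pos = end
    return total_nats / math.log(2) / n_bytes  # nats -> bits, per byte
```

Every token is scored exactly once, but with up to `window - stride` tokens of context, which is why sliding-window BPB is typically lower (and more stable across seeds) than disjoint-chunk evaluation.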

How Late Training Replay Works

During the warmdown phase (last ~15% of training), the 100 most recent training batches are buffered. After training ends but before EMA weight averaging, these buffered training batches are replayed for 2 epochs at 10% of the base learning rate. This is strictly a training-data-only operation — equivalent to extending training by ~200 steps with a tiny learning rate. The model never sees validation tokens before scoring.
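The schedule above can be sketched as follows. This is an illustrative reconstruction, not code from `train_gpt.py`; `train_step`, `ema_update`, and the constants are hypothetical names for the pieces the paragraph describes:

```python
from collections import deque

REPLAY_BUFFER_SIZE = 100   # keep only the most recent training batches
REPLAY_EPOCHS = 2
REPLAY_LR_SCALE = 0.10     # replay at 10% of the base learning rate
WARMDOWN_FRAC = 0.15       # buffer during the last ~15% of training

def train_with_late_replay(batches, base_lr, train_step, ema_update):
    total = len(batches)
    buffer = deque(maxlen=REPLAY_BUFFER_SIZE)  # ring buffer of newest batches
    for step, batch in enumerate(batches):
        train_step(batch, base_lr)
        ema_update()
        if step >= total * (1 - WARMDOWN_FRAC):  # warmdown phase only
            buffer.append(batch)
    # Late replay: training data only, run before EMA weights are finalized.
    for _ in range(REPLAY_EPOCHS):
        for batch in buffer:
            train_step(batch, base_lr * REPLAY_LR_SCALE)
            ema_update()  # EMA continues to track weights during replay
    return list(buffer)
```

Note that the EMA keeps updating during replay, so the averaged weights absorb the extra ~200 low-learning-rate steps before quantization.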

Test plan

  • 3 seeds run on 8xH100 SXM, 600s each
  • Seeds 1337, 2024 under 16MB (15.68 MB, 15.50 MB)
  • Post-quant int6 roundtrip verified
  • Sliding window eval (stride=64) consistent across seeds (std=0.0004)
  • train_gpt.py under 1500 lines (1443)
  • No TTT on validation data — late replay uses training data only (per issue #402, "Invalid submissions due to information leakage during TTT")
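The int6 round-trip check in the test plan can be sketched as a symmetric per-tensor quantizer whose reconstruction error is bounded by half a quantization step. This is a generic illustration, not the GPTQ-lite code from this PR:

```python
def quantize_int6(weights):
    # Symmetric per-tensor int6: codes live in [-31, 31] (6-bit signed).
    scale = max(abs(w) for w in weights) / 31 or 1.0  # guard all-zero tensors
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int6(codes, scale):
    return [c * scale for c in codes]

def roundtrip_ok(weights, tol_steps=0.5):
    codes, scale = quantize_int6(weights)
    recon = dequantize_int6(codes, scale)
    # Rounding error of symmetric quantization is at most half a step.
    return max(abs(w - r) for w, r in zip(weights, recon)) <= tol_steps * scale + 1e-12
```

A round-trip test of this shape catches scale or clipping bugs in the export path without needing the full eval run.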

🤖 Generated with Claude Code

Octavian and others added 9 commits March 18, 2026 18:06
Seeds 42 and 2024 in progress.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 1337: 1.12319 BPB, 15.68 MB (qualifying)
Seed 42: 1.12397 BPB, 16.37 MB (over size, validates BPB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 1337: 1.1232 BPB, 15.68 MB
Seed 42: 1.1240 BPB, 16.37 MB (over size, validates BPB)
Seed 2024: 1.1239 BPB, 15.50 MB
2-seed mean (1337+2024): 1.1236

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newjordan newjordan changed the title Tiny bump: 11L TTT Burst + EMA + GPTQ-lite (val_bpb=1.1232) Tiny bump electric boogaloo: 11L TTT Burst + EMA + GPTQ-lite (val_bpb=1.1236, 3-seed) Mar 22, 2026
@newjordan newjordan changed the title Tiny bump electric boogaloo: 11L TTT Burst + EMA + GPTQ-lite (val_bpb=1.1236, 3-seed) Late Training Replay + EMA + GPTQ-lite (val_bpb=1.1236, 2-seed, no TTT on eval) Mar 23, 2026
@newjordan
Author

Updated per issue #402: renamed 'TTT Burst' to 'Late Training Replay' to clarify that this does NOT perform test-time training on validation/eval tokens. The replay step buffers and replays training data only (the last 100 training batches) as a late fine-tuning step before EMA finalization. No eval tokens are seen before scoring. Also closed our PR #390 (Sponge Bath), which did use TTT on eval tokens.

mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
…iques

Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792
config produces ~20MB artifacts. Real non-TTT target is openai#445 at 1.1236.

Novel technique analysis: DG Attention (differential values), BitNet b1.58
(ternary weights + depth recurrence), arithmetic coding (replaces zstd-22),
LeakyReLU(0.5)^2 (-0.003 BPB, zero params).
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
LeakyReLU(0.5)^2: zero extra params, proven -0.003 BPB vs relu^2.
Addresses dead neuron problem. LEAKY_RELU=1 env var.

run_no_ttt_best.sh: run3 base + three free lunches:
  - MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
  - LeakyReLU(0.5)^2 (zero params, -0.003 BPB)
  - QAT=1 (run5 proved negative quant gap)

Drops sigmoid gates and decoder 2x LR (run6 showed they hurt).
Real target is openai#445 at 1.1236 (not openai#505 which doesn't fit 16MB).
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Late Training Replay (PR openai#445): buffer last 100 training batches during
warmdown (scale < 0.2), replay 2 epochs at 10% LR after training ends.
EMA updated during replay (critical detail from openai#445). ~50 lines.

run_bestshot.sh stacks everything:
  MATRIX_LR=0.03, fp32 attn_gate, CK_LR_MULT=1.5,
  Late Replay, VALUE_RESIDUAL=0
@newjordan newjordan closed this Mar 23, 2026
