
[record bpb=1.195] sliding window + LoRA TTT#77

Merged
0hq merged 1 commit into openai:main from samacqua:main
Mar 19, 2026

Conversation

@samacqua
Contributor

@samacqua samacqua commented Mar 19, 2026

This record captures LoRA TTT: the naive baseline model + document masking + sliding window + LoRA test-time training at evaluation.

Method

Training is identical to the naive baseline.

Evaluation adds per-document LoRA test-time training (TTT). For each document in the validation set:

  1. Find document boundaries using BOS tokens
  2. Split the document into overlapping chunks (chunk_size=256 within eval_seq_len=1024 context windows)
  3. For each chunk, score it (accumulating loss/bytes for BPB), then train rank-8 LoRA adapters on that chunk's loss, so the model only ever trains on already-scored context (no leakage)
  4. Reset the LoRA parameters between documents (no leakage across documents)

Documents are batched (batch_size=64) and sorted by length for efficiency. The LoRA adapters target lm_head, c_q, and c_v projections in all transformer blocks. A single Adam optimizer with lr=0.01, betas=(0.9, 0.95) trains all LoRA parameters with one gradient step per chunk.
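As a rough sketch of the adapter described above (frozen base weights, trainable rank-8 factors initialized so a fresh adapter is a no-op, and a reset between documents), not the actual train_gpt.py code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: frozen base linear plus a trainable rank-r update.
    Hypothetical sketch of the idea; the submitted code may differ."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapter is trained at eval time
        self.A = nn.Parameter(torch.empty(rank, base.in_features))
        self.B = nn.Parameter(torch.empty(base.out_features, rank))
        self.reset()

    def reset(self):
        # B starts at zero, so a fresh adapter contributes exactly nothing;
        # calling this at each document boundary prevents cross-document leakage
        nn.init.normal_(self.A, std=0.01)
        nn.init.zeros_(self.B)

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()
```

In this sketch you would wrap lm_head, c_q, and c_v, put all A/B parameters into a single Adam(lr=0.01, betas=(0.9, 0.95)), take one gradient step per chunk after scoring it, and call reset() between documents.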

Notes

This is very similar to a record I submitted to the modded-nanogpt speedrun repo.
The major addition is making the test-time training ~5x faster by using LoRAs: this lets you keep per-sequence adaptation (no leakage between validation sequences) while still batching.

This is not a heavily optimized run: I just wanted to plant the TTT seed.
It uses ~1/10th of the evaluation budget.

Ablations

Most of the improvement comes not from the TTT itself, but from:
  1. Only conditioning on the current document
  2. Using a sliding window at eval

| Condition | val_loss | val_bpb | Δ bpb |
| --- | --- | --- | --- |
| Baseline (cross-doc, flat stream) | 2.0731 | 1.2278 | |
| + Doc-isolated | 2.0561 | 1.2168 | -0.0110 |
| + Stride (chunk=256) | 2.0177 | 1.1941 | -0.0337 |
| + LoRA TTT | 2.0126 | 1.1910 | -0.0368 |
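To make the doc-isolated stride rows concrete, here is an illustrative index computation (hypothetical helper name; the real evaluation also batches documents): each 256-token chunk is scored exactly once, conditioned on up to eval_seq_len=1024 tokens of its own document only.

```python
def eval_windows(doc_len, context=1024, chunk=256):
    """Plan doc-isolated sliding-window scoring. Yields (ctx_start,
    score_start, score_end) spans: every token is scored exactly once,
    and the context window never crosses the document start."""
    spans = []
    for score_start in range(0, doc_len, chunk):
        score_end = min(score_start + chunk, doc_len)
        ctx_start = max(0, score_end - context)  # clamp to this document
        spans.append((ctx_start, score_start, score_end))
    return spans
```

For a 2300-token document, early chunks see a growing prefix (the first chunk has no prior context), while later chunks each get a full 1024-token window ending at the chunk boundary.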

Results

Validated on the full 50k-document fineweb_val split. Submitting at bpb=1.195.

bpb: [1.1927, 1.1935, 1.1921, 1.1929]
mean: 1.1928
std: 0.0005
p-value (bpb < 1.195): 0.00234486
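Assuming the quoted p-value is a one-sided one-sample t-test of the four seed bpb values against the 1.195 target (a guess; the PR does not say how it was computed), it can be reproduced with the stdlib alone, since the Student-t CDF has a closed form at df = 3:

```python
import math

bpb = [1.1927, 1.1935, 1.1921, 1.1929]
target = 1.195
n = len(bpb)
mean = sum(bpb) / n
var = sum((x - mean) ** 2 for x in bpb) / (n - 1)  # sample variance
t = (mean - target) / math.sqrt(var / n)           # t-statistic, df = n - 1 = 3

# closed-form Student-t CDF for 3 degrees of freedom
x = t / math.sqrt(3)
p = 0.5 + (math.atan(x) + x / (1 + x * x)) / math.pi  # one-sided P(T < t)
```

This lands within rounding of the reported 0.00234486.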

Command

torchrun --standalone --nproc_per_node=8 train_gpt.py

Included files

  • train_gpt.py
  • train_v*.txt (note that train_v0.txt is on 2xH100)
  • submission.json

@samacqua samacqua changed the title [record bpb=1.195] LoRA TTT [record bpb=1.195] sliding window + LoRA TTT Mar 19, 2026
phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request Mar 19, 2026
openai#77, openai#78)

Analyzed techniques, ablations, and individual BPB contributions.
Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029)
are the dominant validated techniques. Several promising combinations
remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jordankzf jordankzf mentioned this pull request Mar 19, 2026
xskuy pushed a commit to xskuy/parameter-golf that referenced this pull request Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs):

1. Sliding window eval (stride=256): overlapping windows give each token
   more context. Free ~0.03 bpb improvement, zero artifact cost.
   Based on PRs openai#70, openai#77, openai#65.

2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and
   EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8,
   allowing bigger models. Based on PRs openai#78, openai#70.

3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives
   ~0.019 bpb improvement. Based on PRs openai#70, openai#66.

4. Default dim=512 with LR=0.03 (best config from experiments).

5. forward_logits() helper for sliding window (avoids model.forward
   which returns loss, not logits).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@0hq
Collaborator

0hq commented Mar 19, 2026

Awesome! I'll try to review and get this in ASAP.

simon-archer added a commit to simon-archer/parameter-golf that referenced this pull request Mar 19, 2026
1. NUM_LAYERS=7 (from 8) to fit int6 artifact under 16MB
2. TTT now freezes all params except embedding + last block,
   reducing backward pass memory by ~85%. This is closer to
   PR openai#77's LoRA TTT approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@evan-conway

This is really cool! One interesting thing to experiment with would be changing the training dataset, since that helps remove some of the train-test mismatch. In NanoChat there's been some success using ClimbMix, so that'd be fun to test out. Karpathy has a shuffle here.

@FI-Mihej

FI-Mihej commented Mar 20, 2026

@0hq Is it intentional that this PR not only creates a subfolder with the code, but also globally modifies the reference train_gpt.py in the root of the repository for everyone else?

anantdgoel pushed a commit to anantdgoel/parameter-golf that referenced this pull request Mar 20, 2026
Explores unigram cache language model (Grave et al. 2017) combined with
per-document LoRA test-time training as eval-time techniques.

Key findings:
- LoRA TTT: -0.003 BPB (confirms PR openai#77's ablation)
- Cache LM: negative result on FineWeb (λ=0.02 hurts by +0.002)
- Web text burstiness is too low for unigram cache to help

Model: v8192 7L GQA + MTP, 2000 steps on 1xA100 (1/10th budget)
Best: 1.2529 BPB (LoRA TTT only), 1.2544 (with cache)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
Built on PR openai#201 (LAWA-EMA + Int6 + Overtone + MLP3x, val_bpb=1.1551).
Adds four improvements targeting quantization fidelity and eval-time adaptation:

- KURE kurtosis regularization + R2 outlier penalty for int6-friendly weights
- Tanh weight reparameterization bounding effective weights to [-1,1]
- Parallel EMA tracks (0.995/0.999/0.9995) with proxy-eval selection
- Causal LoRA TTT (rank 8) ported from PR openai#77 for eval-time adaptation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Added SGD-based TTT that adapts model to val data during eval.
Credit: @timowhite88 PR openai#152, @samacqua PR openai#77.
Currently hangs with torch.compile — needs uncompiled model path.
Expected ~0.03 BPB improvement when working.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Fixed TTT by using compiled model (same as training) instead of
creating uncompiled copy. 1 epoch SGD through val data with lr=3e-4.
Improvement: 1.2323 → 1.2312 (-0.001 BPB). Takes ~50s.

Credit: @timowhite88 PR openai#152, @samacqua PR openai#77.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scottspace pushed a commit to scottspace/parameter-golf that referenced this pull request Mar 21, 2026
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
ahmettrkck added a commit to ahmettrkck/parameter-golf that referenced this pull request Mar 25, 2026
Multi-epoch TTT was ruled invalid by organizers (PR openai#568 closed).
Now: score each chunk BEFORE training, single pass, each token
scored exactly once. Matches PR openai#77 pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026