
[record bpb=1.195] sliding window + LoRA TTT#77

Merged
0hq merged 1 commit into openai:main from samacqua:main
Mar 19, 2026

Conversation

@samacqua
Contributor

@samacqua samacqua commented Mar 19, 2026

This record captures LoRA TTT: the naive baseline model + document masking + sliding window + LoRA test-time training at evaluation.

Method

Training is identical to the naive baseline.

Evaluation adds per-document LoRA test-time training (TTT). For each document in the validation set:

  1. Find document boundaries using BOS tokens
  2. Split the document into overlapping chunks (chunk_size=256 within eval_seq_len=1024 context windows)
  3. For each chunk, score it (accumulating loss/bytes for BPB), then train rank-8 LoRA adapters on that chunk's loss, so the model only ever trains on already-scored context (no leakage)
  4. Reset the LoRA parameters between documents (no leakage across documents)

Documents are batched (batch_size=64) and sorted by length for efficiency. The LoRA adapters target lm_head, c_q, and c_v projections in all transformer blocks. A single Adam optimizer with lr=0.01, betas=(0.9, 0.95) trains all LoRA parameters with one gradient step per chunk.
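As a rough sketch of the adapter described above (frozen base weights, trainable rank-8 factors initialized so a fresh adapter is a no-op, and a reset between documents), not the actual train_gpt.py code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: frozen base linear plus a trainable rank-r update.
    Hypothetical sketch of the idea; the submitted code may differ."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapter is trained at eval time
        self.A = nn.Parameter(torch.empty(rank, base.in_features))
        self.B = nn.Parameter(torch.empty(base.out_features, rank))
        self.reset()

    def reset(self):
        # B starts at zero, so a fresh adapter contributes exactly nothing;
        # calling this at each document boundary prevents cross-document leakage
        nn.init.normal_(self.A, std=0.01)
        nn.init.zeros_(self.B)

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()
```

In this sketch you would wrap lm_head, c_q, and c_v, put all A/B parameters into a single Adam(lr=0.01, betas=(0.9, 0.95)), take one gradient step per chunk after scoring it, and call reset() between documents.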

Notes

This is very similar to a record I submitted to the modded-nanogpt speedrun repo.
The major addition is making the test-time training ~5x faster by using LoRAs: this lets you keep per-sequence adaptation (no leakage between validation sequences) while still batching.

This is not a heavily optimized run: I just wanted to plant the TTT seed.
It uses ~1/10th of the evaluation budget.

Ablations

Most of the improvement comes not from the TTT itself, but from:
  1. Only conditioning on the current document
  2. Using a sliding window at eval

| Condition | val_loss | val_bpb | Δ bpb |
| --- | --- | --- | --- |
| Baseline (cross-doc, flat stream) | 2.0731 | 1.2278 | |
| + Doc-isolated | 2.0561 | 1.2168 | -0.0110 |
| + Stride (chunk=256) | 2.0177 | 1.1941 | -0.0337 |
| + LoRA TTT | 2.0126 | 1.1910 | -0.0368 |
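To make the doc-isolated stride rows concrete, here is an illustrative index computation (hypothetical helper name; the real evaluation also batches documents): each 256-token chunk is scored exactly once, conditioned on up to eval_seq_len=1024 tokens of its own document only.

```python
def eval_windows(doc_len, context=1024, chunk=256):
    """Plan doc-isolated sliding-window scoring. Yields (ctx_start,
    score_start, score_end) spans: every token is scored exactly once,
    and the context window never crosses the document start."""
    spans = []
    for score_start in range(0, doc_len, chunk):
        score_end = min(score_start + chunk, doc_len)
        ctx_start = max(0, score_end - context)  # clamp to this document
        spans.append((ctx_start, score_start, score_end))
    return spans
```

For a 2300-token document, early chunks see a growing prefix (the first chunk has no prior context), while later chunks each get a full 1024-token window ending at the chunk boundary.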

Results

Validated on the full 50k-document fineweb_val split. Submitting at bpb=1.195.

bpb: [1.1927, 1.1935, 1.1921, 1.1929]
mean: 1.1928
std: 0.0005
p-value (bpb < 1.195): 0.00234486
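Assuming the quoted p-value is a one-sided one-sample t-test of the four seed bpb values against the 1.195 target (a guess; the PR does not say how it was computed), it can be reproduced with the stdlib alone, since the Student-t CDF has a closed form at df = 3:

```python
import math

bpb = [1.1927, 1.1935, 1.1921, 1.1929]
target = 1.195
n = len(bpb)
mean = sum(bpb) / n
var = sum((x - mean) ** 2 for x in bpb) / (n - 1)  # sample variance
t = (mean - target) / math.sqrt(var / n)           # t-statistic, df = n - 1 = 3

# closed-form Student-t CDF for 3 degrees of freedom
x = t / math.sqrt(3)
p = 0.5 + (math.atan(x) + x / (1 + x * x)) / math.pi  # one-sided P(T < t)
```

This lands within rounding of the reported 0.00234486.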

Command

torchrun --standalone --nproc_per_node=8 train_gpt.py

Included files

  • train_gpt.py
  • train_v*.txt (note that train_v0.txt is on 2xH100)
  • submission.json

@samacqua samacqua changed the title [record bpb=1.195] LoRA TTT [record bpb=1.195] sliding window + LoRA TTT Mar 19, 2026
phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request Mar 19, 2026
openai#77, openai#78)

Analyzed techniques, ablations, and individual BPB contributions.
Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029)
are the dominant validated techniques. Several promising combinations
remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jordankzf jordankzf mentioned this pull request Mar 19, 2026
xskuy pushed a commit to xskuy/parameter-golf that referenced this pull request Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs):

1. Sliding window eval (stride=256): overlapping windows give each token
   more context. Free ~0.03 bpb improvement, zero artifact cost.
   Based on PRs openai#70, openai#77, openai#65.

2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and
   EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8,
   allowing bigger models. Based on PRs openai#78, openai#70.

3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives
   ~0.019 bpb improvement. Based on PRs openai#70, openai#66.

4. Default dim=512 with LR=0.03 (best config from experiments).

5. forward_logits() helper for sliding window (avoids model.forward
   which returns loss, not logits).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@0hq
Collaborator

0hq commented Mar 19, 2026

Awesome! I'll try to review and get this in ASAP.

simon-archer added a commit to simon-archer/parameter-golf that referenced this pull request Mar 19, 2026
1. NUM_LAYERS=7 (from 8) to fit int6 artifact under 16MB
2. TTT now freezes all params except embedding + last block,
   reducing backward pass memory by ~85%. This is closer to
   PR openai#77's LoRA TTT approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@evan-conway

This is really cool! One interesting thing to experiment with would be changing the training dataset, since that helps remove some of the train-test mismatch. In NanoChat there's been some success using ClimbMix, so that'd be fun to test out. Karpathy has a shuffle here.

@FI-Mihej

FI-Mihej commented Mar 20, 2026

@0hq Is it intentional that this PR not only creates a subfolder with the code, but also globally modifies the reference train_gpt.py in the root of the repository for everyone else?

anantdgoel pushed a commit to anantdgoel/parameter-golf that referenced this pull request Mar 20, 2026
Explores unigram cache language model (Grave et al. 2017) combined with
per-document LoRA test-time training as eval-time techniques.

Key findings:
- LoRA TTT: -0.003 BPB (confirms PR openai#77's ablation)
- Cache LM: negative result on FineWeb (λ=0.02 hurts by +0.002)
- Web text burstiness is too low for unigram cache to help

Model: v8192 7L GQA + MTP, 2000 steps on 1xA100 (1/10th budget)
Best: 1.2529 BPB (LoRA TTT only), 1.2544 (with cache)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
Built on PR openai#201 (LAWA-EMA + Int6 + Overtone + MLP3x, val_bpb=1.1551).
Adds four improvements targeting quantization fidelity and eval-time adaptation:

- KURE kurtosis regularization + R2 outlier penalty for int6-friendly weights
- Tanh weight reparameterization bounding effective weights to [-1,1]
- Parallel EMA tracks (0.995/0.999/0.9995) with proxy-eval selection
- Causal LoRA TTT (rank 8) ported from PR openai#77 for eval-time adaptation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Added SGD-based TTT that adapts model to val data during eval.
Credit: @timowhite88 PR openai#152, @samacqua PR openai#77.
Currently hangs with torch.compile — needs uncompiled model path.
Expected ~0.03 BPB improvement when working.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Fixed TTT by using compiled model (same as training) instead of
creating uncompiled copy. 1 epoch SGD through val data with lr=3e-4.
Improvement: 1.2323 → 1.2312 (-0.001 BPB). Takes ~50s.

Credit: @timowhite88 PR openai#152, @samacqua PR openai#77.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scottspace pushed a commit to scottspace/parameter-golf that referenced this pull request Mar 21, 2026
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
ahmettrkck added a commit to ahmettrkck/parameter-golf that referenced this pull request Mar 25, 2026
Multi-epoch TTT was ruled invalid by organizers (PR openai#568 closed).
Now: score each chunk BEFORE training, single pass, each token
scored exactly once. Matches PR openai#77 pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026