[record bpb=1.195] sliding window + LoRA TTT #77
Merged
0hq merged 1 commit into openai:main on Mar 19, 2026
Conversation
phaesoo
added a commit
to phaesoo/parameter-golf
that referenced
this pull request
Mar 19, 2026
… openai#77, openai#78) Analyzed techniques, ablations, and individual BPB contributions. Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029) are the dominant validated techniques. Several promising combinations remain untested across submissions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closed
xskuy
pushed a commit
to xskuy/parameter-golf
that referenced
this pull request
Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs): 1. Sliding window eval (stride=256): overlapping windows give each token more context. Free ~0.03 bpb improvement, zero artifact cost. Based on PRs openai#70, openai#77, openai#65. 2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8, allowing bigger models. Based on PRs openai#78, openai#70. 3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives ~0.019 bpb improvement. Based on PRs openai#70, openai#66. 4. Default dim=512 with LR=0.03 (best config from experiments). 5. forward_logits() helper for sliding window (avoids model.forward which returns loss, not logits). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
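The int6 quantization described in this commit can be sketched as symmetric round-to-nearest quantization. This is a minimal illustration, not the repo's implementation: the function names and the per-tensor scale are assumptions; only the bit width (the commit's `WEIGHT_QUANT_BITS`, default 6) comes from the commit message.

```python
import numpy as np

def quantize(w, bits=6):
    """Symmetric per-tensor quantization: map weights onto signed integers
    in [-(2^(bits-1)-1), 2^(bits-1)-1] with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from the int code and scale."""
    return q.astype(np.float32) * scale
```

With 6 bits the codes fit in [-31, 31], so packed storage takes 6/8 of the int8 footprint, which is the ~25% artifact saving the commit cites; the round-trip error per weight is at most half a quantization step.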
Collaborator
Awesome! I'll try to review and get this in ASAP.
simon-archer
added a commit
to simon-archer/parameter-golf
that referenced
this pull request
Mar 19, 2026
1. NUM_LAYERS=7 (from 8) to fit int6 artifact under 16MB 2. TTT now freezes all params except embedding + last block, reducing backward pass memory by ~85%. This is closer to PR openai#77's LoRA TTT approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This is really cool! One interesting thing to experiment with would be changing the training dataset, since this helps remove some of the train-test mismatch issues you get from doing that. In NanoChat there's been some success using ClimbMix, so that'd be fun to test out. Karpathy has a shuffle here.
0hq
approved these changes
Mar 19, 2026
@0hq Is it intentional that this PR not only creates a subfolder with the code, but also globally modifies the reference
anantdgoel
pushed a commit
to anantdgoel/parameter-golf
that referenced
this pull request
Mar 20, 2026
Explores unigram cache language model (Grave et al. 2017) combined with per-document LoRA test-time training as eval-time techniques. Key findings: - LoRA TTT: -0.003 BPB (confirms PR openai#77's ablation) - Cache LM: negative result on FineWeb (λ=0.02 hurts by +0.002) - Web text burstiness is too low for unigram cache to help Model: v8192 7L GQA + MTP, 2000 steps on 1xA100 (1/10th budget) Best: 1.2529 BPB (LoRA TTT only), 1.2544 (with cache) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
machdragon
added a commit
to machdragon/parameter-golf
that referenced
this pull request
Mar 20, 2026
Built on PR openai#201 (LAWA-EMA + Int6 + Overtone + MLP3x, val_bpb=1.1551). Adds four improvements targeting quantization fidelity and eval-time adaptation: - KURE kurtosis regularization + R2 outlier penalty for int6-friendly weights - Tanh weight reparameterization bounding effective weights to [-1,1] - Parallel EMA tracks (0.995/0.999/0.9995) with proxy-eval selection - Causal LoRA TTT (rank 8) ported from PR openai#77 for eval-time adaptation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leonardcser
added a commit
to leonardcser/parameter-golf
that referenced
this pull request
Mar 21, 2026
Added SGD-based TTT that adapts model to val data during eval. Credit: @timowhite88 PR openai#152, @samacqua PR openai#77. Currently hangs with torch.compile — needs uncompiled model path. Expected ~0.03 BPB improvement when working. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser
added a commit
to leonardcser/parameter-golf
that referenced
this pull request
Mar 21, 2026
Fixed TTT by using compiled model (same as training) instead of creating uncompiled copy. 1 epoch SGD through val data with lr=3e-4. Improvement: 1.2323 → 1.2312 (-0.001 BPB). Takes ~50s. Credit: @timowhite88 PR openai#152, @samacqua PR openai#77. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scottspace
pushed a commit
to scottspace/parameter-golf
that referenced
this pull request
Mar 21, 2026
leonardcser
pushed a commit
to leonardcser/parameter-golf
that referenced
this pull request
Mar 21, 2026
mrdavtan
added a commit
to mrdavtan/parameter-golf
that referenced
this pull request
Mar 23, 2026
… 3 seeds) AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
ahmettrkck
added a commit
to ahmettrkck/parameter-golf
that referenced
this pull request
Mar 25, 2026
Multi-epoch TTT was ruled invalid by organizers (PR openai#568 closed). Now: score each chunk BEFORE training, single pass, each token scored exactly once. Matches PR openai#77 pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nedcut
pushed a commit
to nedcut/parameter-golf
that referenced
this pull request
Mar 26, 2026
This record captures
LoRA TTT: the naive baseline model + document masking + sliding window + LoRA test-time training at evaluation.
Method
Training is identical to the naive baseline.
Evaluation adds per-document LoRA test-time training (TTT). For each document in the validation set, fresh LoRA adapters are trained as the document is scored: each chunk is scored before the adapters take a gradient step on it, so every token is scored exactly once.
Documents are batched (batch_size=64) and sorted by length for efficiency. The LoRA adapters target the lm_head, c_q, and c_v projections in all transformer blocks. A single Adam optimizer with lr=0.01, betas=(0.9, 0.95) trains all LoRA parameters with one gradient step per chunk.
Notes
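The score-then-step loop can be sketched as below. This is an illustrative PyTorch sketch, not the record's code: the class and function names are invented, and the loss function is a stand-in; only the optimizer settings (Adam, lr=0.01, betas=(0.9, 0.95)), the one-step-per-chunk schedule, and score-before-update ordering come from the description above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen Linear with a trainable low-rank update: y = Wx + B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)  # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

def ttt_scores(loss_fn, lora_params, chunks, lr=0.01):
    """Score each chunk BEFORE stepping on it, so every token is scored
    exactly once, by adapters that have only seen earlier chunks."""
    opt = torch.optim.Adam(lora_params, lr=lr, betas=(0.9, 0.95))
    scores = []
    for chunk in chunks:
        loss = loss_fn(chunk)       # per-chunk loss under current adapters
        scores.append(loss.item())  # record first: the step never leaks into the score
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scores
```

Because only A and B receive gradients, per-document adaptation stays cheap and documents can still be batched without adapters leaking across sequences.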
This is very similar to a record I submitted to the modded-nanogpt speedrun repo.
The major addition is making the test-time training ~5x faster by using LoRAs: this lets you keep per-sequence adaptation (no leaking between validation sequences) while still batching.
This is not a heavily optimized run: I just wanted to plant the TTT seed.
It uses ~1/10th of the evaluation budget.
Ablations
The majority of this improvement doesn't come from the TTT itself, but from:
1) Only conditioning on the current document
2) Using a sliding window at eval
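The sliding-window ablation amounts to index bookkeeping: overlapping windows are run over each sequence, but only the trailing tokens of each window are scored, so every token is scored exactly once with fresh left context. A minimal sketch (the function name and window size are illustrative; a stride of 256 is the value mentioned in commits referencing this PR):

```python
def sliding_window_spans(n_tokens, window=1024, stride=256):
    """Yield (start, score_from, end): tokens in [score_from, end) are scored
    using context [start, end). Each token is scored exactly once."""
    spans = []
    pos = 0  # next token still to be scored
    while pos < n_tokens:
        # first window scores all of itself; later windows score only `stride` new tokens
        end = min(pos + (window if pos == 0 else stride), n_tokens)
        start = max(0, end - window)  # slide the context back to the full window
        spans.append((start, pos, end))
        pos = end
    return spans
```

Each scored token (after the first window) sees up to window − stride tokens of prior context instead of starting cold at a block boundary, which is where the "free" BPB improvement comes from; the cost is roughly window/stride forward passes per sequence instead of one.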
Results
Validated on the full 50k-document fineweb_val split. Submitting at bpb=1.195.
bpb: [1.1927, 1.1935, 1.1921, 1.1929] mean: 1.1928 std: 0.0005 p-value < 1.195: 0.00234486
Command
Included files
train_gpt.py
train_v*.txt (note that train_v0.txt is on 2xH100)
submission.json