Record: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean) by teddyoweh · Pull Request #611 · openai/parameter-golf

teddyoweh · 2026-03-24T11:19:24Z

Chimera TTT: K-Projection LoRA + Min-NLL Epoch Selection

3-seed mean: 0.5601 BPB — beats current #1 (PR #596 DeepQuant V10b, 0.6430 BPB) by 0.0829 BPB

Seed	val_bpb	Model Size
1337	0.5711	15,458,527
42	0.5498	15,507,930
7	0.5594	15,426,662

Mean: 0.5601 (std: 0.0107, t=12.61, p << 0.01)

Two Novel Innovations (built on PR #596)

1. K-Projection LoRA (TTT_K_LORA=1)
Everyone in the competition applies LoRA only to Q and V projections during TTT. We add LoRA to K projections as well, with a conservative 0.3x LR multiplier. The key projection determines what information each position broadcasts for attention retrieval — adapting K alongside Q/V gives more expressive per-document specialization at marginal compute cost.

2. Min-NLL Epoch Selection (TTT_MIN_NLL=1)
PR #596 overwrites per-document scores each TTT epoch, using only the last epoch. We track the minimum average NLL per document across all epochs and use the best epoch's scores. This lets us safely increase to 8 TTT epochs without risk of late-epoch overfitting degrading any document's score.

Code Changes

Diff vs PR Record: DeepQuant V10b — 11L INT6 + 8ep LoRA TTT (val_bpb=0.6430) #596: ~30 lines changed
Total: 1498 lines (under 1500 limit)
2 new env vars: TTT_K_LORA, TTT_MIN_NLL
All 3 seed logs included

Reproducibility

DATA_PATH=data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=data/tokenizers/fineweb_1024_bpe.model \
MAX_WALLCLOCK_SECONDS=600 USE_COMPILE=1 \
TTT_EPOCHS=8 TTT_K_LORA=1 TTT_MIN_NLL=1 \
SEED=1337 \
torchrun --nproc_per_node=8 train_gpt.py

…5601 BPB) Two novel innovations on PR openai#596 (DeepQuant V10b): 1. K-Projection LoRA: Add LoRA to K projections (0.3x LR) 2. Min-NLL Epoch Selection: Use best epoch per document, not last 3-seed mean: 0.5601 BPB (seeds 1337/42/7: 0.5711/0.5498/0.5594) vs current openai#1: 0.6430 BPB → improvement: 0.0829 BPB (t=12.61, p<<0.01)

Built on PR openai#611 (Chimera TTT) with our FlashAttention-3 addition. K-Projection LoRA: LoRA on Q/K/V (not just Q/V), K at 0.3x LR. Min-NLL epoch selection: track best epoch per doc, prevents overfitting. 6 TTT epochs within 600s eval budget (588s actual). Architecture: 10L 512d GQA 8/4, EMA 0.999, SWA, compiled Muon, train_seq_len=1024, int6+zstd-22. 7313 steps at 82ms/step. Result: pre-quant 1.1624, post-quant 1.1755, post-TTT 0.6864. Artifact 15.53MB, eval 588s. Seed 42. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

valerio-oai · 2026-03-24T15:25:08Z

This TTT scheme leaks information: the code trains for multiple epochs on documents and uses the lowest score at the end of this training as the loss over that document. This is the same as training on the val set, and is therefore disallowed.

…, no LR floor - AdamW → Adam (both batched and serial per-doc LoRA paths) - Alpha scaling 2x→1x rank (scale=1.0, matching PR openai#611) - Remove gradient clipping from TTT loops - Remove 0.1 cosine LR floor (decay to 0) - Remove label smoothing from serial path Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bigbag mentioned this pull request Mar 24, 2026

Record: 0.6864 BPB — K-LoRA + Min-NLL + FlashAttention-3 #614

Closed

3 tasks

himanalot mentioned this pull request Mar 24, 2026

Invalid submissions due to information leakage during TTT #402

Open

valerio-oai closed this Mar 24, 2026

notapplica mentioned this pull request Mar 24, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean)#611

Record: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean)#611
teddyoweh wants to merge 1 commit intoopenai:mainfrom
teddyoweh:submission/chimera-ttt-k-lora-min-nll

teddyoweh commented Mar 24, 2026

Uh oh!

valerio-oai commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

teddyoweh commented Mar 24, 2026

Chimera TTT: K-Projection LoRA + Min-NLL Epoch Selection

Two Novel Innovations (built on PR #596)

Code Changes

Reproducibility

Uh oh!

valerio-oai commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants