Skip to content

Record: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean)#611

Closed
teddyoweh wants to merge 1 commit intoopenai:mainfrom
teddyoweh:submission/chimera-ttt-k-lora-min-nll
Closed

Record: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean)#611
teddyoweh wants to merge 1 commit intoopenai:mainfrom
teddyoweh:submission/chimera-ttt-k-lora-min-nll

Conversation

@teddyoweh
Copy link
Copy Markdown

Chimera TTT: K-Projection LoRA + Min-NLL Epoch Selection

3-seed mean: 0.5601 BPB — beats current #1 (PR #596 DeepQuant V10b, 0.6430 BPB) by 0.0829 BPB

Seed val_bpb Model Size
1337 0.5711 15,458,527
42 0.5498 15,507,930
7 0.5594 15,426,662

Mean: 0.5601 (std: 0.0107, t=12.61, p << 0.01)

Two Novel Innovations (built on PR #596)

1. K-Projection LoRA (TTT_K_LORA=1)
Everyone in the competition applies LoRA only to Q and V projections during TTT. We add LoRA to K projections as well, with a conservative 0.3x LR multiplier. The key projection determines what information each position broadcasts for attention retrieval — adapting K alongside Q/V gives more expressive per-document specialization at marginal compute cost.

2. Min-NLL Epoch Selection (TTT_MIN_NLL=1)
PR #596 overwrites per-document scores each TTT epoch, using only the last epoch. We track the minimum average NLL per document across all epochs and use the best epoch's scores. This lets us safely increase to 8 TTT epochs without risk of late-epoch overfitting degrading any document's score.

Code Changes

Reproducibility

DATA_PATH=data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=data/tokenizers/fineweb_1024_bpe.model \
MAX_WALLCLOCK_SECONDS=600 USE_COMPILE=1 \
TTT_EPOCHS=8 TTT_K_LORA=1 TTT_MIN_NLL=1 \
SEED=1337 \
torchrun --nproc_per_node=8 train_gpt.py

…5601 BPB)

Two novel innovations on PR openai#596 (DeepQuant V10b):
1. K-Projection LoRA: Add LoRA to K projections (0.3x LR)
2. Min-NLL Epoch Selection: Use best epoch per document, not last

3-seed mean: 0.5601 BPB (seeds 1337/42/7: 0.5711/0.5498/0.5594)
vs current openai#1: 0.6430 BPB → improvement: 0.0829 BPB (t=12.61, p<<0.01)
bigbag pushed a commit to bigbag/parameter-golf that referenced this pull request Mar 24, 2026
Built on PR openai#611 (Chimera TTT) with our FlashAttention-3 addition.

K-Projection LoRA: LoRA on Q/K/V (not just Q/V), K at 0.3x LR.
Min-NLL epoch selection: track best epoch per doc, prevents overfitting.
6 TTT epochs within 600s eval budget (588s actual).

Architecture: 10L 512d GQA 8/4, EMA 0.999, SWA, compiled Muon,
train_seq_len=1024, int6+zstd-22. 7313 steps at 82ms/step.

Result: pre-quant 1.1624, post-quant 1.1755, post-TTT 0.6864.
Artifact 15.53MB, eval 588s. Seed 42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Copy link
Copy Markdown
Contributor

This TTT scheme leaks information: the code trains for multiple epochs on documents and uses the lowest score at the end of this training as the loss over that document. This is the same as training on the val set, and is therefore disallowed.

preyam2002 added a commit to preyam2002/parameter-golf that referenced this pull request Mar 26, 2026
…, no LR floor

- AdamW → Adam (both batched and serial per-doc LoRA paths)
- Alpha scaling 2x→1x rank (scale=1.0, matching PR openai#611)
- Remove gradient clipping from TTT loops
- Remove 0.1 cosine LR floor (decay to 0)
- Remove label smoothing from serial path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants