Skip to content

Record: 0.7227 BPB — 10L LoRA TTT 6ep + FlashAttention-3#605

Closed
bigbag wants to merge 5 commits intoopenai:mainfrom
bigbag:submission/lora-ttt-fa3-0.7227
Closed

Record: 0.7227 BPB — 10L LoRA TTT 6ep + FlashAttention-3#605
bigbag wants to merge 5 commits intoopenai:mainfrom
bigbag:submission/lora-ttt-fa3-0.7227

Conversation

@bigbag
Copy link
Copy Markdown

@bigbag bigbag commented Mar 24, 2026

Summary

  • val_bpb: 0.7227 (single seed, 8xH100 SXM)
  • val_loss: 1.2203
  • Artifact: 15.45 MB (96.5% of 16MB limit)
  • Training: 600s (7274 steps at 82.5ms/step)
  • Eval (TTT + scoring): 569s (within 600s budget)

Architecture

  • 10L, 512d, GQA 8/4, MLP 3x with ReLU-squared
  • EMA (decay 0.999, every 10 steps) + SWA (11 checkpoints)
  • SmearGate, BigramHash(2048), U-Net skip connections
  • Late QAT, int6 uniform quantization + zstd-22
  • Compiled Muon Newton-Schulz optimizer, train_seq_len=1024

LoRA TTT (per-document adaptation)

  • Rank-8 LoRA on Q/V projections + rank-16 on LM-head
  • Per-block bias tuning during TTT
  • Per-document reset at BOS boundaries (batched 64 docs/GPU)
  • Adam optimizer, lr=0.01, 6 epochs per document batch
  • Per-step cosine LR decay, gradient clipping 1.0
  • Post-TTT temperature rescaling (T=0.98)
  • Wall-clock deadline 550s with base-model fallback for remaining docs
  • Score-every-epoch (backward-looking, Issue Invalid submissions due to information leakage during TTT #402 compliant)

Our additions over base (PR #596)

  • FlashAttention-3 (flash_attn_func) — 3% faster attention on H100
  • Rotary cache .clone() fix — resolves CUDA graph conflict with FA3

Results breakdown

  • Pre-quantization BPB: 1.1621
  • Post-quantization BPB: 1.1750 (quant gap: 0.013)
  • Post-TTT BPB: 0.7227 (TTT gain: 0.4522)
  • Short docs (no TTT): 2393 docs, 27s
  • Long docs (LoRA TTT): 3857 docs, 61 batches, 377s

Test plan

  • Artifact under 16MB (15.45MB)
  • Training under 600s (600s)
  • Eval under 600s (569s)

Pavel Liashkov and others added 5 commits March 24, 2026 14:48
Based on PR openai#596 (DeepQuant V10b) with FlashAttention-3 addition.

Architecture: 10L 512d GQA 8/4, EMA 0.999, SWA, Late QAT,
SmearGate, BigramHash(2048), compiled Muon Newton-Schulz.

LoRA TTT: rank-8 Q/V + rank-16 LM-head, per-block bias tuning,
per-document adaptation (BOS boundaries), batched 64 docs/GPU,
Adam lr=0.01, 6 epochs, per-step cosine LR, temperature 0.98,
wall-clock deadline 550s with base-model fallback.

Hardware: FlashAttention-3 (flash_attn_func), Rotary cache
.clone() fix for CUDA graph compatibility, train_seq_len=1024.

Result: 7274 steps at 82.5ms/step, pre-quant 1.1621 BPB,
post-quant 1.1750, post-TTT 0.7227. Artifact 15.4MB, eval 569s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Copy link
Copy Markdown
Contributor

This TTT scheme leaks information: the code trains for multiple epochs on documents and uses the lowest score at the end of this training as the loss over that document. This is the same as training on the val set, and is therefore disallowed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants