Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)#462
Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)#462JoeProAI wants to merge 2 commits intoopenai:mainfrom
Conversation
Architecture discovered via GEPA (Gemini-driven evolutionary search). SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4. AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442). EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410). 3-seed results: 1.06733 / 1.06833 / 1.06580 Mean: 1.06715, Std: 0.00104 Built by @joepro with AI agents via OpenClaw. Compute provided by Modal.
a4faacd to
7e2f938
Compare
|
1.0672 is crazy, the swiglu + unet skip combo is interesting af. how much of that is coming from the adamw ttt vs the architecture itself? 0.053 bpb from ttt alone is wild |
Major rewrite based on latest meta (PRs openai#398, openai#442, openai#462): - SwiGLU FFN with Star-ReLU (hidden=1792) - U-Net skip connections with learned gating - EMA (decay=0.9985) replacing SWA - AdamW TTT (legal score-first protocol) - Partial RoPE (16 dims) - LN Scale (1/sqrt(layer_idx+1)) - BigramHash(8192) + SmearGate - GPTQ-lite quantization - DDP compile fix for multi-GPU Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Good question — we have the ablation data across 6 waves of experiments. Architecture without TTT (EMA + RoPE + LN Scale + Late QAT): 1.1320 Architecture + SGD TTT (30ep, lr=0.0065, cosine decay): 1.1209 Architecture + AdamW TTT (10ep, lr=0.0005): 1.0677 For comparison, @sjp611's AdamW TTT on the #398 base went from 1.1213 → 1.0992 (Δ = 0.019). On our architecture: 1.1209 → 1.0677 (Δ = 0.053). The architecture is the multiplier — AdamW TTT produces ~2.8x more improvement on SwiGLU + U-Net skip connections than on the standard stack. Our working hypothesis is that the gated residual paths create smoother loss geometry that adaptive optimizers exploit more effectively during test-time adaptation. |
|
thats insane ablation data thanks for sharing. so swiglu + unet basically makes the loss landscape way smoother for ttt to exploit, that makes alot of sense. 2.8x more improvement from the same ttt recipe is wild |
…rchitecture + cosine TTT
Non-record submission building on PR openai#462's architecture with: - XSA on all 11 layers (was 4) - Cosine TTT 30 epochs with per-layer LR groups - GPTQ-lite optimal clip percentile search - Legal score-first TTT protocol - Meta-TTT (FOMAML) in development Awaiting compute for validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base: PR openai#462's SwiGLU + XSA4 + U-Net architecture (1.0672 BPB) Novel additions (untried combination): 1. 25 TTT epochs (up from 10) - loss still dropping at epoch 10 2. Per-layer TTT LR by quantization sensitivity: - MLP output projections: 3x LR (highest quant damage) - MLP input projections: 0.5x LR - Everything else: 1x LR 3. DDP optimize_ddp fix for PyTorch 2.4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Analysis showed TTT trains on val tokens that get scored. More epochs = memorizing val noise. PR openai#462 chose 10 deliberately. Keep per-layer quant-sensitivity LR as sole novel contribution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #462 achieves 1.0672 BPB. Their key finding: switching TTT optimizer from SGD to AdamW gives 5x more improvement (0.053 vs 0.011 BPB). AdamW's per-parameter adaptive LR handles the heterogeneous update needs of attention/MLP/control params naturally — exactly what we were trying to do manually. New defaults (matching PR #462 recipe): TTT_OPTIMIZER=adamw (was implicit SGD) TTT_LR=0.0005 (was 0.002) TTT_EPOCHS=10 (was 3) TTT_FREEZE_BLOCKS=0 (was 2) Fallback to SGD: TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_EPOCHS=3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Takes PR #462's SwiGLU + U-Net + AdamW TTT architecture (1.0672 BPB) and adds our proven quantization improvements: 1. GPTQ — Hessian-aware int6 with column reordering + optimal scales 2. Earlier QAT — threshold 0.15→0.5 for 3x more QAT steps 3. QAT percentile clipping — matches GPTQ export quantizer Base architecture credit: @JoeProAI (PR #462) AdamW TTT credit: @sjp611 (PR #442) GPTQ integration: our contribution Uses PyTorch native SDPA (no FA3 needed) — runs on any H100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SwiGLU fork (PR #462 base) + GPTQ + OptRot + AdamW TTT = 1.0763 BPB but artifact is 19.6MB (over 16MB limit). OptRot Hadamard rotation hurts zstd compression. Next step: solve the size problem. v7 GPTQ stack submitted as PR #508: 3-seed mean 1.1215 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#462 achieves 1.0672 BPB. Their key finding: switching TTT optimizer from SGD to AdamW gives 5x more improvement (0.053 vs 0.011 BPB). AdamW's per-parameter adaptive LR handles the heterogeneous update needs of attention/MLP/control params naturally — exactly what we were trying to do manually. New defaults (matching PR openai#462 recipe): TTT_OPTIMIZER=adamw (was implicit SGD) TTT_LR=0.0005 (was 0.002) TTT_EPOCHS=10 (was 3) TTT_FREEZE_BLOCKS=0 (was 2) Fallback to SGD: TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_EPOCHS=3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Takes PR openai#462's SwiGLU + U-Net + AdamW TTT architecture (1.0672 BPB) and adds our proven quantization improvements: 1. GPTQ — Hessian-aware int6 with column reordering + optimal scales 2. Earlier QAT — threshold 0.15→0.5 for 3x more QAT steps 3. QAT percentile clipping — matches GPTQ export quantizer Base architecture credit: @JoeProAI (PR openai#462) AdamW TTT credit: @sjp611 (PR openai#442) GPTQ integration: our contribution Uses PyTorch native SDPA (no FA3 needed) — runs on any H100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SwiGLU fork (PR openai#462 base) + GPTQ + OptRot + AdamW TTT = 1.0763 BPB but artifact is 19.6MB (over 16MB limit). OptRot Hadamard rotation hurts zstd compression. Next step: solve the size problem. v7 GPTQ stack submitted as PR openai#508: 3-seed mean 1.1215 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#462 achieves 1.0672 BPB. Their key finding: switching TTT optimizer from SGD to AdamW gives 5x more improvement (0.053 vs 0.011 BPB). AdamW's per-parameter adaptive LR handles the heterogeneous update needs of attention/MLP/control params naturally — exactly what we were trying to do manually. New defaults (matching PR openai#462 recipe): TTT_OPTIMIZER=adamw (was implicit SGD) TTT_LR=0.0005 (was 0.002) TTT_EPOCHS=10 (was 3) TTT_FREEZE_BLOCKS=0 (was 2) Fallback to SGD: TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_EPOCHS=3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…LU+TTT+XSA+EMA) - train_avery.py: PR openai#462 code + DifferentialAttention class (arXiv:2410.05258) - DiffAttn applied to layers 0-6 (7 of 11 layers); layers 7-10 keep standard+XSA - Same parameter count as standard attention (Q/K/V/proj unchanged, +56 lambda floats) - Lambda init per paper depth formula: 0.8 - 0.6*exp(-0.3*layer_idx) - All PR#462 features retained: EMA, TTT, int6+zstd, late-QAT, bigram hash, smear gate - train_avery_notes.md: strategy notes, tuning guide, parameter budget analysis
|
Thanks mohosy, and to everyone who's been building on this work. Genuinely grateful to be part of this competition -- seeing the community take the architecture further and run their own ablations has been the best part of the whole thing. Good luck to everyone still in it. |
|
As far as I can tell here, this proposed TTT scheme trains on the validation set by reporting the score on a doc after its weights have adapted to it, rendering this unsound for the purposes of this competition. Specifically, around lines 1417-1428 the code calls ttt_adapt(..., val_tokens, ...) before final evaluation, and ttt_adapt itself (around lines 938-996) iterates over val_tokens and does optimizer.step(). |
Key findings from openai#462: - 8 KV heads (full MHA, not GQA) = doubles attention capacity - Star-ReLU (relu²*scale+bias) = learned affine per neuron - MLP hidden=1792 (3.5x) = 17% more MLP capacity - seq_len=1024 training = faster steps, more total steps - BigramHash 8192 = 4x hash resolution - No grad clipping, decoder LR 2x, EMA 0.9985 - AdamW TTT with cosine decay = -0.053 BPB (rule-questionable) Neptune config: openai#462 arch (no TTT) + GPTQ per-row → estimated 1.118-1.123 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR#462 base with FINAL council modifications: - NUM_KV_HEADS: 8→4 (GQA, proven quant quality) - BIGRAM_BUCKETS: 8192→4096 (compromise for quant) - Star-ReLU MLP1792 (kept from openai#462) - Adaptive bitwidth: int7 for deep Q,K; int5 for early MLP; int6 default - GPTQ per-row: 5 percentile search per row - Backout connection: learned residual subtraction from mid-layer - TTT_ENABLED=0 (rule compliance) - seq_len=1024, WD=6000, EMA=0.9985, QAT=0.15 - Decoder LR 2x, no grad clip Expected: 1.111-1.116 BPB (beats non-TTT SOTA 1.123) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)
3-seed mean val_bpb: 1.0672 | Best seed: 1.0658
Verified on 8xH100 80GB, 10-minute wall-clock budget.
Approach
Novel architecture discovered through GEPA (Gemini-driven Evolutionary Parameter Architecture search) combined with community-proven techniques. Built over 5 days, 6 waves of experiments, ~$250 total compute on Modal H100s.
Architecture (discovered by GEPA)
Training techniques (adopted + tuned)
3-Seed Results
Comparison to prior SOTA
Key finding
AdamW TTT produced a 0.053 bpb improvement on our architecture vs 0.019 on the standard architecture (PR #398). This suggests SwiGLU + U-Net skip connections create a loss landscape that AdamW navigates significantly better than SGD during test-time training.
Credits
Built by @JoePro (GitHub: @JoeProAI) with AI agent assistance: OpenClaw (Claude Opus), Codex (GPT-5.4), Claude Sonnet, Gemini 2.5 Pro, and Paperclip agent coordination.
Run command
All hyperparameters are set as defaults in
train_gpt.py.Disclosure
This work is independently funded with no sponsorship, grants, or funding from OpenAI or any other organization. All compute was self-funded on Modal (personal account).