
Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)#462

Closed
JoeProAI wants to merge 2 commits into openai:main from JoeProAI:joeproai/swiglu-xsa4-adamw-ttt-1.0672

Conversation

@JoeProAI

@JoeProAI JoeProAI commented Mar 22, 2026

Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)

3-seed mean val_bpb: 1.0672 | Best seed: 1.0658
Verified on 8xH100 80GB, 10-minute wall-clock budget.

Approach

Novel architecture discovered through GEPA (Gemini-driven Evolutionary Parameter Architecture search) combined with community-proven techniques. Built over 5 days, 6 waves of experiments, ~$250 total compute on Modal H100s.

Architecture (discovered by GEPA)

  • SwiGLU FFN with Star-ReLU activation
  • U-Net skip connections with learned gating
  • BigramHash embeddings (8192 buckets, 128 dim)
  • SmearGate on embeddings
  • 11 layers, 512 dim, 8 heads, 8 KV heads, MLP hidden=1792, tied embeddings
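
For readers unfamiliar with the combination, here is a minimal sketch of what "SwiGLU FFN with Star-ReLU activation" plausibly looks like, using the dim=512 / hidden=1792 figures from the list above and Star-ReLU as relu(x)²·scale + bias (per a later comment in this thread). The initial scale/bias constants and module names are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """Star-ReLU: relu(x)^2 * scale + bias with learned scalars.
    Init constants are illustrative, not taken from the PR."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(0.8944))
        self.bias = nn.Parameter(torch.tensor(-0.4472))

    def forward(self, x):
        return torch.relu(x).square() * self.scale + self.bias

class SwiGLUStarFFN(nn.Module):
    """Gated FFN: StarReLU(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim=512, hidden=1792):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)
        self.act = StarReLU()

    def forward(self, x):
        return self.w_down(self.act(self.w_gate(x)) * self.w_up(x))
```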

Training techniques (adopted + tuned)

3-Seed Results

| Seed | val_bpb |
|------|------------|
| 42   | 1.06733191 |
| 123  | 1.06833018 |
| 7    | 1.06579646 |
| **Mean** | 1.06715285 |
| **Std**  | 0.00104211 |

Comparison to prior SOTA

| Submission | Mean BPB | Best BPB |
|------------|----------|----------|
| Ours | 1.0672 | 1.0658 |
| @sjp611 (#442) | 1.1027 | 1.0992 |
| @felipe-parodi (#398) | 1.1221 | 1.1213 |
| @thwu1 (#180, merged) | 1.1428 | -- |

Key finding

AdamW TTT produced a 0.053 bpb improvement on our architecture vs 0.019 on the standard architecture (PR #398). This suggests SwiGLU + U-Net skip connections create a loss landscape that AdamW navigates significantly better than SGD during test-time training.
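
To make the comparison concrete, here is a hypothetical sketch of the kind of test-time-training step being discussed: fine-tune a next-token model on a token stream with AdamW at the reported lr=0.0005 for 10 epochs. The function name, model interface, and loss wiring are assumptions for illustration, not the PR's actual `ttt_adapt` implementation:

```python
import torch
import torch.nn as nn

def ttt_adapt(model, tokens, lr=5e-4, epochs=10):
    """Illustrative AdamW test-time training: fine-tune `model` on the
    next-token objective over `tokens` (1-D LongTensor).
    lr and epoch count match the values reported in this PR."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    inputs, targets = tokens[:-1], tokens[1:]
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(inputs.unsqueeze(0))            # (1, L, vocab)
        loss = nn.functional.cross_entropy(logits.squeeze(0), targets)
        loss.backward()
        opt.step()
    return model
```

Swapping `torch.optim.AdamW` for `torch.optim.SGD` (with the SGD hyperparameters quoted later in the thread) is the only change between the two TTT variants being compared.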

Credits

Built by @JoePro (GitHub: @JoeProAI) with AI agent assistance: OpenClaw (Claude Opus), Codex (GPT-5.4), Claude Sonnet, Gemini 2.5 Pro, and Paperclip agent coordination.

Run command

```bash
# Default seed
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Specific seed
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are set as defaults in train_gpt.py.

Disclosure

This work is independently funded with no sponsorship, grants, or funding from OpenAI or any other organization. All compute was self-funded on Modal (personal account).

Architecture discovered via GEPA (Gemini-driven evolutionary search).
SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4.
AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442).
EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410).
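
As a rough illustration of the "BigramHash 8192" component listed above: a hashed-bigram embedding maps each (previous, current) token pair into a fixed number of buckets and looks up a small auxiliary embedding. The hash multiplier, module name, and first-position handling below are assumptions, not the PR's actual code:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Illustrative hashed-bigram embedding: each (prev, cur) token pair
    is hashed into one of `buckets` rows of a small table; the result
    would be combined with the usual token embedding elsewhere."""
    def __init__(self, buckets=8192, dim=128):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, dim)

    def forward(self, tokens):
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = 0  # no real bigram at the first position
        idx = (prev * 1000003 + tokens) % self.buckets  # hash into buckets
        return self.table(idx)
```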

3-seed results: 1.06733 / 1.06833 / 1.06580
Mean: 1.06715, Std: 0.00104

Built by @joepro with AI agents via OpenClaw.
Compute provided by Modal.
@mohosy

mohosy commented Mar 23, 2026

1.0672 is crazy, the swiglu + unet skip combo is interesting af. how much of that is coming from the adamw ttt vs the architecture itself? 0.053 bpb from ttt alone is wild

mohosy pushed a commit to mohosy/parameter-golf that referenced this pull request Mar 23, 2026
Major rewrite based on latest meta (PRs openai#398, openai#442, openai#462):
- SwiGLU FFN with Star-ReLU (hidden=1792)
- U-Net skip connections with learned gating
- EMA (decay=0.9985) replacing SWA
- AdamW TTT (legal score-first protocol)
- Partial RoPE (16 dims)
- LN Scale (1/sqrt(layer_idx+1))
- BigramHash(8192) + SmearGate
- GPTQ-lite quantization
- DDP compile fix for multi-GPU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JoeProAI
Author

Good question — we have the ablation data across 6 waves of experiments.

Architecture without TTT (EMA + RoPE + LN Scale + Late QAT): 1.1320

Architecture + SGD TTT (30ep, lr=0.0065, cosine decay): 1.1209

Architecture + AdamW TTT (10ep, lr=0.0005): 1.0677

For comparison, @sjp611's AdamW TTT on the #398 base went from 1.1213 → 1.0992 (Δ = 0.019). On our architecture: 1.1209 → 1.0677 (Δ = 0.053).

The architecture is the multiplier — AdamW TTT produces ~2.8x more improvement on SwiGLU + U-Net skip connections than on the standard stack. Our working hypothesis is that the gated residual paths create smoother loss geometry that adaptive optimizers exploit more effectively during test-time adaptation.
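
The "gated residual paths" in this hypothesis refer to the learned gating on the U-Net skips. A minimal sketch of one such gate, purely illustrative (the PR's actual gating may be parameterized differently):

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """Illustrative U-Net-style skip with a learned per-channel gate:
    a decoder-side layer mixes in a cached encoder-side activation.
    sigmoid(0) = 0.5, so the skip starts half-open at init."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x, skip):
        return x + torch.sigmoid(self.gate) * skip
```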

@mohosy

mohosy commented Mar 23, 2026

that's insane ablation data, thanks for sharing. so swiglu + unet basically makes the loss landscape way smoother for ttt to exploit, that makes a lot of sense. 2.8x more improvement from the same ttt recipe is wild

mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
george11642 added a commit to george11642/parameter-golf that referenced this pull request Mar 23, 2026
Non-record submission building on PR openai#462's architecture with:
- XSA on all 11 layers (was 4)
- Cosine TTT 30 epochs with per-layer LR groups
- GPTQ-lite optimal clip percentile search
- Legal score-first TTT protocol
- Meta-TTT (FOMAML) in development

Awaiting compute for validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
het4rk added a commit to het4rk/parameter-golf that referenced this pull request Mar 23, 2026
Base: PR openai#462's SwiGLU + XSA4 + U-Net architecture (1.0672 BPB)
Novel additions (untried combination):
1. 25 TTT epochs (up from 10) - loss still dropping at epoch 10
2. Per-layer TTT LR by quantization sensitivity:
   - MLP output projections: 3x LR (highest quant damage)
   - MLP input projections: 0.5x LR
   - Everything else: 1x LR
3. DDP optimize_ddp fix for PyTorch 2.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
het4rk added a commit to het4rk/parameter-golf that referenced this pull request Mar 23, 2026
Analysis showed TTT trains on val tokens that get scored.
More epochs = memorizing val noise. PR openai#462 chose 10 deliberately.
Keep per-layer quant-sensitivity LR as sole novel contribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
PR #462 achieves 1.0672 BPB. Their key finding: switching TTT
optimizer from SGD to AdamW gives 5x more improvement (0.053 vs
0.011 BPB). AdamW's per-parameter adaptive LR handles the
heterogeneous update needs of attention/MLP/control params
naturally — exactly what we were trying to do manually.

New defaults (matching PR #462 recipe):
  TTT_OPTIMIZER=adamw (was implicit SGD)
  TTT_LR=0.0005 (was 0.002)
  TTT_EPOCHS=10 (was 3)
  TTT_FREEZE_BLOCKS=0 (was 2)

Fallback to SGD: TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_EPOCHS=3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
Takes PR #462's SwiGLU + U-Net + AdamW TTT architecture (1.0672 BPB)
and adds our proven quantization improvements:

1. GPTQ — Hessian-aware int6 with column reordering + optimal scales
2. Earlier QAT — threshold 0.15→0.5 for 3x more QAT steps
3. QAT percentile clipping — matches GPTQ export quantizer

Base architecture credit: @JoeProAI (PR #462)
AdamW TTT credit: @sjp611 (PR #442)
GPTQ integration: our contribution

Uses PyTorch native SDPA (no FA3 needed) — runs on any H100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
SwiGLU fork (PR #462 base) + GPTQ + OptRot + AdamW TTT = 1.0763 BPB
but artifact is 19.6MB (over 16MB limit). OptRot Hadamard rotation
hurts zstd compression. Next step: solve the size problem.

v7 GPTQ stack submitted as PR #508: 3-seed mean 1.1215 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
PR openai#462 achieves 1.0672 BPB. Their key finding: switching TTT
optimizer from SGD to AdamW gives 5x more improvement (0.053 vs
0.011 BPB). AdamW's per-parameter adaptive LR handles the
heterogeneous update needs of attention/MLP/control params
naturally — exactly what we were trying to do manually.

New defaults (matching PR openai#462 recipe):
  TTT_OPTIMIZER=adamw (was implicit SGD)
  TTT_LR=0.0005 (was 0.002)
  TTT_EPOCHS=10 (was 3)
  TTT_FREEZE_BLOCKS=0 (was 2)

Fallback to SGD: TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_EPOCHS=3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
Takes PR openai#462's SwiGLU + U-Net + AdamW TTT architecture (1.0672 BPB)
and adds our proven quantization improvements:

1. GPTQ — Hessian-aware int6 with column reordering + optimal scales
2. Earlier QAT — threshold 0.15→0.5 for 3x more QAT steps
3. QAT percentile clipping — matches GPTQ export quantizer

Base architecture credit: @JoeProAI (PR openai#462)
AdamW TTT credit: @sjp611 (PR openai#442)
GPTQ integration: our contribution

Uses PyTorch native SDPA (no FA3 needed) — runs on any H100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
SwiGLU fork (PR openai#462 base) + GPTQ + OptRot + AdamW TTT = 1.0763 BPB
but artifact is 19.6MB (over 16MB limit). OptRot Hadamard rotation
hurts zstd compression. Next step: solve the size problem.

v7 GPTQ stack submitted as PR openai#508: 3-seed mean 1.1215 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
PR openai#462 achieves 1.0672 BPB. Their key finding: switching TTT
optimizer from SGD to AdamW gives 5x more improvement (0.053 vs
0.011 BPB). AdamW's per-parameter adaptive LR handles the
heterogeneous update needs of attention/MLP/control params
naturally — exactly what we were trying to do manually.

New defaults (matching PR openai#462 recipe):
  TTT_OPTIMIZER=adamw (was implicit SGD)
  TTT_LR=0.0005 (was 0.002)
  TTT_EPOCHS=10 (was 3)
  TTT_FREEZE_BLOCKS=0 (was 2)

Fallback to SGD: TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_EPOCHS=3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bentenavery added a commit to bentenavery/parameter-golf that referenced this pull request Mar 24, 2026
…LU+TTT+XSA+EMA)

- train_avery.py: PR openai#462 code + DifferentialAttention class (arXiv:2410.05258)
- DiffAttn applied to layers 0-6 (7 of 11 layers); layers 7-10 keep standard+XSA
- Same parameter count as standard attention (Q/K/V/proj unchanged, +56 lambda floats)
- Lambda init per paper depth formula: 0.8 - 0.6*exp(-0.3*layer_idx)
- All PR#462 features retained: EMA, TTT, int6+zstd, late-QAT, bigram hash, smear gate
- train_avery_notes.md: strategy notes, tuning guide, parameter budget analysis
@JoeProAI
Author

Thanks mohosy, and to everyone who's been building on this work. Genuinely grateful to be part of this competition -- seeing the community take the architecture further and run their own ablations has been the best part of the whole thing. Good luck to everyone still in it.

@valerio-oai
Contributor

valerio-oai commented Mar 24, 2026

As far as I can tell here, this proposed TTT scheme trains on the validation set by reporting the score on a doc after its weights have adapted to it, rendering this unsound for the purposes of this competition. Specifically, around lines 1417-1428 the code calls ttt_adapt(..., val_tokens, ...) before final evaluation, and ttt_adapt itself (around lines 938-996) iterates over val_tokens and does optimizer.step().
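
The distinction at issue can be sketched abstractly. In the pattern valerio-oai describes, weights adapt to a document before it is scored; a "score-first" protocol (the phrase appears in other commits in this thread) records each document's score before updating on it. The function names and callables here are placeholders, not the repository's actual code:

```python
def score_first_ttt(docs, score, adapt):
    """Score-first TTT: for each validation document, record the score
    BEFORE the model updates on it, so no document's score reflects
    weights that have already trained on that document.
    `score` and `adapt` are placeholders for the real eval/update steps."""
    totals = []
    for doc in docs:
        totals.append(score(doc))  # evaluate with current weights first
        adapt(doc)                 # only then train on this document
    return sum(totals) / len(totals)
```

The unsound variant simply swaps the two lines in the loop, which is what the quoted `ttt_adapt(..., val_tokens, ...)`-before-evaluation call sequence amounts to.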

haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
Key findings from openai#462:
- 8 KV heads (full MHA, not GQA) = doubles attention capacity
- Star-ReLU (relu²*scale+bias) = learned affine per neuron
- MLP hidden=1792 (3.5x) = 17% more MLP capacity
- seq_len=1024 training = faster steps, more total steps
- BigramHash 8192 = 4x hash resolution
- No grad clipping, decoder LR 2x, EMA 0.9985
- AdamW TTT with cosine decay = -0.053 BPB (rule-questionable)

Neptune config: openai#462 arch (no TTT) + GPTQ per-row → estimated 1.118-1.123

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
PR#462 base with FINAL council modifications:
- NUM_KV_HEADS: 8→4 (GQA, proven quant quality)
- BIGRAM_BUCKETS: 8192→4096 (compromise for quant)
- Star-ReLU MLP1792 (kept from openai#462)
- Adaptive bitwidth: int7 for deep Q,K; int5 for early MLP; int6 default
- GPTQ per-row: 5 percentile search per row
- Backout connection: learned residual subtraction from mid-layer
- TTT_ENABLED=0 (rule compliance)
- seq_len=1024, WD=6000, EMA=0.9985, QAT=0.15
- Decoder LR 2x, no grad clip

Expected: 1.111-1.116 BPB (beats non-TTT SOTA 1.123)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
