
Record: 11L EMA + AdamW TTT 10ep (mean val_bpb=1.1027)#442

Closed
sjp611 wants to merge 1 commit into openai:main from sjp611:submission/11L-EMA-AdamWTTT10ep-1.1027

Conversation


sjp611 commented Mar 22, 2026

Summary

Approach

Built on PR #398 (11L EMA + SGD TTT 20ep). Single change: SGD → AdamW for TTT.

-    ttt_lr = float(os.environ.get("TTT_LR", 0.008))
-    ttt_epochs = int(os.environ.get("TTT_EPOCHS", 20))
+    ttt_lr = float(os.environ.get("TTT_LR", 0.0005))
+    ttt_epochs = int(os.environ.get("TTT_EPOCHS", 10))

-    optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum)
+    optimizer = torch.optim.AdamW(ttt_params, lr=args.ttt_lr, weight_decay=0.0)

All other settings identical to PR #398 (11L, EMA 0.997, SmearGate, BigramHash 2048, int6+zstd).
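The optimizer swap in the diff above amounts to the following minimal sketch. The model, `ttt_params`, and the adaptation loss here are toy stand-ins: in the real train_gpt.py, `ttt_params` would be a subset of the 11L model's weights and the objective would be the LM loss on the eval document.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the TTT-adapted parameters (illustrative only).
model = nn.Linear(16, 16)
ttt_params = list(model.parameters())

# The single change from PR #398: AdamW (weight decay off) instead of
# SGD+momentum, with lr dropped 0.008 -> 0.0005 and epochs 20 -> 10.
optimizer = torch.optim.AdamW(ttt_params, lr=0.0005, weight_decay=0.0)

x = torch.randn(8, 16)
losses = []
for epoch in range(10):  # ttt_epochs = 10
    optimizer.zero_grad()
    loss = (model(x) - x).pow(2).mean()  # placeholder adaptation objective
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

AdamW's per-parameter step normalization is plausibly why it tolerates a 16x smaller learning rate and half the epochs of the SGD baseline.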

Results (3-seed, sliding window stride=64)

Seed      Steps   val_bpb
1337      4372    1.1060
42        4578    1.0992
7         4612    1.1030
Mean±Std          1.1027 ± 0.0034

Comparison to prior SOTA (PR #398)

Metric          PR #398                  Ours
Mean BPB        1.1221                   1.1027
Best BPB        1.1213                   1.0992
TTT optimizer   SGD(lr=0.008, mom=0.9)   AdamW(lr=0.0005, wd=0.0)
TTT epochs      20                       10
TTT time        ~260s                    ~157s

Run command

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

(all hyperparameters are set as defaults in train_gpt.py)

Replace SGD with AdamW for test-time training. 3-line diff from PR openai#398.
Mean val_bpb 1.1027 (3-seed), best 1.0992. Beats the prior SOTA mean of 1.1221 by 0.019.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 22, 2026
Si6gma added a commit to Si6gma/parameter-golf that referenced this pull request Mar 22, 2026
- DEEP_RESEARCH_PROMPT.md: Copy-paste prompts for Claude
- RECENT_PR_ANALYSIS.md: Analysis of latest PRs (openai#442-openai#454)
- Research priorities: Catalytic Residuals, Late QAT, 12L
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442)
flagged as potentially invalid for adapting on eval tokens BEFORE scoring them.
Added correct score-then-adapt protocol with implementation guide.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
het4rk added a commit to het4rk/parameter-golf that referenced this pull request Mar 22, 2026
16.15MB was 153KB over limit with BigramHash(4096).
BigramHash(2048) saves ~260KB, matching the 1.1027 winner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JoeProAI pushed a commit to JoeProAI/parameter-golf that referenced this pull request Mar 22, 2026
Architecture discovered via GEPA (Gemini-driven evolutionary search).
SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4.
AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442).
EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410).

3-seed results: 1.06733 / 1.06833 / 1.06580
Mean: 1.06715, Std: 0.00104

Built by @joepro with AI agents via OpenClaw.
Compute provided by Modal.
JoeProAI added a commit to JoeProAI/parameter-golf that referenced this pull request Mar 22, 2026
het4rk added a commit to het4rk/parameter-golf that referenced this pull request Mar 22, 2026
Built on the 1.1027 BPB winner (PR openai#442) with three novel TTT improvements:

1. Importance-weighted loss: upweight hard tokens (high per-token NLL)
   using importance sampling. Focuses adaptation on where it matters most.
   weights = clamp(nll_i / mean_nll, 0.2, 5.0), normalized to mean=1.

2. Discriminative layer-wise LR: deeper layers get higher LR (0.3x-1.0x
   base). Deep layers are most task-specific, benefit most from adaptation.
   Inspired by ULMFiT discriminative fine-tuning.

3. Cosine LR decay across TTT epochs: prevents overshooting in later
   epochs. Decays from 1.0x to 0.1x over 10 epochs.

Base is PROVEN (11L, 512d, 3x MLP, GQA, SmearGate, BigramHash, EMA,
Muon, int6+zstd-22, sliding eval). Size verified at 15.75MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
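The three rules quoted in that commit can be sketched as standalone Python helpers. These are illustrative reconstructions from the description only, not the actual submission code; the function names and the layer-interpolation detail are assumptions.

```python
import math

def importance_weights(nlls, lo=0.2, hi=5.0):
    """Per-token loss weights: clamp(nll_i / mean_nll, lo, hi), renormalized to mean 1."""
    mean_nll = sum(nlls) / len(nlls)
    w = [min(max(n / mean_nll, lo), hi) for n in nlls]
    mean_w = sum(w) / len(w)
    return [wi / mean_w for wi in w]

def layerwise_lr(layer_idx, n_layers, base_lr=0.0005, lo=0.3):
    """Discriminative LR: scale from lo*base (layer 0) up to base (deepest layer)."""
    frac = layer_idx / max(n_layers - 1, 1)
    return base_lr * (lo + (1.0 - lo) * frac)

def cosine_lr_mult(epoch, total_epochs=10, lo=0.1, hi=1.0):
    """Cosine decay of the LR multiplier from hi to lo across TTT epochs."""
    t = epoch / max(total_epochs - 1, 1)
    return lo + 0.5 * (hi - lo) * (1.0 + math.cos(math.pi * t))
```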
mohosy pushed a commit to mohosy/parameter-golf that referenced this pull request Mar 23, 2026
Major rewrite based on latest meta (PRs openai#398, openai#442, openai#462):
- SwiGLU FFN with Star-ReLU (hidden=1792)
- U-Net skip connections with learned gating
- EMA (decay=0.9985) replacing SWA
- AdamW TTT (legal score-first protocol)
- Partial RoPE (16 dims)
- LN Scale (1/sqrt(layer_idx+1))
- BigramHash(8192) + SmearGate
- GPTQ-lite quantization
- DDP compile fix for multi-GPU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
- TTT: 5 epochs at lr=0.0005 (matching SOTA PR openai#442)
- use DDP model for TTT forward pass to sync gradients across GPUs
- shard validation tokens across ranks for proper distributed TTT
- batch size 4 seqs/GPU, modal timeout 1800s
amaljithkuttamath added a commit to amaljithkuttamath/parameter-golf that referenced this pull request Mar 23, 2026
Value Residual + Gated Attention on PR openai#442 stack.
Single seed (1337), 8xH100 SXM, 14.2 MB artifact.
het4rk added a commit to het4rk/parameter-golf that referenced this pull request Mar 23, 2026
Math validation proved our novel TTT (importance weighting, cosine
decay, discriminative LR) would hurt on a strong base model.
Reverted to flat-LR AdamW TTT which is proven at 1.1027 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
Takes PR #462's SwiGLU + U-Net + AdamW TTT architecture (1.0672 BPB)
and adds our proven quantization improvements:

1. GPTQ — Hessian-aware int6 with column reordering + optimal scales
2. Earlier QAT — threshold 0.15→0.5 for 3x more QAT steps
3. QAT percentile clipping — matches GPTQ export quantizer

Base architecture credit: @JoeProAI (PR #462)
AdamW TTT credit: @sjp611 (PR #442)
GPTQ integration: our contribution

Uses PyTorch native SDPA (no FA3 needed) — runs on any H100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lukacf added a commit to lukacf/parameter-golf-submission that referenced this pull request Mar 23, 2026
3-seed mean: 0.9789 BPB (sliding window stride=64)
Best seed: 0.9779 (seed 7)
Std: 0.0015

Key innovation: Autonomous ML research methodology.
AI coding agent discovered cosine LR scaling for TTT in a single
2-hour session — 7 experiments from hypothesis to record.

Technical: CosineAnnealingLR over 100 TTT epochs (3-line change).
Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lukacf added a commit to lukacf/parameter-golf-submission that referenced this pull request Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
Takes PR openai#462's SwiGLU + U-Net + AdamW TTT architecture (1.0672 BPB)
and adds our proven quantization improvements:

1. GPTQ — Hessian-aware int6 with column reordering + optimal scales
2. Earlier QAT — threshold 0.15→0.5 for 3x more QAT steps
3. QAT percentile clipping — matches GPTQ export quantizer

Base architecture credit: @JoeProAI (PR openai#462)
AdamW TTT credit: @sjp611 (PR openai#442)
GPTQ integration: our contribution

Uses PyTorch native SDPA (no FA3 needed) — runs on any H100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Contributor

As far as I can tell, this proposed TTT scheme trains on the validation set: it reports the score on a document after the weights have adapted to that document, rendering it unsound for the purposes of this competition.

RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 25, 2026
- Replace relu().square() with leaky_relu(0.5).square() in MLP
  Expected: -0.0015 BPB (5 independent teams confirm)
- Switch TTT optimizer from SGD to AdamW(lr=0.0005, wd=0, betas=0.9/0.95)
  Expected: stronger TTT adaptation per openai#442/openai#503/openai#545
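The activation swap in the first bullet is a one-line change; a toy comparison of the two activations (illustrative, not the repo's MLP code):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

relu_sq = F.relu(x).square()               # original: zero output and gradient for x < 0
leaky_sq = F.leaky_relu(x, 0.5).square()   # proposed: (0.5 * x)^2 for x < 0

# The two agree for x >= 0; the leaky variant keeps a signal on the negative side.
```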
haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
…nai#400 openai#369 openai#398)

KEY DISCOVERY: PR#414 stacks EMA + Tight SWA together (-0.0006 BPB free)
GPTQ should be per-ROW not per-matrix (-0.0006 BPB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
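"Per-row" here means one quantization scale per output row instead of a single scale for the whole matrix. A minimal numpy sketch of symmetric per-row int6 quantization, as an illustration of the idea rather than the repo's GPTQ code:

```python
import numpy as np

def quantize_int6_per_row(W):
    """Symmetric int6 quantization with one scale per row.

    A per-matrix scheme would use a single scale abs(W).max() / 31 for the
    whole matrix; per-row scales track each row's own range, which cuts
    rounding error on rows whose weights are small.
    """
    levels = 31  # symmetric int6 range: [-31, 31]
    scales = np.abs(W).max(axis=1, keepdims=True) / levels
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.rint(W / scales), -levels, levels).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales
```

Per-element reconstruction error is bounded by half a scale step, so a row of small weights no longer inherits the coarse step size set by the matrix's largest entry.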