
Record: 11L EMA + AdamW TTT 10ep (mean val_bpb=1.1027)#442

Closed
sjp611 wants to merge 1 commit into openai:main from sjp611:submission/11L-EMA-AdamWTTT10ep-1.1027

Conversation


sjp611 commented Mar 22, 2026

Summary

Approach

Built on PR #398 (11L EMA + SGD TTT 20ep). Single change: SGD → AdamW for TTT.

-    ttt_lr = float(os.environ.get("TTT_LR", 0.008))
-    ttt_epochs = int(os.environ.get("TTT_EPOCHS", 20))
+    ttt_lr = float(os.environ.get("TTT_LR", 0.0005))
+    ttt_epochs = int(os.environ.get("TTT_EPOCHS", 10))

-    optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum)
+    optimizer = torch.optim.AdamW(ttt_params, lr=args.ttt_lr, weight_decay=0.0)

All other settings identical to PR #398 (11L, EMA 0.997, SmearGate, BigramHash 2048, int6+zstd).
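The optimizer swap in the diff above amounts to the following minimal sketch. The model, `ttt_params`, and the adaptation loss here are toy stand-ins: in the real train_gpt.py, `ttt_params` would be a subset of the 11L model's weights and the objective would be the LM loss on the eval document.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the TTT-adapted parameters (illustrative only).
model = nn.Linear(16, 16)
ttt_params = list(model.parameters())

# The single change from PR #398: AdamW (weight decay off) instead of
# SGD+momentum, with lr dropped 0.008 -> 0.0005 and epochs 20 -> 10.
optimizer = torch.optim.AdamW(ttt_params, lr=0.0005, weight_decay=0.0)

x = torch.randn(8, 16)
losses = []
for epoch in range(10):  # ttt_epochs = 10
    optimizer.zero_grad()
    loss = (model(x) - x).pow(2).mean()  # placeholder adaptation objective
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

AdamW's per-parameter step normalization is plausibly why it tolerates a 16x smaller learning rate and half the epochs of the SGD baseline.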

Results (3-seed, sliding window stride=64)

Seed      Steps   val_bpb
1337      4372    1.1060
42        4578    1.0992
7         4612    1.1030
Mean±Std          1.1027 ± 0.0034

Comparison to prior SOTA (PR #398)

Metric          PR #398                  Ours
Mean BPB        1.1221                   1.1027
Best BPB        1.1213                   1.0992
TTT optimizer   SGD(lr=0.008, mom=0.9)   AdamW(lr=0.0005, wd=0.0)
TTT epochs      20                       10
TTT time        ~260s                    ~157s

Run command

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

(all hyperparameters are set as defaults in train_gpt.py)

Replace SGD with AdamW for test-time training. 3-line diff from PR openai#398.
Mean val_bpb 1.1027 (3-seed), best 1.0992. Beats the prior SOTA mean of 1.1221 by 0.019.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 22, 2026
Si6gma added a commit to Si6gma/parameter-golf that referenced this pull request Mar 22, 2026
- DEEP_RESEARCH_PROMPT.md: Copy-paste prompts for Claude
- RECENT_PR_ANALYSIS.md: Analysis of latest PRs (openai#442-openai#454)
- Research priorities: Catalytic Residuals, Late QAT, 12L
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442)
flagged as potentially invalid for adapting on eval tokens BEFORE scoring them.
Added correct score-then-adapt protocol with implementation guide.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
het4rk added a commit to het4rk/parameter-golf that referenced this pull request Mar 22, 2026
16.15MB was 153KB over limit with BigramHash(4096).
BigramHash(2048) saves ~260KB, matching the 1.1027 winner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JoeProAI pushed a commit to JoeProAI/parameter-golf that referenced this pull request Mar 22, 2026
Architecture discovered via GEPA (Gemini-driven evolutionary search).
SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4.
AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442).
EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410).

3-seed results: 1.06733 / 1.06833 / 1.06580
Mean: 1.06715, Std: 0.00104

Built by @joepro with AI agents via OpenClaw.
Compute provided by Modal.
JoeProAI added a commit to JoeProAI/parameter-golf that referenced this pull request Mar 22, 2026
het4rk added a commit to het4rk/parameter-golf that referenced this pull request Mar 22, 2026
Built on the 1.1027 BPB winner (PR openai#442) with three novel TTT improvements:

1. Importance-weighted loss: upweight hard tokens (high per-token NLL)
   using importance sampling. Focuses adaptation on where it matters most.
   weights = clamp(nll_i / mean_nll, 0.2, 5.0), normalized to mean=1.

2. Discriminative layer-wise LR: deeper layers get higher LR (0.3x-1.0x
   base). Deep layers are most task-specific, benefit most from adaptation.
   Inspired by ULMFiT discriminative fine-tuning.

3. Cosine LR decay across TTT epochs: prevents overshooting in later
   epochs. Decays from 1.0x to 0.1x over 10 epochs.

Base is PROVEN (11L, 512d, 3x MLP, GQA, SmearGate, BigramHash, EMA,
Muon, int6+zstd-22, sliding eval). Size verified at 15.75MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
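The three rules quoted in that commit can be sketched as standalone Python helpers. These are illustrative reconstructions from the description only, not the actual submission code; the function names and the layer-interpolation detail are assumptions.

```python
import math

def importance_weights(nlls, lo=0.2, hi=5.0):
    """Per-token loss weights: clamp(nll_i / mean_nll, lo, hi), renormalized to mean 1."""
    mean_nll = sum(nlls) / len(nlls)
    w = [min(max(n / mean_nll, lo), hi) for n in nlls]
    mean_w = sum(w) / len(w)
    return [wi / mean_w for wi in w]

def layerwise_lr(layer_idx, n_layers, base_lr=0.0005, lo=0.3):
    """Discriminative LR: scale from lo*base (layer 0) up to base (deepest layer)."""
    frac = layer_idx / max(n_layers - 1, 1)
    return base_lr * (lo + (1.0 - lo) * frac)

def cosine_lr_mult(epoch, total_epochs=10, lo=0.1, hi=1.0):
    """Cosine decay of the LR multiplier from hi to lo across TTT epochs."""
    t = epoch / max(total_epochs - 1, 1)
    return lo + 0.5 * (hi - lo) * (1.0 + math.cos(math.pi * t))
```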
mohosy pushed a commit to mohosy/parameter-golf that referenced this pull request Mar 23, 2026
Major rewrite based on latest meta (PRs openai#398, openai#442, openai#462):
- SwiGLU FFN with Star-ReLU (hidden=1792)
- U-Net skip connections with learned gating
- EMA (decay=0.9985) replacing SWA
- AdamW TTT (legal score-first protocol)
- Partial RoPE (16 dims)
- LN Scale (1/sqrt(layer_idx+1))
- BigramHash(8192) + SmearGate
- GPTQ-lite quantization
- DDP compile fix for multi-GPU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
- TTT: 5 epochs at lr=0.0005 (matching SOTA PR openai#442)
- use DDP model for TTT forward pass to sync gradients across GPUs
- shard validation tokens across ranks for proper distributed TTT
- batch size 4 seqs/GPU, modal timeout 1800s
amaljithkuttamath added a commit to amaljithkuttamath/parameter-golf that referenced this pull request Mar 23, 2026
Value Residual + Gated Attention on PR openai#442 stack.
Single seed (1337), 8xH100 SXM, 14.2 MB artifact.
het4rk added a commit to het4rk/parameter-golf that referenced this pull request Mar 23, 2026
Math validation proved our novel TTT (importance weighting, cosine
decay, discriminative LR) would hurt on a strong base model.
Reverted to flat-LR AdamW TTT which is proven at 1.1027 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
Takes PR #462's SwiGLU + U-Net + AdamW TTT architecture (1.0672 BPB)
and adds our proven quantization improvements:

1. GPTQ — Hessian-aware int6 with column reordering + optimal scales
2. Earlier QAT — threshold 0.15→0.5 for 3x more QAT steps
3. QAT percentile clipping — matches GPTQ export quantizer

Base architecture credit: @JoeProAI (PR #462)
AdamW TTT credit: @sjp611 (PR #442)
GPTQ integration: our contribution

Uses PyTorch native SDPA (no FA3 needed) — runs on any H100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lukacf added a commit to lukacf/parameter-golf-submission that referenced this pull request Mar 23, 2026
3-seed mean: 0.9789 BPB (sliding window stride=64)
Best seed: 0.9779 (seed 7)
Std: 0.0015

Key innovation: Autonomous ML research methodology.
AI coding agent discovered cosine LR scaling for TTT in a single
2-hour session — 7 experiments from hypothesis to record.

Technical: CosineAnnealingLR over 100 TTT epochs (3-line change).
Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lukacf added a commit to lukacf/parameter-golf-submission that referenced this pull request Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
Takes PR openai#462's SwiGLU + U-Net + AdamW TTT architecture (1.0672 BPB)
and adds our proven quantization improvements:

1. GPTQ — Hessian-aware int6 with column reordering + optimal scales
2. Earlier QAT — threshold 0.15→0.5 for 3x more QAT steps
3. QAT percentile clipping — matches GPTQ export quantizer

Base architecture credit: @JoeProAI (PR openai#462)
AdamW TTT credit: @sjp611 (PR openai#442)
GPTQ integration: our contribution

Uses PyTorch native SDPA (no FA3 needed) — runs on any H100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Contributor

As far as I can tell, this proposed TTT scheme trains on the validation set: it reports the score on a document after the weights have adapted to that document, rendering it unsound for the purposes of this competition.

RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 25, 2026
- Replace relu().square() with leaky_relu(0.5).square() in MLP
  Expected: -0.0015 BPB (5 independent teams confirm)
- Switch TTT optimizer from SGD to AdamW(lr=0.0005, wd=0, betas=0.9/0.95)
  Expected: stronger TTT adaptation per openai#442/openai#503/openai#545
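The activation swap in the first bullet is a one-line change; a toy comparison of the two activations (illustrative, not the repo's MLP code):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

relu_sq = F.relu(x).square()               # original: zero output and gradient for x < 0
leaky_sq = F.leaky_relu(x, 0.5).square()   # proposed: (0.5 * x)^2 for x < 0

# The two agree for x >= 0; the leaky variant keeps a signal on the negative side.
```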
haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
…nai#400 openai#369 openai#398)

KEY DISCOVERY: PR#414 stacks EMA + Tight SWA together (-0.0006 BPB free)
GPTQ should be per-ROW not per-matrix (-0.0006 BPB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
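"Per-row" here means one quantization scale per output row instead of a single scale for the whole matrix. A minimal numpy sketch of symmetric per-row int6 quantization, as an illustration of the idea rather than the repo's GPTQ code:

```python
import numpy as np

def quantize_int6_per_row(W):
    """Symmetric int6 quantization with one scale per row.

    A per-matrix scheme would use a single scale abs(W).max() / 31 for the
    whole matrix; per-row scales track each row's own range, which cuts
    rounding error on rows whose weights are small.
    """
    levels = 31  # symmetric int6 range: [-31, 31]
    scales = np.abs(W).max(axis=1, keepdims=True) / levels
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.rint(W / scales), -levels, levels).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales
```

Per-element reconstruction error is bounded by half a scale step, so a row of small weights no longer inherits the coarse step size set by the matrix's largest entry.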