Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414
Merged
cocohearts merged 1 commit into openai:main on Mar 23, 2026
Conversation
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 22, 2026:
Seed 1337: 81.86ms, 1.1241 bpb, 15.83MB
Seed 42: 81.88ms, 1.1253 bpb, 15.82MB
Seed 2025: 81.86ms, 1.1247 bpb, 15.80MB
Mean: 81.87ms, 1.1247 bpb
Also adds GPTQ-lite (PR openai#414's per-row optimal clip percentile search) for improved int6 quantization quality.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf on Mar 22, 2026:
Three stacked innovations:
1. QAT/GPTQ mismatch fix: the QAT STE uses a percentile clip (0.9995) matching the GPTQ-lite export, instead of row_max. The model trains against the actual quantization it will face.
2. TrigramHash(2048, dim=48): 3-token n-gram patterns, additive to BigramHash. Inspired by PR #440. ~100KB extra params.
3. EMA-SWA blend (80/20): combine exponential and uniform averaging. SWA data already collected but unused in PR #414.
Plus proven TTT burst from v1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
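Point 1 above (a straight-through estimator whose clip matches the export) can be sketched as follows. This is an illustration, not this repo's code: the 0.9995 percentile and the int6 range come from the description; the per-tensor (rather than per-row) clip and all names are assumptions.

```python
import torch

def qat_ste(w: torch.Tensor, percentile: float = 0.9995, qmax: int = 31) -> torch.Tensor:
    """Fake-quantize w with a percentile clip (not row_max), STE for gradients."""
    clip = torch.quantile(w.abs().flatten(), percentile)  # clip matches GPTQ-lite export
    scale = (clip / qmax).clamp(min=1e-12)
    q = (w / scale).round().clamp(-qmax, qmax) * scale    # int6 fake-quantized weights
    return w + (q - w).detach()                           # forward sees q; grad flows to w
```

The `.detach()` trick is the standard STE: the forward pass uses the quantized weights, while the backward pass treats the quantizer as the identity, so gradients reach the full-precision weights unchanged.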
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 22, 2026:
Legal score-first TTT (PR openai#461 recipe) on the openai#414 stack + Parameter Banking + Parallel Muon.
Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB.
Every token scored BEFORE the model adapts (inference_mode enforced). SGD+momentum, 3 epochs/32K chunk, freeze first 2 blocks, cosine LR decay.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 22, 2026:
Legal score-first TTT (PR openai#461 recipe) applied to the openai#414 stack with Parameter Banking + Parallel Muon (first introduced in PR openai#399).
Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128.
Every token scored BEFORE the model adapts (inference_mode enforced). SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
Updated CLAUDE.md and idea bank with:
- Current valid leaderboard (PR openai#414 at 1.1233 is the real leader)
- TTT legality analysis (full-val TTT ruled invalid, score-first legal)
- New techniques to adopt: GPTQ-lite, backout, U-Net skips, value residual, catalytic residuals, gated attention
- Phased experiment roadmap: parity -> zero-cost arch -> novel quant -> training
- Dead ends confirmed since openai#332: PPM-C, SwiGLU, depth recurrence
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Mar 23, 2026:
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE, LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22). Added legal score-first TTT from the PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K-token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base
Strategy shift: reproduce the proven frontier instead of iterating on our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding legal TTT should push to ~1.121.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
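The score-first protocol above can be sketched as a loop in which each chunk is scored with the current weights under `torch.inference_mode()` before the model is allowed to adapt to it. This is an illustrative sketch under assumed interfaces (a `model(chunk)` that returns a scalar loss); the block freezing and cosine decay from the recipe are omitted for brevity.

```python
import torch

def score_first_ttt(model, chunks, lr=0.002, epochs=3):
    """Score each chunk BEFORE adapting on it; return mean per-token loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:                      # e.g. 32K-token chunks, in order
        with torch.inference_mode():          # legality: score with pre-adaptation weights
            total_loss += model(chunk).item() * chunk.numel()
            total_tokens += chunk.numel()
        for _ in range(epochs):               # only now adapt on the scored chunk
            opt.zero_grad()
            model(chunk).backward()
            opt.step()
    return total_loss / total_tokens
```

The key legality property is that a chunk's contribution to the reported score is computed strictly before any gradient step touches that chunk.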
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 23, 2026
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
- PR openai#414 repro: 1.1286 BPB warm cache (5903 steps at 98ms)
- stride=32 vs stride=64: negligible (0.0001 BPB)
- VRL net negative on the openai#414 stack (+0.0012 BPB)
- catalytic residuals experiment in progress
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
- XSA on all 11 layers: 1.1268 vs baseline 1.1286 (-0.0018)
- catalytic residuals: neutral on the openai#414 stack
- backout connection: slightly negative
- stride=32: negligible gain over stride=64
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
All arch tweaks tested on the openai#414 stack:
- XSA-all: -0.0007 BPB (keeping)
- Gated attn, catalytic, VRL, backout: all neutral/negative
- Pod throughput (98-104ms vs 82ms) is the dominant bottleneck
- Moving to quantization improvements
pragnyanramtha referenced this pull request in pragnyanramtha/parameter-golf-beta on Mar 23, 2026:
Carefully implemented based on signalrush SOTA (#414, 1.1233 BPB):
- 11L: num_layers = 11 (up from 9)
- XSA4: post-attention geometric subtraction on the last 4 layers (0 new params)
- EMA: exponential moving average, decay=0.997, updated every step from step 0
- Late QAT: quantization-aware training enabled when LR < 15% threshold
- GPTQ-lite: per-row quantization with 99.99% clip percentile
EMA state is initialized before the training loop, updated after each optimizer step, and applied to the weights at the end of training. Late QAT uses a class-level flag with an inline STE.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
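The EMA schedule described above (shadow weights created before the loop, updated after every optimizer step, copied in at the end) can be sketched minimally as follows. The class shape and names are illustrative assumptions; only decay=0.997 and the update points come from the commit message.

```python
import torch

class EMA:
    """Exponential moving average of model weights (decay assumed 0.997)."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.997):
        self.decay = decay
        # shadow initialized from the model BEFORE the training loop starts
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    def update(self, model: torch.nn.Module) -> None:
        # called once after each optimizer step
        for k, v in model.state_dict().items():
            self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1 - self.decay)

    def apply_to(self, model: torch.nn.Module) -> None:
        # at the end of training, export the averaged weights
        model.load_state_dict(self.shadow)
```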
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026:
… local test script
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request on Mar 23, 2026:
- 11 layers, XSA on last 4, int6 quantization + zstd-22
- EMA(0.997), GPTQ-lite, Tight SWA, Late QAT@0.15
- Partial RoPE 16/64, LN Scale 1/sqrt(layer+1)
- SmearGate + BigramHash(2048,128), VE128 on layers 9,10
- Muon WD=0.04, momentum=0.99, matrix_lr=0.025
- SDPA fallback (no FA3), batch 786K, seq 2048
- add zstandard to Modal image
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026:
…ression moonshots falsified
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
Benchmarked FA2 vs cuDNN SDPA on H100: FA2 is 0.73ms per call; 11 layers × 2 calls ≈ 16ms of the 98ms step, so FA3 would save ~8ms at most. The gap to PR openai#414 (82ms) is likely hardware variation. Added pod setup script. Updated experiment priorities.
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request on Mar 23, 2026:
- replace relu(x)^2 with leaky_relu(x, 0.5)^2
- PR openai#493 reaches 1.1309 with a partial stack using this activation
- untried on the full openai#414 stack; could give -0.002 to -0.005 BPB
- zero param cost, zero speed overhead
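The proposed swap is a one-line change to the MLP activation (where exactly it is applied inside the block is an assumption here):

```python
import torch
import torch.nn.functional as F

def act(x: torch.Tensor) -> torch.Tensor:
    # replaces relu(x) ** 2: negative inputs now contribute (0.5 * x) ** 2
    return F.leaky_relu(x, negative_slope=0.5) ** 2
```

Unlike relu², this gives a nonzero (quadratic) response on the negative side, at zero parameter and zero speed cost.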
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 23, 2026:
…d mean) Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0 on the openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).
3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9):
Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB
Seed 42: 1.1216 bpb, 406s TTT, 15.99 MB
Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB
Mean: 1.1214 (std 0.0009)
All artifacts under 16MB. All eval times under 600s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EconLearn pushed a commit to EconLearn/parameter-golf that referenced this pull request on Mar 27, 2026:
- Based on PR openai#414 SOTA (1.1228 BPB)
- LeakyReLU(0.5)^2 activation (proven frontier technique)
- XSA on all 11 layers (PR openai#609 approach)
- Full Hessian-aware GPTQ quantization with column reordering
- Binary-search selective pruning to fit under 16MB
- Auto-detects GPU: uses FlashAttention 3 + torch.compile on H100, falls back to PyTorch SDP + no compile on T4/P100/L4
- 1500 lines exactly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MVPandey added a commit to MVPandey/parameter-golf that referenced this pull request on Mar 28, 2026
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request on Mar 29, 2026:
- technique-card-head: align-items center, flex-wrap for badge overflow
- All technique card sub-elements: explicit text-align center
- combo-chip: centered text on all child elements
- Top 3 neural leaderboard: corrected types from PR titles
  - openai#549: Neural + TTT (LeakyReLU² + Legal Score-First TTT)
  - openai#414: Neural (EMA + GPTQ-lite, no TTT)
  - openai#641: Neural, Non-Record (Binary U-Net)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
demouo added a commit to demouo/parameter-golf that referenced this pull request on Mar 30, 2026
Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15
val_bpb: 1.1233 (sliding window stride=64, 3-seed mean) | 15.55 MB (mean) | 8xH100 SXM, 600s
Key Innovations Over PR #374
GPTQ-lite: Per-Layer Optimal Clip Percentile
Instead of using the row maximum for the int6 scale, try 5 clip percentiles (0.999, 0.9995, 0.9999, 0.99999, 1.0) for each weight-matrix row and pick the one minimizing reconstruction MSE. Zero training cost.
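A minimal sketch of this search, assuming symmetric int6 quantization (range ±31); the actual export code is not shown on this page, so function and variable names are illustrative:

```python
import torch

def gptq_lite_quantize(w: torch.Tensor,
                       percentiles=(0.999, 0.9995, 0.9999, 0.99999, 1.0),
                       qmax: int = 31):
    """Per-row int6 quantization: pick the clip percentile minimizing row MSE."""
    best_err = torch.full((w.shape[0],), float("inf"))
    out = torch.empty_like(w)
    scales = torch.empty(w.shape[0])
    for p in percentiles:
        clip = torch.quantile(w.abs(), p, dim=1)          # per-row clip threshold
        scale = (clip / qmax).clamp(min=1e-12)
        q = (w / scale[:, None]).round().clamp(-qmax, qmax)
        deq = q * scale[:, None]
        err = ((deq - w) ** 2).mean(dim=1)                # per-row reconstruction MSE
        better = err < best_err                           # keep best percentile per row
        best_err = torch.where(better, err, best_err)
        out[better] = deq[better]
        scales[better] = scale[better]
    return out, scales
```

Because percentile 1.0 (the row maximum) is among the candidates, the result is never worse than the row-max baseline; the gain comes from rows where clipping a few outliers buys finer resolution for the bulk of the weights.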
Results (3 seeds, 8xH100 SXM)
Mean: 1.1233 | Std: 0.0005
Architecture
11L, 512d, 8H/4KV, MLP 3x (relu²), U-Net skips, XSA4, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), FA3, Muon WD=0.04, EMA(0.997), Tight SWA, Late QAT@0.15, int6+zstd-22.
Run Command
Test plan
🤖 Generated with Claude Code