
Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)#414

Merged
cocohearts merged 1 commit into openai:main from signalrush:submission/ema-gptqlite-1.1233
Mar 23, 2026

Conversation

@signalrush
Contributor

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15

val_bpb: 1.1233 (sliding window stride=64, 3-seed mean) | 15.55 MB (mean) | 8xH100 SXM, 600s

Key Innovations Over PR #374

| Change | PR #374 | This PR | Impact |
|---|---|---|---|
| GPTQ-lite | Fixed clip (row max) | 5 clip percentiles per row, pick min MSE | -0.0006 BPB |
| EMA (decay=0.997) | None (Tight SWA only) | EMA every step | -0.0006 BPB |
| Warmdown | 3000 | 3500 | -0.0002 BPB |
| Late QAT threshold | 0.1 | 0.15 | -0.0001 BPB |
| **Total** | 1.1246 | 1.1233 | **-0.0013 BPB** |

GPTQ-lite: Per-Layer Optimal Clip Percentile

Instead of using the row maximum as the int6 scale, try 5 clip percentiles (0.999, 0.9995, 0.9999, 0.99999, 1.0) for each weight-matrix row and pick the one that minimizes reconstruction MSE. Zero training cost.
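The search above can be sketched roughly as follows; this is a minimal illustration assuming a symmetric int6 layout with `QMAX = 31`, and the function names are placeholders, not this PR's actual code:

```python
import numpy as np

# Candidate percentiles are the five listed in the PR text.
CANDIDATES = [0.999, 0.9995, 0.9999, 0.99999, 1.0]
QMAX = 31  # symmetric int6 range, an assumption for this sketch

def quantize_row(row: np.ndarray, clip: float) -> np.ndarray:
    """Quantize one weight row to int6 with the given clip value, then dequantize."""
    scale = clip / QMAX if clip > 0 else 1.0
    q = np.clip(np.round(row / scale), -QMAX, QMAX)
    return q * scale

def best_clip_per_row(W: np.ndarray) -> np.ndarray:
    """For each row, keep the reconstruction from the percentile with lowest MSE."""
    out = np.empty_like(W)
    for i, row in enumerate(W):
        abs_row = np.abs(row)
        best_mse, best_rec = np.inf, row
        for p in CANDIDATES:
            rec = quantize_row(row, np.quantile(abs_row, p))
            mse = np.mean((rec - row) ** 2)
            if mse < best_mse:
                best_mse, best_rec = mse, rec
        out[i] = best_rec
    return out
```

Because the candidate set includes 1.0 (the row max), this search can never do worse than the fixed row-max clip it replaces.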

Results (3 seeds, 8xH100 SXM)

| Seed | Steps | val_loss | Sliding BPB (s64) | Artifact |
|---|---|---|---|---|
| 1337 | 7101 | 1.8958 | 1.1228 | 15.56 MB |
| 42 | ~7100 | 1.8972 | 1.1236 | 15.54 MB |
| 2024 | ~7100 | 1.8971 | 1.1236 | 15.59 MB |

Mean: 1.1233 | Std: 0.0005

Architecture

11L, 512d, 8H/4KV, MLP 3x (relu²), U-Net skips, XSA4, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), FA3, Muon WD=0.04, EMA(0.997), Tight SWA, Late QAT@0.15, int6+zstd-22.
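The EMA(0.997) component in the stack above amounts to an exponential moving average of the weights updated after every optimizer step. A minimal sketch, with illustrative names rather than this PR's code:

```python
# Per-step EMA of model weights with the decay named in the PR (0.997).
DECAY = 0.997

def ema_update(ema_params: dict, model_params: dict, decay: float = DECAY) -> dict:
    """Blend current weights into the EMA shadow copy after each optimizer step."""
    for k in ema_params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * model_params[k]
    return ema_params
```

At the end of training the EMA copy, not the live weights, is what gets evaluated and exported.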

Run Command

```
SEED=1337 bash eval/eval.sh
```

Test plan

  • All 3 seeds under 16MB
  • All 3 seeds train in 600s on 8xH100
  • Post-quant roundtrip verified
  • Sliding window eval (stride=64) consistent across seeds (std=0.0005)
  • train_gpt.py under 1500 lines (1402)
  • No TTT on validation data

🤖 Generated with Claude Code

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 22, 2026
Seed 1337: 81.86ms, 1.1241 bpb, 15.83MB
Seed 42: 81.88ms, 1.1253 bpb, 15.82MB
Seed 2025: 81.86ms, 1.1247 bpb, 15.80MB
Mean: 81.87ms, 1.1247 bpb

Also adds GPTQ-lite (PR openai#414's per-row optimal clip percentile search)
for improved int6 quantization quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 22, 2026
Three stacked innovations:
1. QAT/GPTQ mismatch fix: QAT STE uses percentile clip (0.9995)
   matching GPTQ-lite export, instead of row_max. Model trains
   against the actual quantization it'll face.
2. TrigramHash(2048, dim=48): 3-token n-gram patterns, additive
   to BigramHash. Inspired by PR #440. ~100KB extra params.
3. EMA-SWA blend (80/20): combine exponential and uniform averaging.
   SWA data already collected but unused in PR #414.

Plus proven TTT burst from v1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
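The QAT/GPTQ mismatch fix in point 1 above can be sketched as a fake-quant forward pass whose clip comes from the 0.9995 percentile rather than the row max, so training sees the same quantizer the GPTQ-lite export applies. In training the backward would be identity (straight-through estimator). Names and the int6 range here are assumptions for illustration, not the commit's actual code:

```python
import numpy as np

QMAX = 31  # symmetric int6 range, an assumption for this sketch

def fake_quant_percentile(W: np.ndarray, pct: float = 0.9995) -> np.ndarray:
    """Per-row fake quantization with a percentile clip instead of row max."""
    clip = np.quantile(np.abs(W), pct, axis=1, keepdims=True)
    scale = np.where(clip > 0, clip / QMAX, 1.0)
    q = np.clip(np.round(W / scale), -QMAX, QMAX)
    return q * scale  # forward value; STE passes gradients straight through
```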
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 22, 2026
Legal score-first TTT (PR openai#461 recipe) on openai#414 stack + Parameter Banking + Parallel Muon.
Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB.

Every token scored BEFORE model adapts (inference_mode enforced).
SGD+momentum, 3 epochs/32K chunk, freeze first 2 blocks, cosine LR decay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
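The score-first ordering this commit enforces (every token scored before the model adapts) can be sketched as a simple loop; `score` and `adapt` are placeholders standing in for the inference-mode eval and the SGD adaptation step:

```python
# Score-first TTT loop: chunk i is always scored with weights that have
# never seen chunk i, then the model trains on that chunk.
def score_first_ttt(chunks, score, adapt):
    """Return per-chunk losses where each chunk is scored before adaptation."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # evaluate first (inference mode)
        adapt(chunk)                 # then train on the chunk just scored
    return losses
```

The legality argument rests entirely on this ordering: no reported score depends on having trained on the tokens being scored.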
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 22, 2026
Legal score-first TTT (PR openai#461 recipe) applied to openai#414 stack with
Parameter Banking + Parallel Muon (first introduced in PR openai#399).

Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s.
Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128.

Every token scored BEFORE model adapts (inference_mode enforced).
SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
Updated CLAUDE.md and idea bank with:
- Current valid leaderboard (PR openai#414 at 1.1233 is the real leader)
- TTT legality analysis (full-val TTT ruled invalid, score-first legal)
- New techniques to adopt: GPTQ-lite, backout, U-Net skips, value residual,
  catalytic residuals, gated attention
- Phased experiment roadmap: parity -> zero-cost arch -> novel quant -> training
- Dead ends confirmed since openai#332: PPM-C, SwiGLU, depth recurrence
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 23, 2026
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE,
LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22).

Added legal score-first TTT from PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base

Strategy shift: reproduce proven frontier instead of iterating on
our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding
legal TTT should push to ~1.121.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 23, 2026
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
- PR openai#414 repro: 1.1286 BPB warm cache (5903 steps at 98ms)
- stride=32 vs stride=64: negligible (0.0001 BPB)
- VRL net negative on openai#414 stack (+0.0012 BPB)
- catalytic residuals experiment in progress
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
- XSA on all 11 layers: 1.1268 vs baseline 1.1286 (-0.0018)
- catalytic residuals: neutral on openai#414 stack
- backout connection: slightly negative
- stride=32: negligible gain over stride=64
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
All arch tweaks tested on openai#414 stack:
- XSA-all: -0.0007 BPB (keeping)
- Gated attn, catalytic, VRL, backout: all neutral/negative
- Pod throughput (98-104ms vs 82ms) is the dominant bottleneck
- Moving to quantization improvements
pragnyanramtha referenced this pull request in pragnyanramtha/parameter-golf-beta Mar 23, 2026
Carefully implemented based on signalrush SOTA (#414, 1.1233 BPB):
- 11L: num_layers = 11 (up from 9)
- XSA4: post-attention geometric subtraction on last 4 layers (0 new params)
- EMA: exponential moving average decay=0.997, updated every step from step 0
- Late QAT: quantization-aware training enabled when LR < 15% threshold
- GPTQ-lite: per-row quantization with 99.99% clip percentile

EMA state initialized before training loop, updated after each optimizer step,
applied to weights at end of training. Late QAT uses class-level flag with inline STE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
- 11 layers, XSA on last 4, int6 quantization + zstd-22
- EMA(0.997), GPTQ-lite, Tight SWA, Late QAT@0.15
- Partial RoPE 16/64, LN Scale 1/sqrt(layer+1)
- SmearGate + BigramHash(2048,128), VE128 on layers 9,10
- Muon WD=0.04, momentum=0.99, matrix_lr=0.025
- SDPA fallback (no FA3), batch 786K, seq 2048
- add zstandard to Modal image
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
Benchmarked FA2 vs cuDNN SDPA on H100: FA2 is 0.73ms per call.
11 layers x 2 = 16ms of 98ms step. FA3 would save ~8ms max.
Gap to PR openai#414 (82ms) is likely hardware variation.
Added pod setup script. Updated experiment priorities.
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
- replace relu(x)^2 with leaky_relu(x, 0.5)^2
- PR openai#493 reaches 1.1309 with partial stack using this activation
- untried on full openai#414 stack — could give -0.002 to -0.005 BPB
- zero param cost, zero speed overhead
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
…d mean)

Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0
on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9):
  Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB
  Seed 42:   1.1216 bpb, 406s TTT, 15.99 MB
  Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB
  Mean:      1.1214 (std 0.0009)

All artifacts under 16MB. All eval times under 600s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EconLearn pushed a commit to EconLearn/parameter-golf that referenced this pull request Mar 27, 2026
- Based on PR openai#414 SOTA (1.1228 BPB)
- LeakyReLU(0.5)^2 activation (proven frontier technique)
- XSA on all 11 layers (PR openai#609 approach)
- Full Hessian-aware GPTQ quantization with column reordering
- Binary-search selective pruning to fit under 16MB
- Auto-detects GPU: uses FlashAttention 3 + torch.compile on H100,
  falls back to PyTorch SDP + no compile on T4/P100/L4
- 1500 lines exactly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MVPandey added a commit to MVPandey/parameter-golf that referenced this pull request Mar 28, 2026
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Mar 29, 2026
- technique-card-head: align-items center, flex-wrap for badge overflow
- All technique card sub-elements: explicit text-align center
- combo-chip: centered text on all child elements
- Top 3 neural leaderboard: corrected types from PR titles
  - openai#549: Neural + TTT (LeakyReLU² + Legal Score-First TTT)
  - openai#414: Neural (EMA + GPTQ-lite, no TTT)
  - openai#641: Neural, Non-Record (Binary U-Net)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
demouo added a commit to demouo/parameter-golf that referenced this pull request Mar 30, 2026