Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414
Merged
cocohearts merged 1 commit into openai:main on Mar 23, 2026
Conversation
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 22, 2026:
Seed 1337: 81.86ms, 1.1241 bpb, 15.83MB
Seed 42: 81.88ms, 1.1253 bpb, 15.82MB
Seed 2025: 81.86ms, 1.1247 bpb, 15.80MB
Mean: 81.87ms, 1.1247 bpb
Also adds GPTQ-lite (PR openai#414's per-row optimal clip percentile search) for improved int6 quantization quality.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf on Mar 22, 2026:
Three stacked innovations:
1. QAT/GPTQ mismatch fix: the QAT STE uses a percentile clip (0.9995) matching the GPTQ-lite export, instead of row_max. The model trains against the actual quantization it will face.
2. TrigramHash(2048, dim=48): 3-token n-gram patterns, additive to BigramHash. Inspired by PR #440. ~100KB extra params.
3. EMA-SWA blend (80/20): combine exponential and uniform averaging. SWA data already collected but unused in PR #414.
Plus proven TTT burst from v1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
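Point 1 above (a straight-through estimator whose clip matches the export) can be sketched as follows. This is an illustration, not this repo's code: the 0.9995 percentile and the int6 range come from the description; the per-tensor (rather than per-row) clip and all names are assumptions.

```python
import torch

def qat_ste(w: torch.Tensor, percentile: float = 0.9995, qmax: int = 31) -> torch.Tensor:
    """Fake-quantize w with a percentile clip (not row_max), STE for gradients."""
    clip = torch.quantile(w.abs().flatten(), percentile)  # clip matches GPTQ-lite export
    scale = (clip / qmax).clamp(min=1e-12)
    q = (w / scale).round().clamp(-qmax, qmax) * scale    # int6 fake-quantized weights
    return w + (q - w).detach()                           # forward sees q; grad flows to w
```

The `.detach()` trick is the standard STE: the forward pass uses the quantized weights, while the backward pass treats the quantizer as the identity, so gradients reach the full-precision weights unchanged.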
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 22, 2026:
Legal score-first TTT (PR openai#461 recipe) on the openai#414 stack + Parameter Banking + Parallel Muon.
Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB.
Every token scored BEFORE the model adapts (inference_mode enforced). SGD+momentum, 3 epochs/32K chunk, freeze first 2 blocks, cosine LR decay.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 22, 2026:
Legal score-first TTT (PR openai#461 recipe) applied to the openai#414 stack with Parameter Banking + Parallel Muon (first introduced in PR openai#399).
Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128.
Every token scored BEFORE the model adapts (inference_mode enforced). SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
Updated CLAUDE.md and idea bank with:
- Current valid leaderboard (PR openai#414 at 1.1233 is the real leader)
- TTT legality analysis (full-val TTT ruled invalid, score-first legal)
- New techniques to adopt: GPTQ-lite, backout, U-Net skips, value residual, catalytic residuals, gated attention
- Phased experiment roadmap: parity -> zero-cost arch -> novel quant -> training
- Dead ends confirmed since openai#332: PPM-C, SwiGLU, depth recurrence
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Mar 23, 2026:
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE, LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22). Added legal score-first TTT from the PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K-token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base
Strategy shift: reproduce the proven frontier instead of iterating on our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding legal TTT should push to ~1.121.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
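The score-first protocol above can be sketched as a loop in which each chunk is scored with the current weights under `torch.inference_mode()` before the model is allowed to adapt to it. This is an illustrative sketch under assumed interfaces (a `model(chunk)` that returns a scalar loss); the block freezing and cosine decay from the recipe are omitted for brevity.

```python
import torch

def score_first_ttt(model, chunks, lr=0.002, epochs=3):
    """Score each chunk BEFORE adapting on it; return mean per-token loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:                      # e.g. 32K-token chunks, in order
        with torch.inference_mode():          # legality: score with pre-adaptation weights
            total_loss += model(chunk).item() * chunk.numel()
            total_tokens += chunk.numel()
        for _ in range(epochs):               # only now adapt on the scored chunk
            opt.zero_grad()
            model(chunk).backward()
            opt.step()
    return total_loss / total_tokens
```

The key legality property is that a chunk's contribution to the reported score is computed strictly before any gradient step touches that chunk.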
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 23, 2026
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
- PR openai#414 repro: 1.1286 BPB warm cache (5903 steps at 98ms)
- stride=32 vs stride=64: negligible (0.0001 BPB)
- VRL net negative on the openai#414 stack (+0.0012 BPB)
- catalytic residuals experiment in progress
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
- XSA on all 11 layers: 1.1268 vs baseline 1.1286 (-0.0018)
- catalytic residuals: neutral on the openai#414 stack
- backout connection: slightly negative
- stride=32: negligible gain over stride=64
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
All arch tweaks tested on the openai#414 stack:
- XSA-all: -0.0007 BPB (keeping)
- Gated attn, catalytic, VRL, backout: all neutral/negative
- Pod throughput (98-104ms vs 82ms) is the dominant bottleneck
- Moving to quantization improvements
pragnyanramtha referenced this pull request in pragnyanramtha/parameter-golf-beta on Mar 23, 2026:
Carefully implemented based on signalrush SOTA (#414, 1.1233 BPB):
- 11L: num_layers = 11 (up from 9)
- XSA4: post-attention geometric subtraction on the last 4 layers (0 new params)
- EMA: exponential moving average, decay=0.997, updated every step from step 0
- Late QAT: quantization-aware training enabled when LR < 15% threshold
- GPTQ-lite: per-row quantization with 99.99% clip percentile
EMA state is initialized before the training loop, updated after each optimizer step, and applied to the weights at the end of training. Late QAT uses a class-level flag with an inline STE.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
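The EMA schedule described above (shadow weights created before the loop, updated after every optimizer step, copied in at the end) can be sketched minimally as follows. The class shape and names are illustrative assumptions; only decay=0.997 and the update points come from the commit message.

```python
import torch

class EMA:
    """Exponential moving average of model weights (decay assumed 0.997)."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.997):
        self.decay = decay
        # shadow initialized from the model BEFORE the training loop starts
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    def update(self, model: torch.nn.Module) -> None:
        # called once after each optimizer step
        for k, v in model.state_dict().items():
            self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1 - self.decay)

    def apply_to(self, model: torch.nn.Module) -> None:
        # at the end of training, export the averaged weights
        model.load_state_dict(self.shadow)
```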
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026:
… local test script
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request on Mar 23, 2026:
- 11 layers, XSA on last 4, int6 quantization + zstd-22
- EMA(0.997), GPTQ-lite, Tight SWA, Late QAT@0.15
- Partial RoPE 16/64, LN Scale 1/sqrt(layer+1)
- SmearGate + BigramHash(2048,128), VE128 on layers 9,10
- Muon WD=0.04, momentum=0.99, matrix_lr=0.025
- SDPA fallback (no FA3), batch 786K, seq 2048
- add zstandard to Modal image
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026:
…ression moonshots falsified
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026:
Benchmarked FA2 vs cuDNN SDPA on H100: FA2 is 0.73ms per call; 11 layers × 2 calls ≈ 16ms of the 98ms step, so FA3 would save ~8ms at most. The gap to PR openai#414 (82ms) is likely hardware variation. Added pod setup script. Updated experiment priorities.
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request on Mar 23, 2026:
- replace relu(x)^2 with leaky_relu(x, 0.5)^2
- PR openai#493 reaches 1.1309 with a partial stack using this activation
- untried on the full openai#414 stack; could give -0.002 to -0.005 BPB
- zero param cost, zero speed overhead
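The proposed swap is a one-line change to the MLP activation (where exactly it is applied inside the block is an assumption here):

```python
import torch
import torch.nn.functional as F

def act(x: torch.Tensor) -> torch.Tensor:
    # replaces relu(x) ** 2: negative inputs now contribute (0.5 * x) ** 2
    return F.leaky_relu(x, negative_slope=0.5) ** 2
```

Unlike relu², this gives a nonzero (quadratic) response on the negative side, at zero parameter and zero speed cost.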
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 23, 2026:
…d mean) Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0 on the openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).
3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9):
Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB
Seed 42: 1.1216 bpb, 406s TTT, 15.99 MB
Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB
Mean: 1.1214 (std 0.0009)
All artifacts under 16MB. All eval times under 600s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EconLearn pushed a commit to EconLearn/parameter-golf that referenced this pull request on Mar 27, 2026:
- Based on PR openai#414 SOTA (1.1228 BPB)
- LeakyReLU(0.5)^2 activation (proven frontier technique)
- XSA on all 11 layers (PR openai#609 approach)
- Full Hessian-aware GPTQ quantization with column reordering
- Binary-search selective pruning to fit under 16MB
- Auto-detects GPU: uses FlashAttention 3 + torch.compile on H100, falls back to PyTorch SDP + no compile on T4/P100/L4
- 1500 lines exactly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MVPandey added a commit to MVPandey/parameter-golf that referenced this pull request on Mar 28, 2026
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request on Mar 29, 2026:
- technique-card-head: align-items center, flex-wrap for badge overflow
- All technique card sub-elements: explicit text-align center
- combo-chip: centered text on all child elements
- Top 3 neural leaderboard: corrected types from PR titles
  - openai#549: Neural + TTT (LeakyReLU² + Legal Score-First TTT)
  - openai#414: Neural (EMA + GPTQ-lite, no TTT)
  - openai#641: Neural, Non-Record (Binary U-Net)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
demouo added a commit to demouo/parameter-golf that referenced this pull request on Mar 30, 2026
Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15
val_bpb: 1.1233 (sliding window stride=64, 3-seed mean) | 15.55 MB (mean) | 8xH100 SXM, 600s
Key Innovations Over PR #374
GPTQ-lite: Per-Layer Optimal Clip Percentile
Instead of using the row maximum for the int6 scale, try 5 clip percentiles (0.999, 0.9995, 0.9999, 0.99999, 1.0) for each weight-matrix row and pick the one minimizing reconstruction MSE. Zero training cost.
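A minimal sketch of this search, assuming symmetric int6 quantization (range ±31); the actual export code is not shown on this page, so function and variable names are illustrative:

```python
import torch

def gptq_lite_quantize(w: torch.Tensor,
                       percentiles=(0.999, 0.9995, 0.9999, 0.99999, 1.0),
                       qmax: int = 31):
    """Per-row int6 quantization: pick the clip percentile minimizing row MSE."""
    best_err = torch.full((w.shape[0],), float("inf"))
    out = torch.empty_like(w)
    scales = torch.empty(w.shape[0])
    for p in percentiles:
        clip = torch.quantile(w.abs(), p, dim=1)          # per-row clip threshold
        scale = (clip / qmax).clamp(min=1e-12)
        q = (w / scale[:, None]).round().clamp(-qmax, qmax)
        deq = q * scale[:, None]
        err = ((deq - w) ** 2).mean(dim=1)                # per-row reconstruction MSE
        better = err < best_err                           # keep best percentile per row
        best_err = torch.where(better, err, best_err)
        out[better] = deq[better]
        scales[better] = scale[better]
    return out, scales
```

Because percentile 1.0 (the row maximum) is among the candidates, the result is never worse than the row-max baseline; the gain comes from rows where clipping a few outliers buys finer resolution for the bulk of the weights.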
Results (3 seeds, 8xH100 SXM)
Mean: 1.1233 | Std: 0.0005
Architecture
11L, 512d, 8H/4KV, MLP 3x (relu²), U-Net skips, XSA4, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), FA3, Muon WD=0.04, EMA(0.997), Tight SWA, Late QAT@0.15, int6+zstd-22.
Run Command
Test plan
🤖 Generated with Claude Code