
Record: 11L LeakyReLU² + Full GPTQ + QAT Alignment (val_bpb: 1.1204) #535

Closed

raahilshah wants to merge 1 commit into openai:main from
raahilshah:submission/2026-03-23_11L_LeakyReLU_GPTQ_QATalign

Conversation

@raahilshah

Record: 11L LeakyReLU² + Full GPTQ + QAT Alignment

val_bpb: 1.1204 (3-seed mean, std 0.0001) | 15.85 MB max artifact | 8xH100 SXM, 600s

Key Innovations

  1. LeakyReLU(0.5)² replacing relu² — prevents dead neurons, doubles effective MLP capacity (-0.0015 BPB)
  2. Full GPTQ with Hessian calibration — 31% quant gap reduction vs percentile search (-0.0026 BPB)
  3. QAT-export alignment — quantile(0.9995) clipping matches STE to export quantizer (-0.0005 BPB)
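The activation change in (1) is small in code. A minimal sketch, assuming a PyTorch MLP and the straightforward reading of "LeakyReLU(0.5)²" as squaring the leaky output (a sign-preserving variant, y·|y|, is another possible interpretation not confirmed by this PR):

```python
import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Baseline activation: zero output and zero gradient for x < 0,
    # so a neuron whose pre-activations go negative can die.
    return F.relu(x) ** 2

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # LeakyReLU(0.5) before squaring: negative pre-activations keep a
    # nonzero gradient while the squaring preserves the gating shape.
    y = F.leaky_relu(x, negative_slope=slope)
    return y * y
```

Gradient check at x = -1: relu² gives d/dx = 0, while the leaky variant gives d/dx (0.25x²) = 0.5x = -0.5, which is the "prevents dead neurons" claim in miniature.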

Results (3 seeds, 8xH100 SXM)

Seed   val_loss   Sliding BPB (s64)   Artifact
7      1.8915     1.1203              15,762,694 bytes
314    1.8919     1.1205              15,732,473 bytes
2024   1.8917     1.1204              15,851,228 bytes

Mean: 1.1204 | Std: 0.0001

Architecture

11L, 512d, 8H/4KV (GQA), MLP 3x with LeakyReLU(0.5)², XSA4, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), U-Net skips, EMA(0.997), Tight SWA, Late QAT@0.15, Full GPTQ (Hessian-aware int6), zstd-22, FA3 Hopper.
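The "Late QAT@0.15" with the quantile(0.9995) clipping from innovation (3) can be sketched as a fake-quantizer with a straight-through estimator. This is an illustrative reconstruction, not the PR's code; the `fake_quant_int6` name and the symmetric ±31 grid are assumptions consistent with the int6 export described above:

```python
import torch

def fake_quant_int6(w: torch.Tensor, q: float = 0.9995) -> torch.Tensor:
    # Clip range taken from the 0.9995 quantile of |w|, so the QAT
    # fake-quantizer sees the same range as the export quantizer
    # (the "QAT-export alignment" above).
    clip = torch.quantile(w.detach().abs().float().flatten(), q)
    scale = clip / 31.0                     # symmetric int6: levels -31..31
    wq = torch.clamp(torch.round(w / scale), -31, 31) * scale
    # Straight-through estimator: forward pass uses quantized weights,
    # backward pass treats quantization as the identity.
    return w + (wq - w).detach()
```

Because the clip threshold matches what the export path will use, the network trains against the exact quantization noise it will see after export, which is where the quoted -0.0005 BPB comes from.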

Verification

  • 3 seeds, all train ≤600s on 8xH100 SXM
  • All artifacts ≤16,000,000 bytes (max: 15,851,228)
  • No TTT on validation data
  • No network calls during evaluation
  • Sliding window eval stride=64, consistent across seeds (std=0.0001)
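The stride-64 sliding-window evaluation can be sketched as follows. This is a hypothetical harness (the `model_nll` callback and the window size are assumptions, not the PR's actual eval code): score each 64-token stride with the preceding tokens as context, sum natural-log NLL, and convert to bits per byte.

```python
import math

def sliding_window_bpb(model_nll, tokens, n_bytes, window=1024, stride=64):
    # model_nll(chunk, n_scored) -> summed natural-log NLL over the last
    # n_scored tokens of chunk, with the rest of the chunk as context.
    total_nll = 0.0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        total_nll += model_nll(tokens[ctx_start:end], end - start)
    # Bits-per-byte: convert nats to bits, normalise by raw byte count.
    return total_nll / (math.log(2) * n_bytes)
```

Fixing the stride (and the window) across seeds is what makes the 0.0001 std between seeds meaningful, since a different stride changes the score independent of the model.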

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
PR openai#535 is at 1.1204 with LeakyReLU2 + Full GPTQ + QAT alignment.
Starting from their code instead of building from our openai#414 base.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
6647 steps at 89ms with FA3 Hopper. 15.62MB fits.
Beats merged leader openai#414 (1.1228) and unmerged openai#535 (1.1204).
3-seed validation starting.
amaljithkuttamath added a commit to amaljithkuttamath/parameter-golf that referenced this pull request Mar 23, 2026
Pre-eval TTT was non-compliant per issue openai#402. Now uses
score-first TTT: score each chunk before training on it.
Added LeakyReLU(0.5)² replacing relu² (proven by openai#569, openai#535).
Score pending rerun with compute credits.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 23, 2026
- gptq_calibrate(): collect Hessian H=X^TX via forward hooks on training data
- gptq_quantize_weight(): column-wise int6 with Cholesky error compensation
- _find_best_row_scales(): percentile search for optimal per-row scales
- Integrated into mixed_quantize_int6() — falls back to naive when no Hessian
- Expected: -0.0026 bpb from better quantization alone (PR openai#535 ablation)
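The functions listed above follow the standard GPTQ recipe (Frantar et al.): accumulate H = XᵀX from calibration activations, then quantize each weight column and spread its rounding error over the not-yet-quantized columns via the Cholesky factor of H⁻¹. A minimal sketch under that assumption (function names mirror the bullets above; the damping constant and per-row max scaling are illustrative choices, not this commit's):

```python
import torch

def gptq_calibrate(layer: torch.nn.Linear, batches) -> torch.Tensor:
    # Accumulate H = X^T X over calibration batches via a forward hook.
    n = layer.weight.shape[1]
    H = torch.zeros(n, n)
    handle = layer.register_forward_hook(
        lambda _, inp, __: H.add_(inp[0].reshape(-1, n).T @ inp[0].reshape(-1, n))
    )
    with torch.no_grad():
        for x in batches:
            layer(x)
    handle.remove()
    return H

def gptq_quantize_weight(W, H, bits=6, damp=0.01):
    # Column-wise quantization with error compensation (GPTQ, no blocking).
    n = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(n)  # damping for stability
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv).T      # upper Cholesky factor of H^-1
    qmax = 2 ** (bits - 1) - 1             # int6 symmetric grid: -31..31
    scale = (W.abs().max(dim=1, keepdim=True).values / qmax).squeeze(1)
    W, Q = W.clone(), torch.zeros_like(W)
    for j in range(n):
        q = torch.clamp(torch.round(W[:, j] / scale), -qmax, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]
        # Push the rounding error onto the columns not yet quantized.
        W[:, j:] -= err.unsqueeze(1) * U[j, j:].unsqueeze(0)
    return Q
```

The Hessian damping term is also the usual guard against the Cholesky instability that later commits in this thread ran into.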

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 24, 2026
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb
from this change. LeakyReLU preserves gradient flow through negative
pre-activations while maintaining the sparsity/gating benefits of
squaring. At 22M params, dead neurons from hard ReLU are expensive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
openai#569 VRL produces 16.0-16.2MB models (over budget).
openai#535+XSA-all reliably fits at 15.6MB with 1.1188 BPB.
3-seed validation starting.
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 24, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: Our GPTQ implementation (ported from PR openai#535) produced
WORSE quantization than standard per-row int6. PR openai#486 base doesn't
use GPTQ at all. Possible issues: bad Hessian calibration, numerical
instability in Cholesky decomposition, or name mismatch between
hooks and state dict keys.

Fix: Disable GPTQ, revert to standard quantization path.
GPTQ code preserved for future debugging.

Also confirmed: TTT bpb formula is algebraically correct.
The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@raahilshah
Author

Ruled invalid: GPTQ calibration was run on training data during the eval phase. Closing in favour of newer PRs that perform GPTQ calibration during training.

@0hq am I correct in understanding that GPTQ calibration on fineweb_train_* if done at training time within the 600s training budget is acceptable?
