Record: 11L LeakyReLU² + Full GPTQ + QAT Alignment (val_bpb: 1.1204) #535
Closed
raahilshah wants to merge 1 commit into openai:main from
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026

PR openai#535 is at 1.1204 with LeakyReLU2 + Full GPTQ + QAT alignment. Starting from their code instead of building from our openai#414 base.
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026

6647 steps at 89ms with FA3 Hopper. 15.62MB fits. Beats merged leader openai#414 (1.1228) and unmerged openai#535 (1.1204). 3-seed validation starting.
amaljithkuttamath added a commit to amaljithkuttamath/parameter-golf that referenced this pull request on Mar 23, 2026

Pre-eval TTT was non-compliant per issue openai#402. Now uses score-first TTT: score each chunk before training on it. Added LeakyReLU(0.5)² replacing relu² (proven by openai#569, openai#535). Score pending rerun with compute credits.
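The score-first ordering this commit describes can be sketched in a few lines. A toy linear model with squared error stands in for the real model and the cross-entropy behind bpb; every name here (`score_first_ttt`, `chunks`, `lr`) is illustrative, not the repo's API:

```python
import numpy as np

def score_first_ttt(w, chunks, lr=0.01):
    """Score-first test-time training (sketch): each chunk is scored
    BEFORE any gradient step on it, so the reported loss never comes
    from weights that have already seen that chunk."""
    total_loss, total_tokens = 0.0, 0
    for X, y in chunks:                      # X: (n, d), y: (n,)
        pred = X @ w                         # 1) score first, no update yet
        total_loss += float(np.sum((pred - y) ** 2))
        total_tokens += len(y)
        grad = 2.0 * X.T @ (pred - y) / len(y)
        w = w - lr * grad                    # 2) only then train on the chunk
    return total_loss / total_tokens, w
```

Evaluating before updating is the compliance point: swapping the two steps inside the loop reproduces the non-compliant pre-eval TTT.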
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 23, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on Mar 23, 2026

- gptq_calibrate(): collect Hessian H = X^T X via forward hooks on training data
- gptq_quantize_weight(): column-wise int6 with Cholesky error compensation
- _find_best_row_scales(): percentile search for optimal per-row scales
- Integrated into mixed_quantize_int6(); falls back to naive when no Hessian
- Expected: -0.0026 bpb from better quantization alone (PR openai#535 ablation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
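A minimal NumPy sketch of the pipeline this commit describes: Hessian calibration followed by column-wise quantization with error compensation. It deliberately simplifies the commit's design (the inverse Hessian is used directly rather than the Cholesky formulation, and plain max-abs per-row scales replace the percentile search), so treat it as an illustration of the idea, not the commit's code:

```python
import numpy as np

def calibrate_hessian(batches):
    # Accumulate H = X^T X over calibration activations; this is the role
    # the forward hooks play in gptq_calibrate().
    d = batches[0].shape[1]
    H = np.zeros((d, d))
    for X in batches:
        H += X.T @ X
    return H

def gptq_quantize(W, H, bits=6, damp=0.01):
    # Column-wise GPTQ-style quantization: quantize one input column at a
    # time, then spread its rounding error over the remaining columns,
    # weighted by the inverse Hessian.
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    H = H + damp * np.mean(np.diag(H)) * np.eye(d)     # damping for stability
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1                          # int6 grid: [-32, 31]
    scale = np.max(np.abs(W), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)          # guard all-zero rows
    Q = np.zeros_like(W)
    for j in range(d):
        q = np.clip(np.round(W[:, j:j + 1] / scale), -qmax - 1, qmax)
        Q[:, j:j + 1] = q
        err = (W[:, j:j + 1] - q * scale) / Hinv[j, j]
        W[:, j + 1:] -= err @ Hinv[j:j + 1, j + 1:]     # compensate forward
    return Q, scale                                     # dequantize: Q * scale
```

The damping term is the usual safeguard against the Cholesky/inversion instability mentioned later in this thread; without it a near-singular calibration Hessian can make the compensation updates explode.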
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Mar 24, 2026

Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb from this change. LeakyReLU preserves gradient flow through negative pre-activations while maintaining the sparsity/gating benefits of squaring. At 22M params, dead neurons from hard ReLU are expensive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
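The activation itself is small enough to sketch. Reading "LeakyReLU(0.5)²" as squaring the LeakyReLU output (an assumption about the PR's exact definition), the point of the commit message is that the gradient stays nonzero for negative pre-activations:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(slope) followed by squaring. Unlike relu(x)**2, whose
    # gradient is exactly zero for x <= 0 (a dead neuron), the gradient
    # here is 2 * slope**2 * x for negative x, so signal keeps flowing.
    y = np.where(x > 0.0, x, slope * x)
    return y * y
```

The squared output is still nonnegative, so the sparsity/gating character of relu² is retained; only the dead-gradient region is removed.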
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 24, 2026

openai#569 VRL produces 16.0-16.2MB models (over budget). openai#535+XSA-all reliably fits at 15.6MB with 1.1188 bpb. 3-seed validation starting.
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 24, 2026

…x LeakyReLU author links
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on Mar 24, 2026

Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: Our GPTQ implementation (ported from PR openai#535) produced WORSE quantization than standard per-row int6. PR openai#486 base doesn't use GPTQ at all. Possible issues: bad Hessian calibration, numerical instability in Cholesky decomposition, or name mismatch between hooks and state dict keys.

Fix: Disable GPTQ, revert to standard quantization path. GPTQ code preserved for future debugging.

Also confirmed: TTT bpb formula is algebraically correct. The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
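For contrast, the "standard quantization path" that commit falls back to is straightforward. A hedged sketch (not the repo's actual mixed_quantize_int6) of per-row symmetric int6:

```python
import numpy as np

def per_row_int6(W):
    # One max-abs scale per output row, no Hessian calibration to go wrong:
    # worst-case error is half a quantization step per element.
    qmax = 31                                   # int6 range is [-32, 31]
    scale = np.max(np.abs(W), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    Q = np.clip(np.round(W / scale), -32, 31).astype(np.int8)
    return Q, scale                             # dequantize as Q * scale
```

Because this path's error is provably bounded, a miscalibrated Hessian, whose compensation updates can push weights far outside the rounding step, can easily do worse, consistent with the 0.18 bpb penalty seen in Run 1.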
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 24, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on Mar 24, 2026

- gptq_calibrate(): collect Hessian H = X^T X via forward hooks on training data
- gptq_quantize_weight(): column-wise int6 with Cholesky error compensation
- _find_best_row_scales(): percentile search for optimal per-row scales
- Integrated into mixed_quantize_int6(); falls back to naive when no Hessian
- Expected: -0.0026 bpb from better quantization alone (PR openai#535 ablation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ruled invalid due to GPTQ calibration on training data during eval phase; closing in favour of newer PRs that perform GPTQ

Author
@0hq am I correct in understanding that GPTQ calibration on
Record: 11L LeakyReLU² + Full GPTQ + QAT Alignment
val_bpb: 1.1204 (3-seed mean, std 0.0001) | 15.85 MB max artifact | 8xH100 SXM, 600s
Key Innovations
Results (3 seeds, 8xH100 SXM)
Mean: 1.1204 | Std: 0.0001
Architecture
11L, 512d, 8H/4KV (GQA), MLP 3x with LeakyReLU(0.5)², XSA4, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), U-Net skips, EMA(0.997), Tight SWA, Late QAT@0.15, Full GPTQ (Hessian-aware int6), zstd-22, FA3 Hopper.
Verification