
Record: 11L LeakyReLU² + Full GPTQ + QAT Alignment (val_bpb: 1.1204) #535

Closed

raahilshah wants to merge 1 commit into openai:main from
raahilshah:submission/2026-03-23_11L_LeakyReLU_GPTQ_QATalign

Conversation

@raahilshah

Record: 11L LeakyReLU² + Full GPTQ + QAT Alignment

val_bpb: 1.1204 (3-seed mean, std 0.0001) | 15.85 MB max artifact | 8xH100 SXM, 600s

Key Innovations

  1. LeakyReLU(0.5)² replacing relu² — prevents dead neurons, doubles effective MLP capacity (-0.0015 BPB)
  2. Full GPTQ with Hessian calibration — 31% quant gap reduction vs percentile search (-0.0026 BPB)
  3. QAT-export alignment — quantile(0.9995) clipping matches STE to export quantizer (-0.0005 BPB)
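The activation change in (1) is small in code. A minimal sketch, assuming a PyTorch MLP and the straightforward reading of "LeakyReLU(0.5)²" as squaring the leaky output (a sign-preserving variant, y·|y|, is another possible interpretation not confirmed by this PR):

```python
import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Baseline activation: zero output and zero gradient for x < 0,
    # so a neuron whose pre-activations go negative can die.
    return F.relu(x) ** 2

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # LeakyReLU(0.5) before squaring: negative pre-activations keep a
    # nonzero gradient while the squaring preserves the gating shape.
    y = F.leaky_relu(x, negative_slope=slope)
    return y * y
```

Gradient check at x = -1: relu² gives d/dx = 0, while the leaky variant gives d/dx (0.25x²) = 0.5x = -0.5, which is the "prevents dead neurons" claim in miniature.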

Results (3 seeds, 8xH100 SXM)

Seed   val_loss   Sliding BPB (s64)   Artifact
7      1.8915     1.1203              15,762,694 bytes
314    1.8919     1.1205              15,732,473 bytes
2024   1.8917     1.1204              15,851,228 bytes

Mean: 1.1204 | Std: 0.0001

Architecture

11L, 512d, 8H/4KV (GQA), MLP 3x with LeakyReLU(0.5)², XSA4, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), U-Net skips, EMA(0.997), Tight SWA, Late QAT@0.15, Full GPTQ (Hessian-aware int6), zstd-22, FA3 Hopper.
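The "Late QAT@0.15" with the quantile(0.9995) clipping from innovation (3) can be sketched as a fake-quantizer with a straight-through estimator. This is an illustrative reconstruction, not the PR's code; the `fake_quant_int6` name and the symmetric ±31 grid are assumptions consistent with the int6 export described above:

```python
import torch

def fake_quant_int6(w: torch.Tensor, q: float = 0.9995) -> torch.Tensor:
    # Clip range taken from the 0.9995 quantile of |w|, so the QAT
    # fake-quantizer sees the same range as the export quantizer
    # (the "QAT-export alignment" above).
    clip = torch.quantile(w.detach().abs().float().flatten(), q)
    scale = clip / 31.0                     # symmetric int6: levels -31..31
    wq = torch.clamp(torch.round(w / scale), -31, 31) * scale
    # Straight-through estimator: forward pass uses quantized weights,
    # backward pass treats quantization as the identity.
    return w + (wq - w).detach()
```

Because the clip threshold matches what the export path will use, the network trains against the exact quantization noise it will see after export, which is where the quoted -0.0005 BPB comes from.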

Verification

  • 3 seeds, all train ≤600s on 8xH100 SXM
  • All artifacts ≤16,000,000 bytes (max: 15,851,228)
  • No TTT on validation data
  • No network calls during evaluation
  • Sliding window eval stride=64, consistent across seeds (std=0.0001)
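The stride-64 sliding-window evaluation can be sketched as follows. This is a hypothetical harness (the `model_nll` callback and the window size are assumptions, not the PR's actual eval code): score each 64-token stride with the preceding tokens as context, sum natural-log NLL, and convert to bits per byte.

```python
import math

def sliding_window_bpb(model_nll, tokens, n_bytes, window=1024, stride=64):
    # model_nll(chunk, n_scored) -> summed natural-log NLL over the last
    # n_scored tokens of chunk, with the rest of the chunk as context.
    total_nll = 0.0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        total_nll += model_nll(tokens[ctx_start:end], end - start)
    # Bits-per-byte: convert nats to bits, normalise by raw byte count.
    return total_nll / (math.log(2) * n_bytes)
```

Fixing the stride (and the window) across seeds is what makes the 0.0001 std between seeds meaningful, since a different stride changes the score independent of the model.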

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
PR openai#535 is at 1.1204 with LeakyReLU2 + Full GPTQ + QAT alignment.
Starting from their code instead of building from our openai#414 base.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
6647 steps at 89ms with FA3 Hopper. 15.62MB fits.
Beats merged leader openai#414 (1.1228) and unmerged openai#535 (1.1204).
3-seed validation starting.
amaljithkuttamath added a commit to amaljithkuttamath/parameter-golf that referenced this pull request Mar 23, 2026
Pre-eval TTT was non-compliant per issue openai#402. Now uses
score-first TTT: score each chunk before training on it.
Added LeakyReLU(0.5)² replacing relu² (proven by openai#569, openai#535).
Score pending rerun with compute credits.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 23, 2026
- gptq_calibrate(): collect Hessian H=X^TX via forward hooks on training data
- gptq_quantize_weight(): column-wise int6 with Cholesky error compensation
- _find_best_row_scales(): percentile search for optimal per-row scales
- Integrated into mixed_quantize_int6() — falls back to naive when no Hessian
- Expected: -0.0026 bpb from better quantization alone (PR openai#535 ablation)
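The functions listed above follow the standard GPTQ recipe (Frantar et al.): accumulate H = XᵀX from calibration activations, then quantize each weight column and spread its rounding error over the not-yet-quantized columns via the Cholesky factor of H⁻¹. A minimal sketch under that assumption (function names mirror the bullets above; the damping constant and per-row max scaling are illustrative choices, not this commit's):

```python
import torch

def gptq_calibrate(layer: torch.nn.Linear, batches) -> torch.Tensor:
    # Accumulate H = X^T X over calibration batches via a forward hook.
    n = layer.weight.shape[1]
    H = torch.zeros(n, n)
    handle = layer.register_forward_hook(
        lambda _, inp, __: H.add_(inp[0].reshape(-1, n).T @ inp[0].reshape(-1, n))
    )
    with torch.no_grad():
        for x in batches:
            layer(x)
    handle.remove()
    return H

def gptq_quantize_weight(W, H, bits=6, damp=0.01):
    # Column-wise quantization with error compensation (GPTQ, no blocking).
    n = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(n)  # damping for stability
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv).T      # upper Cholesky factor of H^-1
    qmax = 2 ** (bits - 1) - 1             # int6 symmetric grid: -31..31
    scale = (W.abs().max(dim=1, keepdim=True).values / qmax).squeeze(1)
    W, Q = W.clone(), torch.zeros_like(W)
    for j in range(n):
        q = torch.clamp(torch.round(W[:, j] / scale), -qmax, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]
        # Push the rounding error onto the columns not yet quantized.
        W[:, j:] -= err.unsqueeze(1) * U[j, j:].unsqueeze(0)
    return Q
```

The Hessian damping term is also the usual guard against the Cholesky instability that later commits in this thread ran into.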

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 24, 2026
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb
from this change. LeakyReLU preserves gradient flow through negative
pre-activations while maintaining the sparsity/gating benefits of
squaring. At 22M params, dead neurons from hard ReLU are expensive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
openai#569 VRL produces 16.0-16.2MB models (over budget).
openai#535+XSA-all reliably fits at 15.6MB with 1.1188 BPB.
3-seed validation starting.
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 24, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: Our GPTQ implementation (ported from PR openai#535) produced
WORSE quantization than standard per-row int6. PR openai#486 base doesn't
use GPTQ at all. Possible issues: bad Hessian calibration, numerical
instability in Cholesky decomposition, or name mismatch between
hooks and state dict keys.

Fix: Disable GPTQ, revert to standard quantization path.
GPTQ code preserved for future debugging.

Also confirmed: TTT bpb formula is algebraically correct.
The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@raahilshah
Author

Ruled invalid: GPTQ calibration was run on training data during the eval phase. Closing in favour of newer PRs that perform GPTQ calibration during training.

@0hq am I correct in understanding that GPTQ calibration on fineweb_train_* if done at training time within the 600s training budget is acceptable?
