Record: 11L VRL + LeakyReLU² + Full GPTQ (3-seed mean val_bpb=1.1175) #569
gowtham0992 wants to merge 1 commit into openai:main from
Conversation
f6abfae to 1a4f3aa
Stack of all verified improvements on current SOTA (PR openai#414, 1.1228 bpb):
- VRL (Value Residual Learning, arxiv:2410.17897): layer 0 V shared via sigmoid gates
- Full GPTQ (Hessian-aware Cholesky int6): -0.0026 bpb over GPTQ-lite
- LeakyReLU(0.5)²: -0.0015 bpb
- Batched LoRA TTT: rank=8 Q+V+LM head, all 11 layers, 2 epochs cosine LR
- Score-before-train every chunk every epoch (backward-looking, fully legal)
- EMA(0.997) + Tight SWA + Late QAT@0.15 + XSA-all(11) + Partial RoPE(16/64)
- LN Scale + VE128(9,10) + SmearGate + BigramHash(2048) + Prune(2%)

Expected: ~1.08–1.10 bpb (non-record pending 3-seed H100 validation)

Attribution: signalrush (PR openai#414), gowtham0992 (PR openai#569), MatoTeziTanka (PR openai#512), LoquiAuris (PR openai#548)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BigramHash(8192) gives -0.0025 bpb, but the artifact is 16.37MB > 16MB. PR openai#569 (VRL sigmoid gates) launched for repro.
Pre-eval TTT was non-compliant per issue openai#402. Now uses score-first TTT: score each chunk before training on it. Added LeakyReLU(0.5)² replacing relu² (proven by openai#569, openai#535). Score pending rerun with compute credits.
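The score-first ordering described here can be sketched in a few lines (a minimal sketch; `score_fn` and `train_fn` are hypothetical stand-ins for the per-chunk eval and LoRA-update steps):

```python
def score_first_ttt(chunks, score_fn, train_fn):
    # Legal TTT ordering: each chunk is scored with the current weights
    # BEFORE the model trains on it, so no chunk influences its own score.
    total = 0.0
    for chunk in chunks:
        total += score_fn(chunk)  # eval with weights that have not seen this chunk
        train_fn(chunk)           # only then adapt on it
    return total
```

The key invariant is the interleaving: score(k) always precedes train(k), so the cumulative score is backward-looking only.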
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb from this change. LeakyReLU preserves gradient flow through negative pre-activations while maintaining the sparsity/gating benefits of squaring. At 22M params, dead neurons from hard ReLU are expensive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
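A minimal scalar sketch of the activation being discussed (assuming the straightforward leak-then-square form; the in-repo version would be a vectorized torch expression):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # Leak first, then square: a negative input maps to a small positive
    # value instead of the hard zero of relu^2, so its gradient path survives.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

For example, an input of -2.0 yields 1.0 here where relu² would yield 0.0; that nonzero response on negatives is exactly the dead-neuron difference the comment refers to.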
Saves layer 0's raw V output and blends it into all subsequent layers via learned sigmoid gates (initialized at -1.5 ≈ 18% mixing). PR openai#569 achieves 1.1175 with VRL+LeakyReLU²+Full GPTQ (no TTT). VRL is orthogonal to our existing VE128 (shared value embedding). Enabled by default (VRL_ENABLED=1). Gate adds 1 scalar param per layer (10 params total for 11L). Zero compute overhead beyond the gated blend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
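The gated blend can be sketched per layer as follows (hypothetical names `vrl_mix`/`v0`; sigmoid(-1.5) ≈ 0.18 gives the ~18% initial mixing mentioned above):

```python
import math

def vrl_mix(v, v0, gate_logit=-1.5):
    # Blend this layer's V with layer 0's saved raw V output through a
    # learned per-layer scalar gate; at init sigmoid(-1.5) ~= 0.18.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [(1.0 - g) * a + g * b for a, b in zip(v, v0)]
```

In the real model the gate logit would be a trainable scalar per layer (10 of them for layers 1–10), which is the entire parameter cost quoted.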
openai#569 VRL produces 16.0-16.2MB models (over budget). openai#535+XSA-all reliably fits at 15.6MB with 1.1188 bpb. 3-seed validation starting.
…x LeakyReLU author links
Are you counting the calibration time as part of the training 600s, or the eval 600s? If it's part of training (and you can prove it), then I am more inclined to believe this is legal (I would have to look into it further), but it does not meet the minimum nat-difference, so it would be a non-SOTA valid submission. If it's part of eval time, then this is not valid, as it is accessing training data at eval time.
Good catch — GPTQ calibration currently runs in an untimed gap after the training loop breaks at 600s and before the eval timer starts. It's ~10-15s of forward passes on training data to collect Hessians for weight rounding (no gradient updates). I can see how that's ambiguous under the rules, so I'll update the submission to count calibration time inside the 600s training budget. The bpb impact from fewer training steps should be minimal. Will push an updated version shortly.
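For context, the calibration step in question is pure forward-pass statistics gathering. A pure-Python sketch of the Hessian accumulation (hypothetical names; real GPTQ accumulates per-layer input outer products and then does Cholesky-based, Hessian-aware rounding):

```python
def accumulate_hessian(H, x):
    # GPTQ-style calibration: add the outer product x x^T of one
    # forward-pass activation vector; no gradient ever touches the weights.
    d = len(x)
    for i in range(d):
        for j in range(d):
            H[i][j] += x[i] * x[j]
    return H
```

Repeating this over calibration batches yields the H used to decide rounding order and error compensation at quantization time.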
Also, if you don't mind, I'll close this PR for now -- feel free to send more messages here, but a new submission would have a new submission date, so I'd rather you made a new PR.
- Autoresearch loop (program.md, loop.sh, generate_next.py)
- Modal provider for 8xH100 training with checkpoint save/restore
- Experiment framework with preflight size checks
- eval_ttt.py for TTT evaluation against saved checkpoints
- train_gpt_improved.py: PR openai#569 base (VRL, GPTQ, LeakyReLU², pruning)
- train_gpt_576.py: PR openai#576 base (int5, 33.6M params, score-first TTT)
- train_gpt_sota.py: PR openai#573 base
- train_gpt_mlx_recurrent.py: depth recurrence experiments
- Benchmark scripts for local MLX A/B testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Architecture (matching frontier openai#569 + improvements):
- 11 layers (from 10), XSA on last 4 layers
- VRL: value residual from layer 1 to all subsequent layers
- Partial RoPE: only first 16/64 dims get rotation
- LeakyReLU(0.5)² activation (replaces relu²)
- LN Scale: 1/sqrt(layer_idx+1) dampens deeper layers

Training:
- EMA decay=0.997 (replaces SWA)
- Late QAT: activates when LR scale < 0.15 during warmdown
- Removed MTP (hurt throughput, no bpb gain)
- Removed LoRA TTT (legal version only 0.001 bpb improvement)

Quantization:
- GPTQ-lite: tries 5 clip percentiles per row, picks best MSE
- Mixed int5 MLP / int6 attention + zstd-22

1425 lines (75 headroom for Mousse + Hyper-Connections)
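One of the cheaper items listed, LN Scale, can be sketched in a few lines (assumed form from the 1/sqrt(layer_idx+1) description; where exactly it multiplies into the residual stream is not specified here):

```python
import math

def ln_scale(h, layer_idx):
    # Damp a layer's output by 1/sqrt(layer_idx + 1): layer 0 passes
    # through unscaled, deeper layers contribute progressively less.
    s = 1.0 / math.sqrt(layer_idx + 1)
    return [s * v for v in h]
```

For instance, layer 3 gets a factor of 1/sqrt(4) = 0.5, halving its contribution relative to layer 0.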
New submission filed as #738 — GPTQ calibration now inside the training budget per the #677 ruling, plus novel eval-time innovations (hidden-state kNN-LM + 5-gram cache with pre-committed confidence gate, no safety gate). 3-seed mean 1.0970 bpb.
Approach A (openai#569 int5, no TTT): 1.1317 — int5 penalty too high at d=512
Approach B (openai#576 d=576 int5 + legal s_0 TTT): 1.1188 — best legal result
Approach C (GEPA int5 + TTT): artifact over 16MB

Key lesson: TTT re-scoring is illegal (PR openai#991 closed for this). Only the s_0 cumulative first-pass score is legal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Submission train_gpt.py with all 32 techniques from the execution plan, each gated by environment variables (disabled by default)
- Optuna-based search framework with validate mode (per-technique smoke test) and search mode (TPE over the joint technique + model size space)
- Ablation infrastructure (ablation.py, shell scripts) for tracking experiments
- PR source files for reference (openai#505, openai#569, openai#576, openai#727, openai#738)
- Execution plan document

Techniques span architecture (activations, HybridNorm, SmearGate, DiffAttn, PoPE, WaveletGPT, VGA, XSA), training (EMA, SWA, QAT, MTP), quantization (variable bit-width, OptRot, GPTQ, pruning, entropy coding), and eval-time (TTT-LoRA, n-gram cache, kNN-LM, TurboQuant KV compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
3-seed mean val_bpb: 1.1175 (std 0.0005) | ≤15.94 MB | 8xH100 SXM, 600s | No TTT
Key Techniques
Results
vs SOTA (1.1228): improvement = 0.0053 nats
Reproduction
pip install -r requirements.txt
python3 data/cached_challenge_fineweb.py --variant sp1024
BACKOUT_ENABLED=0 SEED=42 python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py