
Record: 11L VRL + LeakyReLU² + Full GPTQ (3-seed mean val_bpb=1.1175) #569

Closed
gowtham0992 wants to merge 1 commit into openai:main from gowtham0992:submission/11L-VRL-FullGPTQ-LeakyReLU2

Conversation

@gowtham0992

Summary

3-seed mean val_bpb: 1.1175 (std 0.0005) | ≤15.94 MB | 8xH100 SXM, 600s | No TTT

Key Techniques

  • Value Residual Learning (arxiv:2410.17897): Layer 0's V output added to all subsequent layers via learned sigmoid gates. First non-TTT VRL result on standard architecture.
  • LeakyReLU(0.5)²: Replaces relu² — preserves negative gradient flow
  • Full GPTQ (IST-DASLab/gptq, ICLR 2023): Hessian-aware int6 quantization with Cholesky inverse error compensation
  • QAT-export alignment: STE clip quantile(0.9995) matches GPTQ export
  • 2% magnitude pruning: Post-quant zeroing for zstd compressibility
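
The Value Residual Learning blend described above can be sketched in a few lines. This is a minimal numpy illustration of the idea (layer 0's V output mixed into a later layer through a per-layer sigmoid gate), not the PR's actual implementation; the helper names and the convex-blend form are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vrl_blend(v_layer, v0, gate_logit):
    """Blend this layer's V output with layer 0's V via a learned sigmoid gate.

    gate_logit is a single learned scalar per layer. sigmoid(-1.5) ~= 0.18,
    so an init of -1.5 mixes roughly 18% of layer 0's values at the start
    of training. (Hypothetical helper; the PR's exact code is not shown.)
    """
    a = sigmoid(gate_logit)
    return (1.0 - a) * v_layer + a * v0

# Tiny demo: at the -1.5 init, ~18% of v0 leaks into a later layer's V.
v0 = np.ones((4, 8))    # layer 0 value output (seq=4, head_dim=8)
v3 = np.zeros((4, 8))   # some later layer's value output
out = vrl_blend(v3, v0, -1.5)
```

The gate costs one scalar parameter per gated layer, which matches the "1 scalar param per layer" accounting quoted later in this thread.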

Results

| Seed | Pre-quant BPB | Post-quant BPB | Size |
|------|---------------|----------------|----------|
| 42   | 1.1380 | 1.1169 | 15.84 MB |
| 1337 | 1.1386 | 1.1176 | 15.94 MB |
| 2024 | 1.1390 | 1.1179 | 15.64 MB |
| Mean | 1.1385 | 1.1175 | |

vs SOTA (1.1228): improvement = 0.0053 bpb

Reproduction

pip install -r requirements.txt
python3 data/cached_challenge_fineweb.py --variant sp1024
BACKOUT_ENABLED=0 SEED=42 python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py
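
The 2% magnitude-pruning step listed under Key Techniques can be sketched as follows. This is a minimal per-tensor version under stated assumptions (global fraction, post-quantization zeroing); the PR's actual threshold selection is not shown here.

```python
import numpy as np

def magnitude_prune(w, frac=0.02):
    """Zero the smallest-|w| fraction of weights (applied post-quantization).

    More zeros mean longer runs of identical bytes, which improves the
    zstd compression ratio of the exported artifact. Ties at the threshold
    may zero slightly more than frac of the weights.
    """
    k = int(frac * w.size)
    if k == 0:
        return w
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)
```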

@gowtham0992 gowtham0992 force-pushed the submission/11L-VRL-FullGPTQ-LeakyReLU2 branch from f6abfae to 1a4f3aa Compare March 23, 2026 20:02
ADIITJ added a commit to ADIITJ/parameter-golf that referenced this pull request Mar 23, 2026
Stack of all verified improvements on current SOTA (PR openai#414, 1.1228 bpb):
- VRL (Value Residual Learning, arxiv:2410.17897): layer 0 V shared via sigmoid gates
- Full GPTQ (Hessian-aware Cholesky int6): -0.0026 bpb over GPTQ-lite
- LeakyReLU(0.5)²: -0.0015 bpb
- Batched LoRA TTT: rank=8 Q+V+LMhead all 11 layers, 2 epochs cosine LR
- Score-before-train every chunk every epoch (backward-looking, fully legal)
- EMA(0.997) + Tight SWA + Late QAT@0.15 + XSA-all(11) + Partial RoPE(16/64)
- LN Scale + VE128(9,10) + SmearGate + BigramHash(2048) + Prune(2%)

Expected: ~1.08–1.10 bpb (non-record pending 3-seed H100 validation)

Attribution: signalrush (PR openai#414), gowtham0992 (PR openai#569),
MatoTeziTanka (PR openai#512), LoquiAuris (PR openai#548)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
BigramHash(8192) gives -0.0025 BPB but the 16.37 MB artifact exceeds the 16 MB budget.
PR openai#569 (VRL sigmoid gates) launched for repro.
amaljithkuttamath added a commit to amaljithkuttamath/parameter-golf that referenced this pull request Mar 23, 2026
Pre-eval TTT was non-compliant per issue openai#402. Now uses
score-first TTT: score each chunk before training on it.
Added LeakyReLU(0.5)² replacing relu² (proven by openai#569, openai#535).
Score pending rerun with compute credits.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 23, 2026
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 24, 2026
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb
from this change. LeakyReLU preserves gradient flow through negative
pre-activations while maintaining the sparsity/gating benefits of
squaring. At 22M params, dead neurons from hard ReLU are expensive.
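
The gradient-flow argument above can be made concrete with a small sketch. This assumes LeakyReLU(0.5)² means "apply LeakyReLU with slope 0.5, then square"; whether the sign is restored after squaring is not specified in these PRs, so the plain square below is an assumption.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring.

    relu(x)**2 has zero gradient for all x < 0, so neurons can die.
    Here the negative branch is slope*x, giving d/dx = 2 * slope**2 * x
    for x < 0: gradient flow is preserved while squaring keeps the
    sparsity/gating shape near zero.
    """
    y = np.where(x > 0, x, slope * x)
    return y * y
```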

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 24, 2026
Saves layer 0's raw V output and blends it into all subsequent layers
via learned sigmoid gates (initialized at -1.5 ≈ 18% mixing).
PR openai#569 achieves 1.1175 with VRL+LeakyReLU²+Full GPTQ (no TTT).
VRL is orthogonal to our existing VE128 (shared value embedding).

Enabled by default (VRL_ENABLED=1). Gate adds 1 scalar param per layer
(10 params total for 11L). Zero compute overhead beyond the gated blend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
openai#569 VRL produces 16.0-16.2MB models (over budget).
openai#535+XSA-all reliably fits at 15.6MB with 1.1188 BPB.
3-seed validation starting.
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 24, 2026
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 24, 2026
@valerio-oai
Contributor

valerio-oai commented Mar 24, 2026

Are you counting the calibration time as part of the training 600s, or the eval 600s? If it's part of training (and you can prove it), then I am more inclined to believe this is legal (I would have to look into it further), but it does not meet the minimum nat-difference, so it would be a non-SOTA valid submission. If it's part of eval time, then this is not valid, as it is accessing training data at eval time.

@valerio-oai valerio-oai reopened this Mar 24, 2026
@gowtham0992
Author

> Are you counting the calibration time as part of the training 600s, or the eval 600s? If it's part of training (and you can prove it), then I am more inclined to believe this is legal (would have to look more into it) but it does not meet the minimum nat-difference, so it is a non-SOTA valid submission; if it's part of eval time then this is not valid as it is accessing training data at eval time.

Good catch — GPTQ calibration currently runs in an untimed gap after the training loop breaks at 600s and before the eval timer starts. It's ~10-15s of forward passes on training data to collect Hessians for weight rounding (no gradient updates).

I can see how that's ambiguous under the rules, so I'll update the submission to count calibration time inside the 600s training budget. The BPB impact from fewer training steps should be minimal. Will push an updated version shortly.
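
For reference, the step being timed here (Hessian collection from forward passes, then Cholesky-factored error compensation during rounding) is the core of GPTQ. A minimal single-block sketch, simplified from Frantar et al. (ICLR 2023): one shared symmetric scale, no activation reordering or lazy batching, names and damping constant chosen for illustration.

```python
import numpy as np

def rtn(w, scale, qmax):
    """Round-to-nearest symmetric quantization onto the int grid."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_quantize(W, X, bits=6):
    """Minimal GPTQ for one linear layer.

    W: (rows, cols) weights; X: (cols, n) calibration activations.
    H = X X^T is the (damped) proxy Hessian. Columns are quantized
    left-to-right; each column's rounding error is folded into the
    not-yet-quantized columns via the upper Cholesky factor of H^-1.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    H = X @ X.T
    H += np.eye(cols) * 1e-2 * np.mean(np.diag(H))  # damping for stability
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T   # upper triangular
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(W)) / qmax
    for i in range(cols):
        q = rtn(W[:, i], scale, qmax)
        err = (W[:, i] - q) / Hinv[i, i]
        W[:, i] = q
        if i + 1 < cols:
            # Error compensation: shift remaining columns to absorb the error.
            W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return W
```

The calibration forward passes only need activations to build H; no gradients are computed, which matches the "~10-15s of forward passes" description above.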

@valerio-oai
Contributor

Also, if you don't mind I'll close this PR for now -- feel free to send more messages here but a new submission would have a new submission date, so I'd rather you made a new PR.

nishant-resolve-ai pushed a commit to nishant-resolve-ai/parameter-golf that referenced this pull request Mar 24, 2026
- Autoresearch loop (program.md, loop.sh, generate_next.py)
- Modal provider for 8xH100 training with checkpoint save/restore
- Experiment framework with preflight size checks
- eval_ttt.py for TTT evaluation against saved checkpoints
- train_gpt_improved.py: PR openai#569 base (VRL, GPTQ, LeakyReLU², pruning)
- train_gpt_576.py: PR openai#576 base (int5, 33.6M params, score-first TTT)
- train_gpt_sota.py: PR openai#573 base
- train_gpt_mlx_recurrent.py: depth recurrence experiments
- Benchmark scripts for local MLX A/B testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kelvinyuefanli added a commit to kelvinyuefanli/parameter-golf that referenced this pull request Mar 24, 2026
Architecture (matching frontier openai#569 + improvements):
- 11 layers (from 10), XSA on last 4 layers
- VRL: value residual from layer 1 to all subsequent layers
- Partial RoPE: only first 16/64 dims get rotation
- LeakyReLU(0.5)² activation (replaces relu²)
- LN Scale: 1/sqrt(layer_idx+1) dampens deeper layers

Training:
- EMA decay=0.997 (replaces SWA)
- Late QAT: activates when LR scale < 0.15 during warmdown
- Removed MTP (hurt throughput, no BPB gain)
- Removed LoRA TTT (legal version only 0.001 BPB improvement)

Quantization:
- GPTQ-lite: tries 5 clip percentiles per row, picks best MSE
- Mixed int5 MLP / int6 attention + zstd-22

1425 lines (75 headroom for Mousse + Hyper-Connections)
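
The GPTQ-lite variant mentioned above ("tries 5 clip percentiles per row, picks best MSE") can be sketched as a per-row clip search. The percentile list and helper names below are assumptions for illustration, not the commit's actual values.

```python
import numpy as np

def quantize_row(w, clip, bits=6):
    """Symmetric round-to-nearest quantization with a given clip threshold."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_lite_row(w, bits=6, percentiles=(99.0, 99.5, 99.9, 99.95, 100.0)):
    """Try a few clip percentiles for one weight row; keep the lowest-MSE result.

    Cheaper than full GPTQ: no Hessian, no error propagation, just a
    small grid search over how aggressively to clip outlier weights.
    """
    best, best_mse = None, np.inf
    for p in percentiles:
        clip = np.percentile(np.abs(w), p)
        if clip == 0:
            continue
        wq = quantize_row(w, clip, bits)
        mse = np.mean((w - wq) ** 2)
        if mse < best_mse:
            best, best_mse = wq, mse
    return best
```

Clipping below the 100th percentile trades a large error on a few outliers for a finer grid on the bulk of the row, which is why searching several thresholds per row can beat a single global choice.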
@gowtham0992
Author

> Also, if you don't mind I'll close this PR for now -- feel free to send more messages here but a new submission would have a new submission date, so I'd rather you made a new PR.

New submission filed as #738 — GPTQ calibration now inside training budget per #677 ruling, plus novel eval-time innovations (hidden-state kNN-LM + 5-gram cache with pre-committed confidence gate, no safety gate). 3-seed mean 1.0970 BPB.

ibarrajo added a commit to ibarrajo/parameter-golf that referenced this pull request Mar 28, 2026
Approach A (openai#569 int5 no TTT): 1.1317 — int5 penalty too high on d=512
Approach B (openai#576 d=576 int5 + legal s_0 TTT): 1.1188 — best legal result
Approach C (GEPA int5 + TTT): artifact over 16MB

Key lesson: TTT re-scoring is illegal (PR openai#991 closed for this).
Only s_0 cumulative first-pass score is legal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MichaelMcCulloch pushed a commit to MichaelMcCulloch/parameter-golf that referenced this pull request Mar 29, 2026
- Submission train_gpt.py with all 32 techniques from the execution plan,
  each gated by environment variables (disabled by default)
- Optuna-based search framework with validate mode (per-technique smoke test)
  and search mode (TPE over joint technique + model size space)
- Ablation infrastructure (ablation.py, shell scripts) for tracking experiments
- PR source files for reference (openai#505, openai#569, openai#576, openai#727, openai#738)
- Execution plan document

Techniques span architecture (activations, HybridNorm, SmearGate, DiffAttn,
PoPE, WaveletGPT, VGA, XSA), training (EMA, SWA, QAT, MTP), quantization
(variable bit-width, OptRot, GPTQ, pruning, entropy coding), and eval-time
(TTT-LoRA, n-gram cache, kNN-LM, TurboQuant KV compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>