Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed) by vimeto · Pull Request #1072 · openai/parameter-golf

vimeto · 2026-03-29T11:34:17Z

Fused LeakyReLU² + Online GPTQ + Parallel Muon

val_bpb: 1.117 (1-seed, stride=16, pending 3-seed confirmation)
Artifact: 15.95 MB (with selective ±1 pruning)
No TTT — pure neural model with sliding window evaluation

Key Innovations

1. Fused Triton MLP Kernel — Custom Triton kernel fusing F.linear → LeakyReLU(0.5) → square into one GPU pass. This eliminates the write of the 1536-dim intermediate tensor to HBM in each layer. Result: 70ms/step (vs 87ms without) on 8xH100 SXM, which at these step times is about 24% more training steps in the same wall-clock time (87/70 ≈ 1.24).

2. Online Hessian GPTQ — Hessian matrices (H = X^T X) are accumulated during training via separate uncompiled forward passes every 25 steps. This removes the tradeoff between training time and time reserved for GPTQ calibration: the run uses the full 600s training budget and still gets Full GPTQ quality.

3. Selective ±1 Pruning — After INT6 quantization, adaptively zeros the least-significant ±1 weights (sorted by scale²), pruning only as many as needed to keep the artifact at ≤16MB.
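
The selective ±1 pruning step in item 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: it assumes one scale per weight row, treats the squared error of zeroing a ±1 entry as scale², and measures size by LZMA-compressing one byte per weight. The function names (`pm1_candidates`, `zero_cheapest`, `packed_size`, `prune_to_budget`) are hypothetical.

```python
import lzma

def pm1_candidates(q_rows, scales):
    """List (cost, row, col) for every quantized weight equal to +1 or -1.

    Assumption: zeroing a value of +/-1 in a row with scale s shifts the
    dequantized weight by s, so the squared error is s**2. Cheapest first.
    """
    cands = [(scales[r] ** 2, r, c)
             for r, row in enumerate(q_rows)
             for c, v in enumerate(row) if abs(v) == 1]
    cands.sort()
    return cands

def zero_cheapest(q_rows, scales, k):
    """Return a copy of q_rows with the k cheapest +/-1 entries set to 0."""
    out = [list(row) for row in q_rows]
    for _, r, c in pm1_candidates(q_rows, scales)[:k]:
        out[r][c] = 0
    return out

def packed_size(q_rows):
    """Compressed size in bytes (illustrative packing: one byte per weight)."""
    # INT6 values lie in [-32, 31]; shift them into the 0..63 byte range.
    raw = bytes(v + 32 for row in q_rows for v in row)
    return len(lzma.compress(raw, preset=9))

def prune_to_budget(q_rows, scales, budget_bytes, step=1):
    """Zero +/-1 weights, cheapest first, until the size fits the budget.

    Stops early if every +/-1 entry is already zeroed. Returns the pruned
    rows and the number of entries that were zeroed.
    """
    n = len(pm1_candidates(q_rows, scales))
    k = 0
    while True:
        pruned = zero_cheapest(q_rows, scales, k)
        if packed_size(pruned) <= budget_bytes or k >= n:
            return pruned, k
        k = min(n, k + step)
```

Zeroing the entries with the smallest scale² first spends the least reconstruction error per pruned weight, and longer runs of zeros tend to compress better, which is what lets the count be tuned against a byte budget.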

Results

Seed	Steps	Step avg	Pre-quant	Sliding BPB	Stride	Artifact
1337	7,904	70.0ms	1.1290	1.1170	16	15.95MB
42	—	—	—	pending	—	—
2025	—	—	—	pending	—	—

3-seed runs pending due to cloud GPU infrastructure instability. Projected 3-seed mean: ~1.117.

Architecture

11L/512d, 8H/4KV GQA, LeakyReLU(0.5)², XSA all 11 layers, BigramHash 4096, VE128 layers 9-10, SmearGate, Partial RoPE 16/64, U-Net skips, LN Scale 1/√(layer+1), logit softcap 30.
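
Three of the scalar pieces named above can be written out as a plain-Python reference. This is a sketch for reading the notation, not the PR's Triton or PyTorch code. It assumes the softcap uses the common `cap * tanh(logit / cap)` form and that layers are counted from 0 in the LN scale; the function names are hypothetical.

```python
import math

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(0.5) followed by squaring: x**2 for x > 0, (slope*x)**2 otherwise."""
    y = x if x > 0.0 else slope * x
    return y * y

def mlp_hidden(x, weight, slope=0.5):
    """Unfused reference for linear -> LeakyReLU -> square on one input vector.

    `weight` is a list of rows, one per hidden unit. The fused kernel described
    in the PR produces the same values without storing the pre-activation.
    """
    return [leaky_relu_sq(sum(w * xi for w, xi in zip(row, x)), slope)
            for row in weight]

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap] (assumed tanh form)."""
    return cap * math.tanh(logit / cap)

def ln_scale(layer):
    """Per-layer scale 1/sqrt(layer + 1), assuming 0-based layer indices."""
    return 1.0 / math.sqrt(layer + 1)
```

For example, `mlp_hidden([1.0, -1.0], [[2.0, 0.0], [0.0, 2.0]])` has pre-activations 2.0 and -2.0, which map to 4.0 and 1.0: the negative branch is halved before squaring, so it contributes a quarter of the positive branch's value.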

Training

Parallel Muon (parameter banking, 3-phase overlapped reduce-scatter/all-gather, no DDP) + Adam. 786K batch, warmdown=3000, QAT@0.5, EMA 0.997, SWA every 50. Online Hessian GPTQ INT6 + LZMA preset=9 + selective ±1 pruning.
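
The online Hessian accumulation used for GPTQ can be illustrated with a small accumulator that adds X^T X for a layer's inputs on every 25th step. This is a sketch under assumptions: the class name `HessianAccumulator`, the list-of-rows input format, and the normalization by row count are illustrative choices, not taken from the PR.

```python
class HessianAccumulator:
    """Running sum of X^T X for one layer's inputs, updated every `every` steps."""

    def __init__(self, dim, every=25):
        self.H = [[0.0] * dim for _ in range(dim)]
        self.every = every
        self.n_rows = 0  # number of input rows folded into H so far

    def maybe_update(self, step, X):
        """Add X^T X to H when `step` is a multiple of `every`.

        X is a list of input rows, each of length dim. Returns True when
        an update happened, False when the step was skipped.
        """
        if step % self.every != 0:
            return False
        for x in X:
            for i, xi in enumerate(x):
                row = self.H[i]
                for j, xj in enumerate(x):
                    row[j] += xi * xj
        self.n_rows += len(X)
        return True

    def mean(self):
        """Return H divided by the number of rows seen (assumed normalization)."""
        n = max(self.n_rows, 1)
        return [[v / n for v in row] for row in self.H]
```

Because H is a plain sum over inputs, contributions from different steps add up without storing any activations, so the statistics GPTQ needs are available as soon as training ends.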

Comparison

Entry	Sliding BPB	TTT?
This (projected)	1.117	No
Merged SOTA (PR #549)	1.1194	Yes
PR #549 pre-TTT	1.1218	No

Credits

Built on: PR #549 (Parallel Muon), PR #414 (base arch), PR #198 (XSA), PR #287 (Partial RoPE), PR #493 (LeakyReLU²), modded-nanogpt (fused kernel pattern).

… reset

Combines the best of three approaches:
PR openai#1060 (1.1122): coprime loader + Full GPTQ + XSA-all
PR openai#1072 (1.117): fused Triton MLP (matmul+activation, 70ms/step)
Ours: TTT periodic reset (anti-drift)

Expected: ~7900 steps (vs 6700) with PR openai#1060 quality innovations = best training throughput + best quantization + best eval.

Fused MLP kernel from PR openai#1072 uses TMA TensorDescriptors (H100 only). Falls back to standard path on non-Hopper GPUs.

TTT sweep tests 4 configs on the same trained checkpoint: sota_ttt, pr1039, reset/100, reset/50
Total H100 time: ~10min train + 4×7min TTT ≈ 40 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both from top submissions, zero code risk:
MUON_BACKEND_STEPS=4 (PR openai#1089): 4 NS iterations vs 5. Saves ~1-2ms/step, proven at 1.1086 BPB
BIGRAM_VOCAB_SIZE=4096 (PR openai#1072): larger hash table. More n-gram patterns, proven at 1.117 BPB

MLP 3.5x investigated but doesn't fit 16MB budget (+2.2MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed, pending 3-seed)

0561eb8

Bortlesboat mentioned this pull request Mar 31, 2026

Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt — val_bpb 1.1126 (3-seed mean) #1169

Open

6 tasks

Gusanidas mentioned this pull request Apr 1, 2026

Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean) #1212

Open

vimeto wants to merge 1 commit into openai:main from vimeto:main
