
Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309) #493

Open

parinzee wants to merge 1 commit into openai:main from parinzee:submission/2026-03-23-11L-EMA-Int6

Conversation

@parinzee

Summary

  • val_bpb: 1.1309 (mean of 3 seeds, std: 0.00017)
  • 11 layers, 512 dim, 8H/4KV GQA
  • Artifact: ~15.8 MB (all seeds under 16 MB)

3-Seed Results

| Seed | val_bpb | artifact_bytes |
|------|---------|----------------|
| 42   | 1.13109 | 15,764,564     |
| 1337 | 1.13085 | 15,626,741     |
| 2024 | 1.13067 | 15,923,256     |
| Mean | 1.13087 |                |
| Std  | 0.00017 |                |

Key Changes from Baseline

  1. 11 layers (up from 10), 512 dim, 8 heads / 4 KV heads (GQA)
  2. XSA (Exclusive Self Attention) on last 4 layers for better representation
  3. LeakyReLU(0.5)² activation — squared leaky ReLU with 0.5 negative slope
  4. Partial RoPE — only 16/64 dims use rotary embeddings
  5. EMA weight averaging (decay=0.997) for smoother final weights
  6. Int6 quantization for all large weight matrices + zstd-22 compression
  7. Scale clamping fix — clamp_min(1/clip_range) improves quantization quality
  8. Smaller batch size (524288 tokens) to fit more training steps (~8200 steps in 600s)
  9. BigramHash(2048, dim=128) token embeddings
  10. warmdown_iters=4500 for learning rate schedule
  11. Higher learning rates (matrix_lr=0.025, scalar_lr=0.025)
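The activation change (item 3) replaces the baseline's squared ReLU with a squared leaky ReLU: the square keeps the output non-negative while the 0.5 slope keeps gradients flowing for negative pre-activations. A minimal sketch (NumPy here for illustration; the PR itself uses PyTorch):

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """Squared leaky ReLU: (leaky_relu(x, slope))**2.

    For x >= 0 this is x**2 (same as the squared-ReLU baseline);
    for x < 0 it is (slope * x)**2, so the output stays non-negative
    but the gradient (2 * slope**2 * x) is nonzero on the negative side.
    """
    y = np.where(x >= 0, x, slope * x)
    return y * y
```

For example, `leaky_relu_sq` maps 2.0 to 4.0 and -2.0 to 1.0, whereas plain squared ReLU would map -2.0 to 0.0 with a dead gradient.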

Run Command

torchrun --standalone --nproc_per_node=8 train_gpt.py
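The EMA averaging (change 5, decay=0.997) is one update per training step over a shadow copy of the weights; the shadow copy is what gets quantized and shipped. A sketch, assuming the weights live in a dict of arrays (the real training state layout is not shown in this PR):

```python
import numpy as np

def ema_update(ema: dict, params: dict, decay: float = 0.997) -> dict:
    """One EMA step: ema <- decay * ema + (1 - decay) * params.

    decay=0.997 matches the PR; the dict-of-arrays layout is an
    assumption for illustration.
    """
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema
```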

Built on SOTA baseline by @thwu1 (PR #180).
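Changes 6 and 7 (int6 quantization plus the scale-clamp fix) can be sketched as symmetric round-to-nearest quantization with the per-tensor scale floored so it cannot collapse. Where exactly the `clamp_min(1/clip_range)` sits, and the `clip_range` value, are assumptions here; the PR only names the fix. The int6 codes would then be bit-packed and compressed with zstd level 22 to form the artifact (compression step omitted below):

```python
import numpy as np

def quantize_int6(w: np.ndarray, clip_range: float = 4.0):
    """Symmetric int6 quantization: codes in [-31, 31].

    The scale floor mirrors the PR's clamp_min(1 / clip_range) fix
    (placement and clip_range=4.0 are assumptions, not from the PR).
    """
    scale = float(np.abs(w).max()) / 31.0
    scale = max(scale, 1.0 / clip_range)  # scale-clamp fix
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Round-to-nearest guarantees the per-element reconstruction error is at most half a quantization step (`scale / 2`) whenever no clipping occurs.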

…1309)

3-seed validation results:
- Seed 42:   val_bpb=1.13109, artifact=15,764,564 bytes
- Seed 1337: val_bpb=1.13085, artifact=15,626,741 bytes
- Seed 2024: val_bpb=1.13067, artifact=15,923,256 bytes
- Mean: 1.13087 (std: 0.00017)

Key techniques: 11 layers, GQA (8H/4KV), XSA on last 4 layers,
LeakyReLU(0.5)², Partial RoPE (16/64), EMA (0.997), int6 quantization,
zstd-22 compression, BigramHash(2048,128), warmdown_iters=4500.

Built on baseline by @thwu1 (PR openai#180).
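The partial RoPE listed above rotates only 16 of the 64 head dimensions and passes the rest through unchanged. A sketch for a single head; the first-half/second-half pairing convention and the `base=10000.0` frequency base are assumptions (the PR does not spell them out):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to the first `rot_dims` of each head dim.

    x: (seq, head_dim). Dims [rot_dims:] are passed through untouched,
    which is the 'partial' in partial RoPE (16/64 in this PR).
    """
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)
```

Because each (x1, x2) pair undergoes a pure rotation, the norm of the rotated slice is preserved, and position 0 is the identity.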
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
- replace relu(x)^2 with leaky_relu(x, 0.5)^2
- PR openai#493 reaches 1.1309 with partial stack using this activation
- untried on full openai#414 stack — could give -0.002 to -0.005 BPB
- zero param cost, zero speed overhead
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 23, 2026
…bpb 1.1178

3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM.

Novel contribution: Late Soft-Round QAT — replaces STE identity surrogate
with sigmoid soft-round in the backward pass during the final 2% of training,
giving bin-aware gradients that settle weights onto int6 grid points.

Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
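The late soft-round QAT described above keeps a hard round in the forward pass but replaces the STE's identity backward with the derivative of a smooth staircase. One plausible parameterization is sketched below (tanh is the symmetric form of a sigmoid; the exact surrogate, sharpness `alpha`, and scheduling are assumptions, not taken from the commit):

```python
import numpy as np

def qat_round(x: np.ndarray, alpha: float = 5.0):
    """Hard round forward, soft-round slope for the backward pass.

    Soft round: s(x) = m + tanh(alpha*(x - m)) / (2*tanh(alpha/2)),
    with m = floor(x) + 0.5. Its derivative is near zero at integer
    grid points and peaks at bin boundaries, so weights near a grid
    point stop moving ('settle') while boundary weights get pushed.
    This exact form is an assumption; the PR only says 'sigmoid
    soft-round'.
    """
    y = np.round(x)                       # forward: hard quantization
    r = x - np.floor(x) - 0.5             # offset from bin center of s(x)
    grad = alpha * (1.0 - np.tanh(alpha * r) ** 2) / (2.0 * np.tanh(alpha / 2.0))
    return y, grad                        # grad would replace STE's 1.0
```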
Fraser-Greenlee pushed a commit to Fraser-Greenlee/parameter-golf that referenced this pull request Mar 25, 2026
- Interleaved draft tokens: soft predictions placed between real tokens
  for 1-2 token lookahead via standard causal attention
- SmearGate and BigramHash naturally gain future context on interleaved seq
- Bigram noise curriculum: drafts anneal from GT to realistic noise
- Two-pass eval: pass 1 generates drafts, pass 2 refines with interleaving
- LeakyReLU(0.5)² activation toggle (free -0.003 BPB from PR openai#493)
- W&B logging (opt-in via WANDB_PROJECT env var)
- Sweep runner with 13 configs covering baselines, draft variants, and ablations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
srchandrupatla added a commit to srchandrupatla/parameter-golf that referenced this pull request Mar 25, 2026
LeakyReLU(0.5)²: preserves negative gradient flow through MLP while
maintaining non-negative output. ~0.003 BPB improvement per PR openai#493.

Legal TTT (test-time training): at eval time, split val tokens into
32K-token chunks, score each chunk under inference_mode(), then train
on the already-scored chunk with SGD. Gives ~0.0025 BPB improvement
per PR openai#461. Score-first protocol guarantees no future information
leaks into scored tokens.
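The score-first TTT protocol above can be sketched as a loop that scores each chunk under frozen weights before taking any optimization step on it. The `ToyModel` below (a unigram counter standing in for the transformer) and its `score_bits`/`sgd_step` API are invented for illustration; only the chunk size and the score-then-train ordering come from the description:

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for the real model: a unigram token model."""
    def __init__(self, vocab: int = 256):
        self.counts = np.ones(vocab)           # Laplace-smoothed counts
    def score_bits(self, chunk: np.ndarray) -> float:
        p = self.counts / self.counts.sum()    # frozen weights for scoring
        return float(-np.log2(p[chunk]).sum())
    def sgd_step(self, chunk: np.ndarray) -> None:
        np.add.at(self.counts, chunk, 1.0)     # 'train' on the scored chunk

def ttt_eval(model, tokens: np.ndarray, chunk_tokens: int = 32768) -> float:
    """Legal TTT: score each chunk, THEN adapt on it.

    No scored token ever sees an update derived from itself or from
    any future token, which is the leak-free guarantee. chunk_tokens
    matches the 32K in the description.
    """
    bits = 0.0
    for i in range(0, len(tokens), chunk_tokens):
        c = tokens[i:i + chunk_tokens]
        bits += model.score_bits(c)   # score first, under frozen weights
        model.sgd_step(c)             # adapt only after the chunk is scored
    return bits / len(tokens)         # bits per token
```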
Mistobaan pushed a commit to Mistobaan/parameter-golf that referenced this pull request Mar 25, 2026
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
anish-krishnan pushed a commit to anish-krishnan/parameter-golf that referenced this pull request Mar 30, 2026
Itssshikhar pushed a commit to Itssshikhar/parameter-golf that referenced this pull request Mar 31, 2026