Record: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training#114
saml212 wants to merge 1 commit into openai:main
Conversation
Root cause: quantize_float_tensor used the raw abs().amax() for scale computation. With only 31 int6 levels, outlier weights inflate the scale and collapse most values to zero. Fixed by using percentile clipping (99.99984th quantile), matching the proven approach from PR openai#114. Also fixed the asymmetric range [-32, 31] → symmetric [-31, 31]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
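The fix can be sketched in isolation. Below is a minimal per-tensor version of percentile-clipped symmetric int6 quantization (the names and the per-tensor rather than per-row scale are illustrative, not the repo's actual `quantize_float_tensor`):

```python
import numpy as np

def quantize_int6(w: np.ndarray, pct: float = 99.99984):
    # Derive the scale from a high percentile of |w| rather than the raw
    # max, so a single outlier cannot inflate the scale and collapse the
    # remaining values onto a handful of levels.
    clip = float(np.percentile(np.abs(w), pct))
    scale = max(clip, 1e-12) / 31.0  # symmetric range [-31, 31], not [-32, 31]
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

With a max-based scale, one weight at 100 among roughly N(0, 1) values would force scale = 100/31 ≈ 3.2 and round almost everything to zero; the percentile-based scale stays near the bulk of the distribution and only clips the outlier.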
Thanks for the great submission! Before I officially add it to the leaderboard, mind running again to verify that the 1.1574 result is within noise? 1-2 more runs showing the same result would be great.
- SWA during warmdown: average ~7 checkpoints from the second half of warmdown for flatter-loss-landscape generalization (−0.003 to −0.005 BPB)
- GRAD_CLIP_NORM=0.3: stabilizes train@2048 (from the PR openai#114 sweep)
- TRAIN_BATCH_TOKENS=786432: larger batch for seq2048 (from PR openai#114)
- EVAL_STRIDE=256: faster eval, equal or better BPB than stride=64

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
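The SWA item amounts to a uniform parameter average over the warmdown checkpoints. A generic sketch, with numpy dicts standing in for torch state_dicts (`average_checkpoints` is a hypothetical name, not the repo's utility):

```python
import numpy as np

def average_checkpoints(checkpoints: list) -> dict:
    # Uniform average of parameter dicts, accumulated in float64 to
    # limit precision loss, then cast back to the original dtype.
    n = len(checkpoints)
    avg = {k: np.zeros(v.shape, dtype=np.float64)
           for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            avg[k] += v
    return {k: (a / n).astype(checkpoints[0][k].dtype)
            for k, a in avg.items()}
```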
@0hq Totally fair, happy to rerun! My plane is taking off now, though, and RunPod doesn't currently have an 8xH100 pod available, so I won't get to this until late tonight.
@0hq Verification complete. Reran seeds 1337/1338/1339 on a fresh pod with the official RunPod template (py3.12 + torch 2.9.1+cu128):
Mean matches the original submission exactly. Seed-to-seed variance (std=0.0022) comes from training stochasticity: Muon's distributed gradient round-trip introduces small numerical differences across runs. The original 3 seeds on my dev pod had std=0.0001, which was unusually tight. Note: the first run on a fresh pod was ~100ms/step (cold torch.compile cache); after the cache warms up it stabilizes at ~81.7ms/step. The verification seeds above are all warm-cache runs. Verification logs pushed to the branch.
3-seed verified, diff cleaned up.
val_bpb = 1.1574 — Top-ranked valid submission at time of posting
Baseline: 1.2244. Improvement: -0.067 BPB / -0.120 nats. 15.98MB artifact. 600s training + 240s sliding window eval on 8xH100 SXM.
This is the third PR in a series (see also PR #61 at 1.2154 and PR #96 at 1.1764), each building on systematic experimentation across ~60 runs.
What's Novel
1. Train@2048 matches train@4096 when using sliding window eval. I tested both extensively — training at 2048 gives identical BPB to training at 4096 (1.1764 vs 1.1765) when evaluated with sliding window. The window already provides long context at eval; the model just needs to learn local patterns. Training at 2048 is strictly better because it gets more optimization steps in the 10-minute budget. Other top submissions train at 1024 or 4096 — 2048 is the overlooked sweet spot.
2. GRAD_CLIP_NORM=0.3 stabilizes long-sequence training. I swept from 0.0 to 1.0 and found a narrow optimum at 0.3 for train@2048. The mechanism: longer sequences produce higher gradient variance, and tight clipping (0.3 vs the common 1.0) stabilizes without over-constraining. Full sweep data in experiment log.
3. Batch=786K is optimal for train@2048. Swept from 393K to 1M. The default 524K is suboptimal; 786K balances gradient noise against step count.
4. Selective precision preservation. Two techniques stacked: (a) FP16 tied embedding — the dual-role embedding/output matrix is disproportionately damaged by quantization, discovered via systematic warmdown-quantization analysis in PR #61; (b) Late-K passthrough — last 2 layers' key projections kept in fp16 for sharper late-layer attention.
5. Stride=256 beats stride=64. Counter to other submissions using stride=64, I found stride=256 gives slightly better BPB (1.1574 vs 1.1579) at a quarter of the eval time; smaller strides hit diminishing returns.
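The tight clipping in point 2 is ordinary global-norm gradient clipping, just with an unusually small threshold. A minimal numpy stand-in (generic sketch, not the repo's utility):

```python
import numpy as np

def clip_grad_norm(grads: list, max_norm: float = 0.3):
    # Compute the global L2 norm across all gradient tensors, then
    # rescale every tensor by the same factor if the norm exceeds
    # max_norm, preserving the gradient direction.
    total = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, total
```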
Approach
Int6 post-training quantization (per-row scaling, symmetric ±31 range) makes weight matrices ~25% smaller than int8, freeing artifact budget to triple the MLP hidden dimension (1024 → 1536): 21.8M parameters in 15.98MB.
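A back-of-envelope check of the size claim, ignoring per-row scales and the few tensors kept in fp16:

```python
def artifact_mib(n_params: float, bits: int) -> float:
    # Raw weight footprint: bits per parameter -> bytes -> MiB.
    return n_params * bits / 8 / 2**20

int6_size = artifact_mib(21.8e6, 6)  # ~15.6 MiB, near the 15.98MB artifact
int8_size = artifact_mib(21.8e6, 8)  # ~20.8 MiB
# int6 is 6/8 = 75% of int8, i.e. ~25% smaller.
```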
Training
Evaluation
Sliding window with stride=256 at eval_seq_len=2048. Every scored token sees 1792+ tokens of context. 240s eval time on 8xH100 (well within the separate 10-min eval budget).
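The context guarantee follows from the window/stride arithmetic: with window 2048 and stride 256, each pass scores only the final stride-worth of new tokens, so every scored token after the first window sees at least 2048 − 256 = 1792 tokens of context. A sketch of the span layout (hypothetical helper, not the repo's eval loop):

```python
def sliding_window_spans(n_tokens: int, window: int = 2048, stride: int = 256):
    """Return (window_start, score_from, window_end) triples: each window
    attends from window_start, but only tokens in [score_from, window_end)
    are newly scored, so every token is scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```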
Progression
Full experiment log with ~60 runs, sweep data, and negative results available in the repository.