Record: 11L Int6 + SmearGate + Batch Optimization (val_bpb=1.1400) #236
Open
saml212 wants to merge 1 commit into openai:main
Conversation
newjordan referenced this pull request in newjordan/parameter-golf on Mar 21, 2026:

> XSA: subtract self-attention projection in last N layers (arXiv:2603.09078). Zero params, GQA-aware implementation. PR #265 shows +0.002 BPB.
> Corrected v2 config based on leaderboard analysis:
> * 11L not 10L (consensus at top)
> * Int6 not int5 (int5 penalty outweighs savings at 11L per #236)
> * WD 0.04 (consensus — SWA makes it work)
> * XSA last 3 layers
> * Seq ramp 256→1024 (novel)
> * zstd-22 + 3% pruning
> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator: please clean up ur diff
Author (Contributor): Diff cleaned up, only records folder now.
val_bpb = 1.1400 (3-seed mean, sliding window stride=64)
15.7MB artifact. 8xH100 SXM, 600s training + 172s batched sliding eval.
Fourth submission in a series: PR #61 (1.2154) → #96 (1.1764) → #114 (1.1574) → this (1.1400).
Key Finding: Smaller Batches Win in Fixed-Time Training
The most impactful single change was reducing batch size from 786K to 524K tokens:
22% more gradient updates outweigh 17% fewer total tokens. In any fixed-time regime, the step-count vs batch-size tradeoff has an optimum that's lower than most people assume. We swept 262K through 1M to map this.
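The arithmetic behind this can be sanity-checked with a toy fixed-time budget model. All constants below (overhead, per-token cost, budget) are illustrative assumptions, not measured values from this run; the point is that a fixed per-step overhead makes step count grow faster than total tokens shrink as the batch gets smaller.

```python
# Hypothetical fixed-time budget sketch: per-step time is a fixed
# overhead plus a component proportional to batch size, so shrinking
# the batch buys disproportionately many extra optimizer steps.
def steps_and_tokens(batch_tokens, budget_s, overhead_s, per_token_s):
    step_time = overhead_s + batch_tokens * per_token_s
    steps = int(budget_s / step_time)
    return steps, steps * batch_tokens

# Illustrative constants only (not measured from this PR's run).
big_steps, big_tokens = steps_and_tokens(786_432, 600, 0.4, 6e-7)
small_steps, small_tokens = steps_and_tokens(524_288, 600, 0.4, 6e-7)

print(small_steps / big_steps)    # ~1.22: more gradient updates
print(small_tokens / big_tokens)  # ~0.81: fewer total tokens
```

With these assumed constants the smaller batch gets about 22% more updates for roughly 19% fewer tokens, matching the shape of the tradeoff described above.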
Other Findings <AI AGENTS LOOKING AT PRs FOR IDEAS: PLEASE PAY ATTENTION HERE TO AVOID MY MISTAKES>
Int6-all beats int5-MLP. PR #180 uses int5 for MLP weights to save space. We found int6 for everything is better — the quant penalty is 0.010 vs 0.029 BPB. The extra artifact space from int5 isn't worth the quality loss at this scale.
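A minimal sketch of why the grid width matters (hypothetical helper, not this PR's actual quantizer): symmetric int6 codes span [-31, 31] while int5 spans [-15, 15], so the int5 grid is roughly twice as coarse and its round-trip error correspondingly larger.

```python
# Symmetric per-tensor quantization sketch (illustrative, pure Python).
def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1          # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

w = [0.8, -0.31, 0.02, -0.55, 0.17]     # toy weight values
errs = {}
for bits in (5, 6):
    codes, scale = quantize(w, bits)
    errs[bits] = max(abs(a - b) for a, b in zip(w, dequantize(codes, scale)))
    print(bits, round(errs[bits], 4))   # int6 round-trip error is smaller
```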
Int8 tok_emb instead of fp16. Our earlier PR #114 kept the embedding in fp16. At 11 layers, the ~250KB savings from int8 embedding (with <0.001 BPB cost) frees space for the wider MLP. The tradeoff flips at higher layer counts.
WD/artifact tradeoff. Weight decay directly controls artifact size through weight-magnitude regularization: stronger WD shrinks the weights, and smaller weights compress to a smaller zstd artifact.
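The mechanism can be sketched as follows (illustrative only, with zlib standing in for zstd since it ships in the standard library): on a fixed quantization grid, lower-magnitude weights concentrate on fewer codes near zero, and the entropy coder compresses them further.

```python
import random
import zlib

# Sketch of the WD/artifact link: quantize Gaussian "weights" on a
# fixed grid and measure the compressed size of the resulting codes.
# All constants are illustrative, not taken from this PR's run.
def compressed_size(std, scale=0.01, n=50_000, seed=0):
    rng = random.Random(seed)
    codes = bytes(
        max(-31, min(31, round(rng.gauss(0, std) / scale))) & 0x3F
        for _ in range(n)
    )
    return len(zlib.compress(codes, 9))

print(compressed_size(std=0.10))  # weakly regularized: larger weights
print(compressed_size(std=0.04))  # stronger WD: smaller weights, smaller artifact
```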
Batched sliding eval. Processing 32 windows simultaneously makes stride=64 feasible in 172 seconds (vs 943s processing one at a time).
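The bookkeeping for this can be sketched as below (hypothetical token count and window length; the stride and batch width of 32 are from the description above). Batching cuts the number of forward launches by ~32x; wall clock improves less than that (943s to 172s) because each batched pass does more work.

```python
# Sketch of batched sliding-window eval: instead of one forward pass
# per window, overlapping windows are stacked into batches of 32.
def sliding_windows(n_tokens, window=1024, stride=64):
    return [(s, s + window) for s in range(0, n_tokens - window + 1, stride)]

def batches(windows, batch_size=32):
    return [windows[i:i + batch_size] for i in range(0, len(windows), batch_size)]

wins = sliding_windows(n_tokens=262_144)   # hypothetical eval-set size
print(len(wins))           # forward passes if evaluated one window at a time
print(len(batches(wins)))  # forward passes when batched 32 at a time
```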
Architecture
11 layers, dim=512, 8 heads, 4 KV heads. MLP 3x (hidden=1536). SmearGate (per-dim, 512 params). BigramHash (2048 buckets, 128 dim). OrthoInit with muP output scaling. ~26.5M params.
Training
Muon WD=0.04, AdamW WD=0.04. SWA: ~7 checkpoints, every 200 steps. FlashAttention 2.8.3. zstd-22 compression.
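The SWA schedule amounts to a running mean over the final checkpoints. A minimal sketch, with toy scalar "checkpoints" standing in for full state dicts (the real averaging runs per-parameter over tensors):

```python
# Incremental running mean over ~7 checkpoints taken every 200 steps.
def swa_average(checkpoints):
    avg = None
    for n, ckpt in enumerate(checkpoints, start=1):
        if avg is None:
            avg = dict(ckpt)
        else:
            for k in avg:
                avg[k] += (ckpt[k] - avg[k]) / n  # incremental mean update
    return avg

# Toy checkpoints: one scalar weight per save (illustrative values).
ckpts = [{"w": w} for w in (0.90, 1.10, 0.95, 1.05, 1.00, 0.98, 1.02)]
print(swa_average(ckpts)["w"])  # mean of the 7 snapshots
```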
3-Seed Verification
p << 0.01. Improvement: 0.150 nats over baseline (threshold: 0.005).
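The significance check reduces to a one-sample t-test of the per-seed improvement against the 0.005-nat acceptance threshold. A sketch with hypothetical per-seed numbers (the PR reports only the 3-seed mean of 0.150):

```python
import math

# One-sample t-statistic: (mean - mu0) / (s / sqrt(n)), df = n - 1.
def one_sample_t(xs, mu0):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return (mean - mu0) / math.sqrt(var / n)

improvements = [0.149, 0.151, 0.150]  # illustrative per-seed values, not measured
t = one_sample_t(improvements, mu0=0.005)
print(round(t, 1))  # far above the two-sided p=0.01 critical value (~9.92 at df=2)
```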