
12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1433)#76

Open
unixmadtoonslab wants to merge 6 commits into openai:main from unixmadtoonslab:submission/int6-wider-mlp-fp16embed-sliding

Conversation


@unixmadtoonslab unixmadtoonslab commented Mar 19, 2026

Summary

12-layer transformer (dim=512, 8H/4KV GQA) achieving 1.14327 BPB (3-seed mean: 1.14375 / 1.14316 / 1.14289).

Key techniques

  • Mixed int5/int6 quantization: int5 per-row for MLP weights (the three always-zero high bits per byte compress better under zstd-22, ~3.8x), int6 per-row for attention weights, fp16 embedding passthrough
  • 12 layers (funded by ~1MB saved from int5 compression)
  • SmearGate: per-dim sigmoid gate blending token with previous token embedding
  • BigramHash: hash embedding for token-pair context (2048 buckets, dim=96)
  • U-Net skip connections: encoder-decoder split with learned per-dim skip weights
  • Orthogonal init with 1/sqrt(2*num_layers) output projection scaling
  • SWA: checkpoint averaging every 50 steps during warmdown
  • Muon optimizer: LR=0.025, momentum=0.98, WD=0.04
  • Warmdown timing fix: ignores torch.compile overhead in step-time estimation
  • PyTorch 2.8 SDPA: ~66ms/step at seq1024 (12 layers)
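The SmearGate described above (a per-dim sigmoid gate blending each token's embedding with its predecessor's) can be sketched roughly as follows; the class name, gate initialization, and shift handling are my assumptions, not the PR's exact code:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Per-dimension sigmoid gate that blends each token's embedding with
    the previous token's embedding (sketch; init value is an assumption)."""
    def __init__(self, dim: int):
        super().__init__()
        # Initialized negative so the gate starts mostly closed (little smearing).
        self.gate = nn.Parameter(torch.full((dim,), -2.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). Shift right by one token; position 0 has no predecessor.
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0, :] = 0.0
        g = torch.sigmoid(self.gate)  # (dim,), broadcast over batch and seq
        return (1.0 - g) * x + g * prev
```

With the gate near zero this reduces to the identity, so the layer can learn to smear only the dimensions where previous-token context helps.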

Results

| Seed | val_bpb | Artifact size |
| --- | --- | --- |
| 1337 | 1.14375 | 15.80 MB |
| 42 | 1.14316 | 15.77 MB |
| 7 | 1.14289 | 16.01 MB |
| Mean | 1.14327 | |

Config

NUM_LAYERS=12, MODEL_DIM=512, NUM_HEADS=8, NUM_KV_HEADS=4, MLP_MULT=3
TRAIN_SEQ_LEN=1024, EVAL_SEQ_LEN=1024, TRAIN_BATCH_TOKENS=524288
MATRIX_LR=0.025, SCALAR_LR=0.025, TIED_EMBED_LR=0.035
MUON_MOMENTUM=0.98, MUON_WD=0.04, ADAM_WD=0.0
WARMDOWN_ITERS=2000, SWA_INTERVAL=50
BIGRAM_VOCAB_SIZE=2048, BIGRAM_DIM=96
MIXED_INT5_ENABLED=1
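The BigramHash component corresponding to `BIGRAM_VOCAB_SIZE=2048` and `BIGRAM_DIM=96` can be sketched as a hashed embedding over (previous token, token) pairs; the multiplicative hash constant and the projection back to model_dim are assumptions for illustration:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash embedding for (prev_token, token) pairs, projected into the
    residual stream (sketch; hash function and projection are assumptions)."""
    def __init__(self, n_buckets: int = 2048, bigram_dim: int = 96,
                 model_dim: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, bigram_dim)
        self.proj = nn.Linear(bigram_dim, model_dim, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) token ids. Pair each token with its predecessor.
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0  # first position has no predecessor
        # Cheap multiplicative hash into a fixed number of buckets.
        h = (prev * 1000003 + ids) % self.n_buckets
        return self.proj(self.emb(h))  # (batch, seq, model_dim)
```

Collisions are tolerated by design: 2048 buckets cannot distinguish all token pairs, but a small learned correction for frequent bigrams is cheap in parameters.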

Test plan

  • 3-seed validation (seeds 1337, 42, 7)
  • All artifacts under the 16 MiB limit
  • Quantized roundtrip validation

Submission by Will DePue

@unixmadtoonslab unixmadtoonslab changed the title [WIP] Int6 + Wider MLP 3x + FP16 Embed + Sliding Window (est. val_bpb ~1.160) Int6 QAT + Wider MLP 3x + FP16 Embed + Sliding Window (val_bpb 1.1599) Mar 20, 2026
@unixmadtoonslab
Author

This submission is ready for review. Merge conflicts have been resolved.

val_bpb: 1.1599 (3-seed mean on 8xH100 SXM, 10 min)

The 'not ready for review' label is stale — I don't have permissions to remove it. This is a complete submission with validated results.

@unixmadtoonslab
Author

Compute Credits Feedback

I want to flag a significant issue with the compute credit program for this challenge.

OpenAI advertises $1,000,000 in compute credits to help participants. I applied for credits through the official form requesting $500 — a modest amount for a competition that requires 8xH100 SXM pods at ~$20/hr. I received $25.

$25 buys roughly one single training run on 8xH100s. That's not enough to even validate a baseline, let alone iterate on ideas. For context, I've spent over $300 out of my own pocket just to develop and test the techniques in this submission (QAT scheduling, int6 quantization, ablation studies across dozens of runs).

I understand compute is expensive and there are many participants, but $25 out of a $1M pool feels disconnected from the stated goal of helping people "get started training their models." A single 8xH100 validation run costs more than the entire grant.

Is this the level of support OpenAI intended for active participants who are pushing the leaderboard? I'd appreciate clarity on the credit allocation process, or an increase that would allow meaningful experimentation.

unixmadtoonslab and others added 3 commits March 20, 2026 10:10
… ~1.160)

Four orthogonal improvements stacked: int6 mixed-precision quantization on
MLP+attention weights with zstd-22 compression, 3x MLP expansion, fp16 tied
embedding passthrough, and sliding window evaluation. Awaiting 8xH100 SXM
compute credits for official run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Key improvements over baseline:
- Delayed QAT: STE fake-quantization only in last 15% of training time,
  allowing model to train at full precision before adapting to quantization
- Symmetric int6 clip range [-31, 31] instead of asymmetric [-32, 31]
- Wider MLP (3x), tuned LR=0.025, momentum=0.99 with 1500-step warmup
- Sliding window eval with stride=64 for better BPB measurement
- fp16 embedding passthrough (tok_emb kept unquantized)

3-seed validation (seeds 1337, 42, 7):
  1.15924, 1.15980, 1.16066 → mean 1.15990 BPB

Beats the current #1 (PR openai#88) at 1.1605 BPB.
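The delayed QAT described in this commit (STE fake-quantization only in the last 15% of training, symmetric int6 clip [-31, 31]) can be sketched as follows; the function names and scheduling helper are assumptions:

```python
import torch

def fake_quant_ste(w: torch.Tensor, qmax: int = 31) -> torch.Tensor:
    """Symmetric fake quantization with a straight-through estimator:
    the forward pass sees quantized weights, the backward pass passes
    gradients through unchanged. qmax=31 gives the symmetric clip [-31, 31]."""
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    wq = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # STE: detach the quantization residual so grad(output) == grad(w).
    return w + (wq - w).detach()

def maybe_quantize(w: torch.Tensor, step: int, total_steps: int,
                   qat_frac: float = 0.15) -> torch.Tensor:
    """Apply fake quantization only in the final qat_frac of training,
    letting the model train at full precision first."""
    if step >= int(total_steps * (1.0 - qat_frac)):
        return fake_quant_ste(w)
    return w
```

The symmetric range avoids the asymmetric [-32, 31] bias of naive two's-complement clipping, so positive and negative weights are treated identically.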
@unixmadtoonslab unixmadtoonslab force-pushed the submission/int6-wider-mlp-fp16embed-sliding branch from f9ab40a to 0d64b28 Compare March 20, 2026 10:11
@unixmadtoonslab unixmadtoonslab changed the title Int6 QAT + Wider MLP 3x + FP16 Embed + Sliding Window (val_bpb 1.1599) Int6 QAT + Wider MLP 3x + FP16 Embed + Sliding Window + SmearGate + BigramHash + SWA + OrthoInit (val_bpb 1.1568) Mar 20, 2026
@unixmadtoonslab unixmadtoonslab changed the title Int6 QAT + Wider MLP 3x + FP16 Embed + Sliding Window + SmearGate + BigramHash + SWA + OrthoInit (val_bpb 1.1568) Int6 11L + SmearGate + BigramHash + SWA + OrthoInit + MuonWD (val_bpb 1.1555) Mar 20, 2026
graalolwest and others added 2 commits March 20, 2026 15:12
…BPB)

Major improvements over v6 baseline (1.1599 -> 1.1555 BPB):
- 11 layers with orthogonal init (1/sqrt(2*N) output scaling)
- SmearGate: blend token embeddings with previous token via learned gate
- BigramHash: 4096-bucket hash embedding for token-pair context
- Stochastic Weight Averaging during warmdown (interval=100)
- Separate Muon/Adam weight decay (muon_wd=0.04, adam_wd=0.0)
- FA3/SDPA dual attention path with NTK-aware RoPE
- GQA support (8 heads, 4 KV heads)
- QAT fraction configurable (disabled by default - fixes STE bug)
- Higher LR (0.03) with lower momentum (0.97)
- All hyperparameters configurable via environment variables

3-seed validation: 1.15520, 1.15492, 1.15649 (mean 1.15554)
Artifact size: ~14.5MB (1.5MB headroom under 16MB limit)
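The stochastic weight averaging during warmdown mentioned above can be sketched as a running equal-weight mean of the parameters, updated every `interval` steps (interval=100 in this commit, 50 in the final config); the class name and update hook are assumptions:

```python
import torch
import torch.nn as nn

class SWAAverager:
    """Running equal-weight average of model parameters, updated every
    `interval` steps during warmdown (sketch of the SWA scheme above)."""
    def __init__(self, model: nn.Module, interval: int = 100):
        self.interval = interval
        self.n = 0
        self.avg = {k: v.detach().clone() for k, v in model.state_dict().items()}

    def maybe_update(self, model: nn.Module, step: int):
        if step % self.interval != 0:
            return
        self.n += 1
        for k, v in model.state_dict().items():
            # Running mean: avg += (v - avg) / n
            self.avg[k] += (v.detach() - self.avg[k]) / self.n
```

At the end of training, `self.avg` is loaded back into the model for evaluation; averaging late checkpoints tends to land in a flatter, better-generalizing region of the loss surface.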
…B, 3-seed mean)

Key improvements over v8d (1.1555):
- 12 layers (was 11) funded by int5 compression savings
- Mixed int5/int6 quantization: int5 for MLP weights (better zstd ratio)
- LR=0.025, momentum=0.97, warmdown=2000, SWA/50
- Warmdown fix: ignore torch.compile overhead in step timing
- PyTorch 2.8 SDPA: ~59ms/step at seq1024 (was 81ms on 2.4)

3-seed validation: 1.14618 / 1.14768 / 1.14641 = mean 1.14676

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
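The warmdown fix (ignore torch.compile overhead in step timing) can be sketched as a step-time estimator that discards the first few steps, whose wall-clock times are inflated by compilation; `skip_steps` and the use of a median are my assumptions:

```python
import time

class StepTimer:
    """Estimates steady-state step time while skipping the first few steps,
    which include torch.compile warmup (sketch of the warmdown timing fix)."""
    def __init__(self, skip_steps: int = 10):
        self.skip_steps = skip_steps
        self.times = []
        self.step = 0
        self.last = None

    def tick(self, now=None):
        """Call once per training step; `now` overrides the clock for testing."""
        if now is None:
            now = time.perf_counter()
        if self.last is not None:
            self.step += 1
            if self.step > self.skip_steps:  # drop compile-inflated steps
                self.times.append(now - self.last)
        self.last = now

    def avg_step_time(self):
        """Median of recorded step times (robust to occasional stragglers)."""
        if not self.times:
            return None
        return sorted(self.times)[len(self.times) // 2]
```

Without this, a multi-second first compiled step skews the estimate and the warmdown schedule kicks in too early, cutting the learning rate before the time budget requires it.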
@unixmadtoonslab unixmadtoonslab changed the title Int6 11L + SmearGate + BigramHash + SWA + OrthoInit + MuonWD (val_bpb 1.1555) 12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1468) Mar 20, 2026
Major improvement from momentum 0.97→0.98 and reduced bigram to fit 16MB.
3-seed: 1.14375 / 1.14316 / 1.14289 = mean 1.14327 BPB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@unixmadtoonslab unixmadtoonslab changed the title 12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1468) 12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1433) Mar 21, 2026
@cocohearts
Collaborator

@unixmadtoonslab we'll be handing out higher compute credit grants shortly

@cocohearts
Collaborator

@unixmadtoonslab pls clean ur diff to only include sota submission

chrislovescoding added a commit to chrislovescoding/parameter-golf that referenced this pull request Mar 25, 2026
Int6 per-row quantization (QUANT_RANGE=31) + zstd-22 compression
fits MLP 3x in 16MB. seq1024 for max steps (~12K on 8xH100).
Sliding window stride=64. Muon 0.99, LR=0.02, warmdown=3000.
FP16 embedding. No QAT (overhead not worth it per PR openai#76).
Targets ~1.16 BPB matching top submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>