Record: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean 3 seeds) #180
Conversation
Pull request overview
Adds a new 10-minute / 16MB track record entry under records/track_10min_16mb/ documenting a 10-layer model that uses mixed int5 (MLP) + int6 (attention) post-training quantization, along with Muon weight decay tuning and SWA.
Changes:
- Add the record's training script (train_gpt.py) implementing the mixed int5/int6 quantization and SWA settings used for the run.
- Add run artifacts/documentation: training log and README describing the approach and results.
- Add a submission.json metadata file for record tracking.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/train_gpt.py | Record training/export script including mixed int5/int6 quantization + SWA logic |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/train_seed1337.log | Training log for the reported run/score |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/README.md | Record write-up and reproduction guidance |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/submission.json | Record metadata for downstream indexing/leaderboard tooling |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab9ecb2f69
…/submission.json
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- MUON_WD: decoupled weight decay for the Muon optimizer (0.04 = SOTA); applies p.data.mul_(1 - lr * wd) before the gradient update
- SWA_EVERY: Stochastic Weight Averaging every N steps (50 = SOTA); accumulates a running average of model weights and applies it at the end
- Both controlled via env vars, disabled by default (0)
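A minimal sketch of the two mechanisms, assuming the obvious implementation (a plain scalar stands in for a parameter tensor; the real script does this on torch tensors inside the Muon step):

```python
import os

# Both techniques are env-var-controlled and off by default, as in the commit.
MUON_WD = float(os.environ.get("MUON_WD", "0"))    # 0.04 in the record
SWA_EVERY = int(os.environ.get("SWA_EVERY", "0"))  # 50 in the record

def muon_step(w, grad, lr, wd=MUON_WD):
    """Decoupled weight decay: shrink the weight BEFORE the gradient update."""
    w = w * (1 - lr * wd)   # p.data.mul_(1 - lr * wd)
    return w - lr * grad    # then the (orthogonalized) gradient update

class SWA:
    """Running average of model weights, applied once at the end of training."""
    def __init__(self):
        self.avg, self.n = 0.0, 0

    def maybe_accumulate(self, step, w, every=SWA_EVERY):
        if every and step % every == 0:
            self.n += 1
            self.avg += (w - self.avg) / self.n  # incremental mean
```

With the record settings (MUON_WD=0.04, SWA_EVERY=50), each step shrinks weights slightly before the update, and every 50th checkpoint feeds the running average that replaces the weights at the end.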
…hnique)
- MLP weights use int5 [-16,15]: 3 zero high bits per byte → zstd 1.88x
- Attention weights keep int6 [-32,31]: zstd 1.51x
- Saves ~1.86MB of artifact → funds the 10th transformer layer
- Dequantize auto-detects the scheme via qmeta (int5/int6/int8)
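A toy sketch of the scheme on plain Python lists (the real script quantizes torch weight tensors and stores qmeta next to each one; the helper names here are illustrative):

```python
def quantize(vals, bits):
    """Symmetric quantization to signed `bits`-wide integers.

    int5 -> [-16, 15], int6 -> [-32, 31]. Returns (q, qmeta); qmeta records
    the scheme so dequantize can auto-detect int5/int6/int8 per tensor.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in vals) / qmax) or 1.0  # avoid /0 on all-zero
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in vals]
    return q, {"bits": bits, "scale": scale}

def dequantize(q, qmeta):
    # The scheme is implied by qmeta["bits"]; no per-tensor flags needed.
    return [v * qmeta["scale"] for v in q]
```

Values quantized to int5 occupy only the low 5 bits of each stored byte, which is what gives zstd the extra compression headroom over int6.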
Updated results: val_bpb = 1.1428 (mean of 3 seeds), a significant improvement over the original submission (1.1453).
Key Changes
3-Seed Reproduction (eval_stride=64)
How to run:
bash prepare.sh
bash eval/eval.sh > run.log 2>&1
grep "^val_bpb:\|^valid:" run.log
To run with a specific seed: SEED=42 bash eval/eval.sh > run.log 2>&1
Full Recipe
…0.04
Key improvements over the original 1.1453:
- bigram_vocab_size: 4096 → 10240 (fewer hash collisions)
- SWA_start_frac: 0.5 → 0.4 (more converged checkpoints)
- warmdown: 4000 → 3000 (more full-LR training)
- weight_decay: 0.04 global (both Muon and AdamW)
3-seed results: 1.14271, 1.14298, 1.14260 (mean=1.14276, std=0.00016)
All params set as defaults in train_gpt.py. Run: bash eval/eval.sh
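The bigram_vocab_size change is just the bucket count of a hashed bigram embedding table. A sketch of the idea (the mixing constants are made up; only the bucketing behavior matches the record):

```python
def bigram_bucket(prev_tok, cur_tok, vocab=10240):
    """Hash the (previous, current) token pair into one of `vocab` buckets
    of a small embedding table. A larger `vocab` means fewer distinct
    bigrams colliding in the same bucket."""
    h = (prev_tok * 0x9E3779B1 ^ cur_tok * 0x85EBCA77) & 0xFFFFFFFF
    return h % vocab
```

Going from 4096 to 10240 buckets trades a bigger embedding table for fewer collisions, which is where the quality gain comes from.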
…q ramp
Adapted from PR #180 SOTA (1.1428 BPB):
- INT5 quantization for MLP weights (int6 for attention) — saves ~1.86MB
- zstd-22 compression instead of zlib — better ratio on sparse int5 data
- 3% magnitude pruning before quantization — zeros compress well
- Sequence length ramp: start at 256, ramp to full at 25% of training
- QAT updated to fake-quantize int5 for MLP, int6 for rest
New env vars: INT5_MLP, USE_ZSTD, ZSTD_LEVEL, PRUNE_PCT,
SEQ_RAMP_START, SEQ_RAMP_FRAC
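A sketch of the sequence-length ramp under stated assumptions (the commit fixes only start=256 and the 25% ramp fraction; the full length of 2048 and the multiple-of-128 rounding are illustrative):

```python
def seq_len_at(step, total_steps, start=256, full=2048, ramp_frac=0.25):
    """Linearly ramp the training sequence length from `start` to `full`,
    reaching full length at ramp_frac of training."""
    frac = min(1.0, step / (ramp_frac * total_steps))
    length = int(start + frac * (full - start))
    return max(start, length - length % 128)  # keep block-friendly lengths
```

Short early sequences make early steps cheap, so more optimizer steps fit into the fixed 10-minute budget.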
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…its) Combines techniques from PR openai#162, openai#180, openai#267, openai#281:
- 11-layer GPT with U-Net skip connections, GQA
- SmearGate + BigramHash(10240)
- Mixed int5/int6 quantization + 3% magnitude pruning
- Causal TTT at eval time
- SWA(frac=0.4), WD=0.042, Z-loss
Target: sub-1.135 val_bpb. Awaiting RunPod 8xH100 credits for 3-seed validation.
PR #180 base (10L, BigramHash 10240, SWA, WD 0.04, 1.1428 BPB) with progressive MLP: skinny-early fat-late layer allocation. Includes 1-GPU A/B test script (uniform 3.0x vs progressive 1.5-4.5x). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from SDPA (B,H,T,D) to Flash Attention 3 (B,T,H,D):
- Import flash_attn_interface
- Rotary cache shape [None,:,None,:] for the BTHD layout
- CausalSelfAttention: drop the transpose, use flash_attn_3_func
- q_gain broadcast adjusted for BTHD
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three-way 1xGPU signal test:
- A: Uniform MLP 3.0× (exact #180 control)
- B: Progressive MLP 1.5→4.5× (same total params)
- C: Progressive MLP 1.5→4.5× + BigramHash 16384/192
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
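Variants B and C allocate MLP width progressively across depth. A sketch, assuming linear interpolation (the commit states only the 1.5→4.5× endpoints and that total params match the uniform 3.0× control):

```python
def mlp_mults(n_layers=10, lo=1.5, hi=4.5):
    """Per-layer MLP expansion factors: skinny early, fat late.
    Linear interpolation keeps the mean at 3.0x, matching variant A's
    parameter budget."""
    return [lo + (hi - lo) * i / (n_layers - 1) for i in range(n_layers)]
```

Equal total params is the point of the A/B design: any delta isolates the effect of where the width goes, not how much there is.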
Combined @thwu1's SOTA (PR openai#180): 10L Int5-MLP BigramHash(10240) SWA with our novel contributions: adaptive softcap (base=20), RoPE 500k, QK 3.0. Result: val_bpb=1.1511, 15.9MB. Would rank ~3rd on official leaderboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… BPB) Replaces the openai#180 base with openai#315's full stack:
- Partial RoPE (16/64 head dims)
- LN Scale (1/sqrt(layer+1) depth dampening)
- XSA on last N layers (efficient GQA-aware, no repeat_interleave)
- EMA weight averaging (decay=0.997)
- FA3 (flash_attn_3_func)
- Late QAT support (note: inactive under torch.compile)
- NTK-aware RoPE for length extrapolation
Attention reuse (ATTN_REUSE=1) carried forward: decoder half layers drop their Q,K projections and reuse them from the last encoder layer, saving ~1.3MB compressed for extra capacity.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ude pruning PR openai#180 is the current merged SOTA at 1.1428 BPB (no TTT needed). Key innovations: mixed int5/int6 quantization saves ~1.86MB for extra layer, BigramHash(10240) for bigram context, 3% magnitude pruning, orthogonal init with muP, Muon WD=0.04, and zstd-22 compression. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
… 3 seeds) AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
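The per-layer lr groups can be sketched as plain optimizer param-group dicts (the substring matches like `mlp.c_proj` / `mlp.c_fc` are assumed names, not taken from the submission):

```python
def ttt_param_groups(named_params, base_lr):
    """Build per-layer lr groups for TTT: 3x lr for MLP output projections,
    0.5x for MLP input projections, 1x for everything else."""
    mults = {"mlp.c_proj": 3.0, "mlp.c_fc": 0.5}
    groups = {}
    for name, p in named_params:
        m = next((v for k, v in mults.items() if k in name), 1.0)
        groups.setdefault(m, []).append(p)
    # Same shape of dicts that torch.optim.AdamW accepts as param groups.
    return [{"params": ps, "lr": base_lr * m} for m, ps in groups.items()]
```

The returned list would be passed straight to the optimizer constructor in place of a flat parameter list.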
…1309) 3-seed validation results:
- Seed 42: val_bpb=1.13109, artifact=15,764,564 bytes
- Seed 1337: val_bpb=1.13085, artifact=15,626,741 bytes
- Seed 2024: val_bpb=1.13067, artifact=15,923,256 bytes
- Mean: 1.13087 (std: 0.00017)
Key techniques: 11 layers, GQA (8H/4KV), XSA on last 4 layers, LeakyReLU(0.5)², Partial RoPE (16/64), EMA (0.997), int6 quantization, zstd-22 compression, BigramHash(2048,128), warmdown_iters=4500.
Built on the baseline by @thwu1 (PR openai#180).
… slow TRSM on H100
Non-record: MUD optimizer (arxiv:2603.17970). Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram preconditioning. Single seed (42) on 8xH100 SXM.
Results:
- val_bpb: 1.1989 (sliding window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118ms (4.5x slower than Muon's ~26ms on H100)
Key finding: strong convergence (within 0.056 BPB of SOTA with 4x fewer steps), but TRSM overhead on H100 CUDA negates the 12x FLOP savings reported in the paper (tested on A100/MI250/GH200).
Built on SOTA by @thwu1 (PR openai#180). Paper: https://arxiv.org/abs/2603.17970
…0.04 Reproduce openai/parameter-golf PR openai#180 (val_bpb 1.14276, 3-seed mean). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace every other MLP (layers 0,2,4,6,8) with BigLU: an MLP whose hidden state is gated by a per-layer bigram embedding (vocab=2048, dim=hidden, expansion scale=1). Reduce mlp_mult 3.0→1.5 (hidden 1536→768) so total MLP params stay identical to PR openai#180 (15.73M).
- Muon for up/down weights; AdamW for bigram embed tables (like the main bigram)
- bigram.embed excluded from matrix_params to avoid Muon
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rt_bpb: 1.14594585 (seed 1337, 8xH200, 10-min wall-clock, 7306 steps) Built on thwu1's PR openai#180. Only 2 lines changed: BigramHash defaults updated from vocab=10240 dim=128 to vocab=16384 dim=64. Same ~1MB embedding budget, wider vocabulary captures more distinct bigram hash buckets. Proxy sweeps confirmed wider vocab > higher dim at equal parameter cost. Artifact: 15,923,771 bytes zstd-22. Single seed, additional seeds pending.
Non-record research submission. 2x2 factorial ablation of QAT x SWA interaction on PR openai#180 stack (10L/512d/MLP3x). Key finding: SWA and QAT are antagonistic. QAT alone (1.14018, 3-seed mean) beats SWA alone (1.14382) by 3.64 mBPB. Combining them is worse than either alone. This explains why prior QAT entries underperformed non-QAT submissions in the competition. 3-seed validation (seeds 42, 1337, 2024), artifact under 16MB limit.
10L Int5-MLP + MuonWD=0.04 + SWA/50 + SmearGate + BigramHash
val_bpb: 1.14526 (sliding window stride=64, post int6+zstd roundtrip)
Key Innovation: Mixed Int5/Int6 Quantization
Use int5 [-16,15] for MLP weights and int6 [-32,31] for attention. Stored one value per byte, int5 leaves the top 3 bits zero, so zstd-22 compresses it at 1.88x vs int6's 1.51x, saving ~1.86MB. This funds the 10th transformer layer while staying under 16MB.
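The compressibility claim can be sanity-checked with stdlib zlib as a stand-in for zstd-22, assuming offset-binary storage (one quantized value per byte, so int5 touches only the low 5 bits). Uniform random data understates the real ratios, since trained weights cluster near zero:

```python
import random
import zlib

random.seed(0)
n = 1 << 16  # 64 KiB of fake quantized weights
int6 = bytes(random.randrange(64) for _ in range(n))  # full 6-bit range
int5 = bytes(random.randrange(32) for _ in range(n))  # top 3 bits always 0

ratio6 = n / len(zlib.compress(int6, 9))
ratio5 = n / len(zlib.compress(int5, 9))
# Fewer used bits per byte -> better ratio: the effect the record exploits.
assert ratio5 > ratio6 > 1.0
```

On the record's peaked weight distributions the gap widens further, which is where the reported 1.88x vs 1.51x comes from.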
Technique Stack
Metrics
Run Command
Ablation
Built on PR #162 by @unnir.