Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)#162
Conversation
… (mean val_bpb=1.1483, 3 seeds)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 14cdf6f7a4
```python
).reshape(bsz, seq_len)
for i, ws in enumerate(batch_ws):
    wlen = wlens[i]
    s = 0 if ws == 0 else max(wlen - stride, 0)
```
Avoid double-counting tail tokens in sliding eval
The sliding-window scorer can count the same validation tokens more than once near the end of the corpus. With `s = 0 if ws == 0 else max(wlen - stride, 0)`, any non-first window where `wlen < stride` scores the entire short window, including tokens that were already scored by the previous window (e.g., `seq_len=8`, `stride=4` double-scores the last two tokens). This biases the reported val_loss/val_bpb, which can skew experiment comparisons.
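A minimal sketch of one way to close the overlap, assuming the window layout implied by the snippet (the end-aligned final window and the `prev_end` bookkeeping are illustrative, not the script's actual code): track the absolute end of the already-scored region and start each window's scoring there.

```python
# Score each validation token exactly once: start at the first
# not-yet-scored position instead of the stride-based offset.
seq_len, stride, total = 8, 4, 10
starts = list(range(0, max(total - seq_len, 0) + 1, stride))
if starts[-1] + seq_len < total:     # end-aligned final (possibly short) window
    starts.append(total - seq_len)

scored, prev_end = [], 0
for ws in starts:
    wlen = min(seq_len, total - ws)
    # replaces: s = 0 if ws == 0 else max(wlen - stride, 0)
    s = max(prev_end - ws, 0)
    scored.extend(range(ws + s, ws + wlen))
    prev_end = ws + wlen

assert scored == list(range(total))  # no token double-counted
```

With the buggy rule the second window would rescore tokens 6 and 7; the `prev_end`-based offset makes the scored set exactly one copy of the corpus regardless of how `stride` divides it.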
…SWA — improved config (Muon WD=0.04, SWA every 50), mean val_bpb=1.1458
Updated submission — improved configuration after systematic hyperparameter sweeps:
New results (3 seeds):
Previous submission mean was 1.1483 → now 1.1458, an improvement of 0.0025 bpb from tuning WD and SWA frequency alone.
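As a rough sketch of the "SWA every 50" setting (the class name and interface below are hypothetical, not the script's actual code), stochastic weight averaging keeps a running equal-weight mean of parameter snapshots taken every N steps:

```python
import torch


class SWAAverager:
    """Running equal-weight average of parameter snapshots (hypothetical API)."""

    def __init__(self, every: int = 50):
        self.every, self.n, self.avg = every, 0, None

    @torch.no_grad()
    def maybe_update(self, step: int, params) -> None:
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [p.detach().clone() for p in params]
        else:
            for a, p in zip(self.avg, params):
                a.add_((p - a) / self.n)  # incremental mean update
```

Evaluating with `avg` instead of the live weights smooths late-training noise; the tuning here is the snapshot interval (every 50 steps).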
- SmearGate: learned per-dim gate blending x[t] with x[t-1] (~512 params); `USE_SMEAR_GATE=1` to enable
- BigramHash: hash(tok[t-1], tok[t]) -> 4096-bucket embed(128) -> proj(512); `USE_BIGRAM_HASH=1` to enable (~524K params)
- Both disabled by default for backward compatibility
- forward_with_adapter refactored to reuse _forward_body
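A minimal SmearGate sketch, assuming the blend form `(1 - g) * x[t] + g * x[t-1]` and a near-zero gate initialization (both assumptions; the commit only specifies a learned per-dim gate of ~512 params):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmearGate(nn.Module):
    """Learned per-dim gate blending x[t] with x[t-1] (~dim params)."""

    def __init__(self, dim: int):
        super().__init__()
        # One gate logit per channel; negative init keeps the module
        # close to the identity at the start of training (assumption).
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); previous token, zero-padded at t=0
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
        g = torch.sigmoid(self.gate)
        return (1.0 - g) * x + g * prev
```

At dim=512 this is exactly 512 parameters, matching the stated budget.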
- STE QAT: fake quantize->dequantize in CastedLinear forward pass; gradients pass through via STE (`w + (w_hat - w).detach()`)
- Activates after STE_QAT_START_FRAC of training (default 25%); `USE_STE_QAT=1` to enable
- forward_with_adapter refactored to reuse _forward_body
- All Tier 2 features are env-var controlled, disabled by default
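The STE trick described in this commit can be sketched as follows (symmetric per-tensor scaling is an illustrative assumption; the script's actual quantization scheme may differ):

```python
import torch


def ste_fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Fake quantize->dequantize with a straight-through estimator.

    The forward pass sees the quantized weights, while the backward pass
    treats rounding as the identity: the gradient of
    w + (w_hat - w).detach() with respect to w is exactly 1.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_hat = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_hat - w).detach()
```

Training through the quantizer this way lets the weights adapt to the int6 grid before export, instead of paying the rounding error only at checkpoint time.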
…1.3260 BPB

Key improvements to train_exp.py:
- BigramHash: XOR hash with coprime multipliers, 128-dim, zero-init, learned scale (matching PR openai#162)
- SmearGate: single gate after embed+RMSNorm (not per-block), fixed gate direction
- SWA early-start bug fix (minimum 100 steps before activation)
- FTLE-lite sensitivity-aware mixed-precision quantization (experimental)
- Eval-time extra recurrence support (not useful for non-shared models)
- Sliding-window eval safety: skipped if estimated time > 600s

Best A100 result: 1.3260 BPB (9L, sliding window stride=1024, zstd-22). Previous best was 1.4384 BPB, a 0.112 BPB improvement from bug fixes + eval strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
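A BigramHash sketch matching the commit's description (the specific multiplier constants below are illustrative; the commit only says "XOR hash with coprime multipliers"):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramHash(nn.Module):
    """hash(tok[t-1], tok[t]) -> bucket embedding -> projection to model dim."""

    def __init__(self, n_buckets: int = 4096, e_dim: int = 128, d_model: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.embed = nn.Embedding(n_buckets, e_dim)
        nn.init.zeros_(self.embed.weight)           # zero-init: no-op at start
        self.proj = nn.Linear(e_dim, d_model, bias=False)
        self.scale = nn.Parameter(torch.zeros(()))  # learned scale, zero-init

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64; previous token zero-padded at t=0
        prev = F.pad(tokens, (1, 0))[:, :-1]
        # XOR-mix the two ids with (illustrative) coprime odd multipliers
        h = ((prev * 2654435761) ^ (tokens * 40503)) % self.n_buckets
        return self.scale * self.proj(self.embed(h))
```

The zero-init scale makes the branch an exact no-op at step 0, so enabling it cannot hurt early training; the model learns how much bigram signal to mix in.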
Credit: @notapplica PR openai#60 (Muon WD), @raahilshah PR openai#162 (ortho init). Weight decay 0.04 regularizes weights for better generalization and compressibility. Orthogonal init accelerates early convergence. Grad clip 0.3 stabilizes training. val_bpb 1.2649, compressed 14.7MB (-0.5MB from weight decay). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sliding window eval (credit @mattqlf PR openai#50) with configurable stride. Muon weight decay 0.04 (credit @notapplica PR openai#60). Orthogonal init with muP scaling (credit @raahilshah PR openai#162). Gradient clipping at 0.3. int8 roundtrip val_bpb: 1.2653. Sliding window would add ~0.03 on 8xH100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Takes the proven SOTA script exactly (seq2048, MLP 3x, SmearGate, BigramHash, int6+zstd, SWA, Muon WD 0.02, OrthoInit) and adds TTT LoRA evaluation. TTT passes base_model directly (compiled). If TTT works on this architecture: expected ~1.11-1.12 bpb (new record). If TTT fails (SmearGate/BigramHash incompatibility): 1.1483 baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264), MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048), SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer. Single-seed result (seed=1337), ~8903 steps on 8xH100.
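EMA weight averaging (PR openai#287) keeps a shadow copy of the weights updated as `ema = decay * ema + (1 - decay) * w`; a minimal sketch (the decay value is illustrative):

```python
import torch


@torch.no_grad()
def ema_update(ema_params, model_params, decay: float = 0.999) -> None:
    """In-place exponential moving average of model weights."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```

Calling this once per optimizer step and evaluating the `ema_params` copy trades a small memory overhead for a smoother final checkpoint.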
…its) Combines techniques from PR openai#162, openai#180, openai#267, openai#281: - 11-layer GPT with U-Net skip connections, GQA - SmearGate + BigramHash(10240) - Mixed int5/int6 quantization + 3% magnitude pruning - Causal TTT at eval time - SWA(frac=0.4), WD=0.042, Z-loss - Target: sub-1.135 val_bpb Awaiting RunPod 8xH100 credits for 3-seed validation.
Key changes from PR openai#162 base:
- 11 layers (from 9), enabled by int6 compression headroom
- Full-weight SGD TTT (not LoRA): lr=0.002, momentum=0.9, 3 epochs over val data; first 2 blocks frozen for stability
- NTK-RoPE base=50000 (from 10000) for long-context extrapolation
- matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.035
- weight_decay=0.04 (from 0.01)
- BigramHash 2048 buckets (from 4096)
- `TTT_ENABLED=1` env var toggle

Target: match FarnsworthEngine's 1.1303 bpb or beat it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
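The full-weight SGD TTT described in this commit might look like the following sketch (the `model.blocks` attribute and the batch interface are assumptions, not the actual script):

```python
import torch
import torch.nn.functional as F


def ttt_sgd(model, val_batches, lr=0.002, momentum=0.9, epochs=3, n_freeze=2):
    """Full-weight SGD test-time training sketch (interface assumed).

    Freezes the first n_freeze blocks for stability, then fine-tunes the
    rest with next-token cross-entropy over the validation stream.
    """
    for i, block in enumerate(model.blocks):
        if i < n_freeze:
            for p in block.parameters():
                p.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr, momentum=momentum)
    for _ in range(epochs):
        for x, y in val_batches:
            loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    return model
```

Because TTT optimizes on the same stream it is later scored on, it only makes sense under eval rules that permit test-time adaptation; the frozen early blocks are the commit's stability guard.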
…nt6_MLP3x_SmearGate_BigramHash_MuonWD_SWA Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)
… 3 seeds) AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
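The per-layer lr groups with cosine decay could be set up roughly as follows (the `mlp.c_fc`/`mlp.c_proj` parameter names are assumptions about the model's naming, not confirmed by the source):

```python
import torch


def ttt_param_groups(model, base_lr: float = 1e-3):
    """Per-layer lr groups: 3x base lr for MLP output projections,
    0.5x for input projections, base lr elsewhere (names assumed)."""
    out_p, in_p, rest = [], [], []
    for name, p in model.named_parameters():
        if "mlp.c_proj" in name:
            out_p.append(p)
        elif "mlp.c_fc" in name:
            in_p.append(p)
        else:
            rest.append(p)
    return [
        {"params": out_p, "lr": 3.0 * base_lr},
        {"params": in_p, "lr": 0.5 * base_lr},
        {"params": rest, "lr": base_lr},
    ]


def make_ttt_optimizer(model, base_lr: float = 1e-3, epochs: int = 30):
    opt = torch.optim.AdamW(ttt_param_groups(model, base_lr))
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```

Stepping the scheduler once per epoch anneals every group's lr along the same cosine while preserving the 3x/0.5x/1x ratios between groups.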
Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Muon WD + SWA
Mean val_bpb: 1.1483 (3 seeds: 1.1488, 1.1485, 1.1476)
Trained on 8×H100 SXM in 600 seconds. 15.92MB artifact (int6+zstd-22).
Key Techniques
Hyperparameters
9 layers, 512 dim, MLP 3×, seq2048, batch=786K, warmdown=3000, matrix_lr=0.02, grad_clip=0.3, muon_momentum=0.99.
Metrics