
Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556#65

Merged
cocohearts merged 4 commits into openai:main from aquariouseworkman:main
Mar 20, 2026

Conversation

@aquariouseworkman
Contributor

@aquariouseworkman aquariouseworkman commented Mar 19, 2026

Submission: SmearGate + OrthoInit + Muon WD + Int6 STE QAT + MLP 3x + Sliding Window

val_bpb: 1.1556 (post-quant int6+zstd-22, sliding window eval stride=64)

Summary

A 22.4M parameter transformer language model trained in under 10 minutes on 8×H100 GPUs, compressed to a 15.1MB artifact via int6 quantization-aware training and zstd-22. The architecture combines a SmearGate bigram embedding layer, orthogonal weight initialization, 3× MLP expansion, U-Net skip connections, and decoupled Muon weight decay, evaluated with sliding window context at stride 64.

Architecture

Transformer Core

A 9-layer, 512-dim transformer with 8 attention heads (4 KV heads via grouped-query attention) and tied input/output embeddings over a 1024-token BPE vocabulary. Sequence length during training is 1024 tokens.

SmearGate

A learned per-dimension gate (~512 params) that blends each token's embedding with the previous token's embedding before the transformer processes anything:

```python
gate = sigmoid(self.gate)  # shape [dim], init ≈ 0.95
output = gate * current_emb + (1 - gate) * prev_token_emb
```

This injects bigram (two-token) context directly into the embedding layer. Normally a transformer must discover token-pair relationships through self-attention; SmearGate provides this signal for free. The gate is initialized via sigmoid(3.0) ≈ 0.95 so it starts near-identity (mostly current token), and the model learns per-dimension how much previous-token blending is useful.

Applied after embedding lookup and bigram hash addition, before RMS normalization.
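As a concrete illustration, here is a minimal NumPy sketch of the blend. The function name and the zero-padding of position 0 (which has no predecessor) are assumptions of this sketch, not taken from the submission's code:

```python
import numpy as np

def smear_gate(emb, gate_logits):
    """Blend each position's embedding with the previous position's embedding.

    emb: (seq, dim) token embeddings; gate_logits: (dim,) learned logits.
    Position 0 is blended with zeros, an assumption of this sketch.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid, one gate per dimension
    prev = np.roll(emb, 1, axis=0)             # shift embeddings right by one token
    prev[0] = 0.0                              # first token has no predecessor
    return gate * emb + (1.0 - gate) * prev

emb = np.arange(8.0).reshape(4, 2)
out = smear_gate(emb, np.full(2, 3.0))         # sigmoid(3.0) ≈ 0.95, near-identity
```

With the gate initialized at sigmoid(3.0), each output row is ~95% the current token and ~5% the previous one.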

Bigram Hash Embedding

A 4096-bucket hash table (dim=128, projected to 512) maps consecutive token pairs to learned embeddings via `(prev * 92821 + cur) % 4096`. This gives the model direct access to token-pair features at minimal parameter cost.
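The bucket computation itself is a one-liner; a small sketch follows (using padding id 0 for the first position is an assumption of this sketch):

```python
def bigram_bucket(prev_tok, cur_tok, n_buckets=4096):
    """Hash a consecutive token pair into one of n_buckets (formula from above)."""
    return (prev_tok * 92821 + cur_tok) % n_buckets

def bigram_buckets(tokens, n_buckets=4096):
    """Bucket index for every position; position 0 pairs with an assumed padding id 0."""
    return [bigram_bucket(p, c, n_buckets) for p, c in zip([0] + tokens[:-1], tokens)]

buckets = bigram_buckets([5, 17, 17, 900])
```

Each bucket index would then select a learned 128-dim embedding that is projected to 512 and added to the token embedding.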

MLP 3× Expansion

MLP hidden dimension is 3× the model dimension (1536 for a 512-dim model). The space savings from int6 quantization fund this extra capacity — wider MLPs allow more expressive nonlinear feature transformation between attention operations.

U-Net Skip Connections

The 9-layer transformer is split into an encoder half (4 layers) and a decoder half (5 layers) with learned skip weights connecting corresponding encoder/decoder layers. This gives the decoder direct access to earlier representations without relying solely on the residual stream.
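A minimal sketch of the wiring, with trivial stand-in callables in place of real transformer blocks. Pairing encoder outputs with decoder layers in reverse (U-Net style) and using one learned scalar per skip are assumptions of this sketch:

```python
import numpy as np

def unet_stack(x, enc_layers, dec_layers, skip_w):
    """Run a 4-encoder / 5-decoder stack with weighted skip connections.

    enc_layers, dec_layers: callables standing in for transformer blocks.
    skip_w: one learned scalar per encoder/decoder pair (assumed form).
    """
    skips = []
    for layer in enc_layers:
        x = layer(x)
        skips.append(x)                        # save encoder outputs
    for i, layer in enumerate(dec_layers):
        if i < len(skips):                     # first 4 decoder layers get a skip
            x = x + skip_w[i] * skips[-1 - i]  # reverse pairing, U-Net style
        x = layer(x)
    return x

blocks = [lambda h: h + 1.0 for _ in range(9)]  # toy stand-in blocks
y = unet_stack(np.zeros(2), blocks[:4], blocks[4:], skip_w=np.zeros(4))
```

With all skip weights at zero the stack reduces to a plain 9-layer residual pipeline; the model learns how much of each encoder representation to reinject.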

Training

Muon Optimizer with Weight Decay

The Muon optimizer (MomentUm Orthogonalized by Newton-Schulz) runs SGD with Nesterov momentum, then post-processes each 2D parameter's gradient update by replacing it with the nearest orthogonal matrix via 5-step Newton-Schulz iteration. This is equivalent to steepest descent under the spectral norm, improving the conditioning of the optimization landscape.

Decoupled weight decay (`p.mul_(1 - wd * lr)`, wd=0.01) is applied before each gradient update. This keeps weights smaller and better-distributed, which directly benefits both generalization and downstream quantization — tighter weight distributions quantize into fewer int6 buckets with less error and compress better with zstd.

Momentum is warmed from 0.92 → 0.99 over the first 1500 steps.
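A NumPy sketch of a single Muon step may make the pieces concrete. It uses the classic cubic Newton-Schulz iteration for clarity (production Muon implementations typically use a tuned quintic polynomial) and a fixed momentum rather than the warmed schedule; all names are illustrative:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Orthogonalize G via the cubic Newton-Schulz iteration X <- 1.5X - 0.5 X X^T X.

    Normalizing by the Frobenius norm guarantees the spectral norm is <= 1,
    the iteration's convergence region; it drives every singular value toward 1.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def muon_step(p, grad, buf, lr=0.02, momentum=0.95, wd=0.01):
    """One Muon update with decoupled weight decay applied first."""
    p = p * (1.0 - wd * lr)                 # decoupled decay: p.mul_(1 - wd * lr)
    buf = momentum * buf + grad             # momentum accumulation
    update = newton_schulz(grad + momentum * buf)  # Nesterov-style lookahead
    return p - lr * update, buf
```

As a sanity check, orthogonalizing a matrix with singular values 0.8 and 0.3 for a few extra steps yields a matrix whose singular values are all ~1.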

Orthogonal Weight Initialization

All non-zero-init CastedLinear weight matrices are initialized with `nn.init.orthogonal_()`. Orthogonal matrices have all singular values equal to 1, meaning gradients flow uniformly through the network at initialization with no vanishing or exploding signals. Additionally, since Muon's Newton-Schulz step orthogonalizes updates, starting from an already-orthogonal matrix means early updates are immediately useful rather than spent correcting a random initialization. With only ~12k steps in the 10-minute budget, faster convergence matters.
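For reference, orthogonal initialization can be sketched with a QR decomposition of a Gaussian matrix, which is essentially what `nn.init.orthogonal_` does under the hood:

```python
import numpy as np

def orthogonal_init(rows, cols, seed=0):
    """Orthogonal init via QR of a Gaussian matrix (sketch of nn.init.orthogonal_).

    The result has every singular value equal to 1, so the layer neither
    shrinks nor amplifies signals at initialization.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # sign fix for a uniform Haar distribution
    return q[:rows, :cols] if rows >= cols else q[:cols, :rows].T

W = orthogonal_init(4, 4)
```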

Int6 Quantization-Aware Training (STE)

All 2D weight matrices are fake-quantized to int6 ([-31, 31]) during every forward pass via Straight-Through Estimator — the forward pass sees quantized weights while gradients flow through the rounding operation as if it were identity. The model learns weight configurations that are inherently robust to post-training quantization. The tied embedding matrix is stored as fp16 passthrough (not quantized), since it serves double duty for both input embeddings and output predictions where errors compound in both directions.
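The forward-pass half of the fake quantization can be sketched in a few lines. Per-row symmetric scaling is assumed here, matching the int6 range [-31, 31]; the straight-through backward pass (gradients copied through the rounding as identity) is omitted:

```python
import numpy as np

def fake_quant_int6(w):
    """Per-row symmetric int6 fake quantization (forward-pass view of the STE).

    Each row is scaled so its max magnitude maps to 31, rounded to integers
    in [-31, 31], then rescaled back to floats.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.maximum(scale, 1e-12)           # guard against all-zero rows
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale, q.astype(np.int8)        # dequantized weights + int6 codes

w = np.array([[0.5, -1.0, 0.25], [2.0, 0.1, -2.0]])
w_dq, codes = fake_quant_int6(w)
```

Each element's reconstruction error is at most half a quantization step, i.e. half the row scale.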

Learning Rate Schedule

Warmup over 20 steps, followed by linear warmdown over the final 3000 steps. Separate learning rates for tied embeddings (0.030), matrix parameters (0.020), and scalar parameters (0.020).
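Assuming the usual trapezoidal shape (the exact endpoint handling here is a guess of this sketch, not read from the code), the schedule looks like:

```python
def lr_at(step, base_lr, total_steps=12047, warmup=20, warmdown=3000):
    """Trapezoidal schedule: linear warmup, flat middle, linear warmdown.

    Matches the description above: 20 warmup steps, warmdown over the final
    3000 steps of a ~12,047-step run.
    """
    if step < warmup:
        return base_lr * (step + 1) / warmup            # linear warmup
    if step > total_steps - warmdown:
        return base_lr * (total_steps - step) / warmdown  # linear warmdown
    return base_lr                                      # flat region

mid_lr = lr_at(5000, 0.020)   # matrix LR in the flat region
```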

Evaluation

Sliding Window (stride=64)

Instead of chopping validation text into non-overlapping chunks (where tokens near the start of each chunk lack context), sliding window uses overlapping windows with stride 64 and the full 1024-token context window. Each scored token gets 960+ tokens of prior context. This is purely an evaluation-time technique — it does not change the model.
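The window layout reduces to index arithmetic; a sketch follows. Having each window score only its last `stride` tokens, and letting the very first window score everything (those tokens necessarily have less context), are assumptions of this sketch:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, n_scored) per evaluation window.

    Every scored token in a non-initial window sees at least
    window - stride = 960 tokens of prior context.
    """
    spans = [(0, window, window)]          # first window scores all its tokens
    start = stride
    while start + window <= n_tokens:
        spans.append((start, start + window, stride))  # score last `stride` only
        start += stride
    return spans

spans = sliding_windows(4096)
```

The cost is one forward pass per stride-64 hop instead of one per 1024-token chunk, which is why the sliding eval takes extra wall-clock time.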

Export

Int6 + zstd-22 Compression

All quantized weights are packed into int8 containers and compressed with zstandard at level 22. The int6 representation plus aggressive compression brings the full submission (model + code) to 15.1MB, under the 16MB cap.
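The pack-and-compress step can be sketched as follows. Stdlib `zlib` stands in for the third-party zstandard dependency (level 22), and the one-code-per-byte layout is illustrative rather than the exporter's actual format:

```python
import zlib
import numpy as np

def export_weights(int6_codes, level=9):
    """Pack int6 codes into int8 containers and compress the byte stream.

    Values in [-31, 31] each fit one int8 byte; the narrow value range
    is what lets the entropy coder compress well below 1 byte per weight.
    """
    raw = int6_codes.astype(np.int8).tobytes()
    return zlib.compress(raw, level)

rng = np.random.default_rng(0)
codes = np.clip(np.round(rng.normal(0, 8, 4096)), -31, 31)
blob = export_weights(codes)
bytes_per_weight = len(blob) / codes.size
```

Decompression plus `np.frombuffer(..., dtype=np.int8)` recovers the codes exactly; only the per-row scales need to be stored alongside them to dequantize.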

Metrics

| Metric | Value |
|--------|-------|
| Post-quant sliding window val_bpb | 1.1556 |
| Post-quant sliding window val_loss | 1.9511 |
| Post-quant standard val_bpb | 1.1891 |
| Post-quant standard val_loss | 2.0077 |
| Quantization gap (standard eval) | ~0.0001 BPB |
| Model parameters | 22,368,840 |
| Artifact size (int6+zstd-22) | 15,878,809 bytes (15.1 MB) |
| Train steps completed | 12,047 |
| Train time | 600 s (10.0 min) |
| Sliding window eval time | 75 s |
| Peak GPU memory | 11,340 MiB |

Configuration

```
VOCAB_SIZE=1024
NUM_LAYERS=9
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=4
MLP_MULT=3
TIE_EMBEDDINGS=1
USE_SMEARGATE=1
TRAIN_SEQ_LEN=1024
TRAIN_BATCH_TOKENS=524288
LOGIT_SOFTCAP=30.0
ROPE_BASE=10000.0
QK_GAIN_INIT=1.5
BIGRAM_HASH_BUCKETS=4096
BIGRAM_HASH_DIM=128
TIED_EMBED_LR=0.030
MATRIX_LR=0.020
SCALAR_LR=0.020
MUON_MOMENTUM=0.99
MUON_MOMENTUM_WARMUP_START=0.92
MUON_MOMENTUM_WARMUP_STEPS=1500
MUON_WEIGHT_DECAY=0.01
MUON_BACKEND_STEPS=5
WARMDOWN_ITERS=3000
WARMUP_STEPS=20
EVAL_STRIDE=64
MAX_WALLCLOCK_SECONDS=600
SEED=1337
```

Command

```bash
RUN_ID=smeargate_orthoinit_muonwd \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Hardware

8× NVIDIA H100 80GB HBM3 SXM (RunPod).

arjun-krishna1 added a commit to arjun-krishna1/parameter-golf that referenced this pull request Mar 19, 2026
…_bpb 1.1652

Stack five techniques from systematic PR analysis:
- MLP_MULT=3.0 (hidden=1536) for wider model capacity (from PR openai#70)
- int6 per-row quant on MLP+attn, fp16 tied embed passthrough (from PR openai#70)
- zstd-22 compression (from PR openai#70)
- TRAIN_SEQ_LEN=4096 for richer per-step training signal (from PR openai#65)
- Sliding window eval at stride=64 with compiled forward_logits

Mean val_bpb=1.16520 (std=0.00102, t=92.15, p<<0.001).
Three seeds: 1.16615, 1.16532, 1.16412.
Artifact: 15.6MB (under 16,000,000 byte cap).
Training: 9370 steps at 64ms/step on 8xH100 SXM.

Made-with: Cursor
## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding which lacks STE fake-quant. Reduces quant penalty from +0.048 to +0.0015 BPB.
3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings.
We were running defaults — now matching the winners:

  MUON_MOMENTUM:       0.95 → 0.99 (stronger smoothing)
  MATRIX_LR:           0.04 → 0.02 (halved, reduces quant gap)
  SCALAR_LR:           0.04 → 0.02 (halved)
  TIED_EMBED_LR:       0.05 → 0.03 (halved)
  WARMDOWN_ITERS:      1200 → 3000 (longer warmdown)
  MUON_WARMUP_START:   0.85 → 0.92 (higher start)
  MUON_WARMUP_STEPS:   500  → 1500 (3x longer warmup)

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652),
openai#70 (1.1659), openai#65 (1.1808) — all top submissions.

Applied to both v5 and v6. Both compile, 1498 lines each.
@aquariouseworkman aquariouseworkman changed the title Record: Seq4096 + Sliding Window Eval, val_bpb=1.1808 Record: Mixed Quant (int6+int8) + Sliding Window, val_bpb=1.1630 Mar 19, 2026
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- fix the PR-audit notes to attribute the long-context branch to PR openai#65 rather than PR openai#61
- record PR openai#61 as schedule-side evidence about long warmdown reducing quantization damage
- keep the ideas backlog aligned with the actual GitHub PR content before using it for next-step decisions
phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request Mar 19, 2026
openai#77, openai#78)

Analyzed techniques, ablations, and individual BPB contributions.
Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029)
are the dominant validated techniques. Several promising combinations
remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jordankzf jordankzf mentioned this pull request Mar 19, 2026
xskuy pushed a commit to xskuy/parameter-golf that referenced this pull request Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs):

1. Sliding window eval (stride=256): overlapping windows give each token
   more context. Free ~0.03 bpb improvement, zero artifact cost.
   Based on PRs openai#70, openai#77, openai#65.

2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and
   EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8,
   allowing bigger models. Based on PRs openai#78, openai#70.

3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives
   ~0.019 bpb improvement. Based on PRs openai#70, openai#66.

4. Default dim=512 with LR=0.03 (best config from experiments).

5. forward_logits() helper for sliding window (avoids model.forward
   which returns loss, not logits).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
arjun-krishna1 added a commit to arjun-krishna1/parameter-golf that referenced this pull request Mar 19, 2026
Add Straight-Through Estimator fake int6 quantization to CastedLinear
during training. Forward pass uses quantized weights (int6 per-row),
backward passes gradients through originals. Teaches weight distributions
that survive post-training int6 quantization.

Composes with existing: seq4096, MLP 3x, fp16 tok_emb, int6+zstd, stride=64.

Three seeds:
- SEED=1337: val_bpb=1.16356083
- SEED=1338: val_bpb=1.16275343
- SEED=1339: val_bpb=1.16337225

Mean=1.16323, std=0.00042, t=230.34 (df=2), p<<0.001.
Artifact: 15.3MB (under 16,000,000 byte cap).

Made-with: Cursor
lolrazh added a commit to lolrazh/parameter-golf that referenced this pull request Mar 19, 2026
Downloaded PR openai#65 SOTA train_gpt.py (1.1630 BPB). Added zstandard dep,
use_sota flag to toggle between baseline and SOTA scripts.
5-min baseline recorded: val_bpb=1.3738, post-quant=1.3766.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aquariouseworkman and others added 2 commits March 19, 2026 23:26
@aquariouseworkman aquariouseworkman changed the title Record: Mixed Quant (int6+int8) + Sliding Window, val_bpb=1.1630 Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556 Mar 20, 2026