
Record: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean 3 seeds)#180

Merged
cocohearts merged 4 commits into openai:main from thwu1:10L-int5mlp-wd04-swa50 on Mar 20, 2026

Conversation

@thwu1
Contributor

@thwu1 thwu1 commented Mar 20, 2026

10L Int5-MLP + MuonWD=0.04 + SWA/50 + SmearGate + BigramHash

val_bpb: 1.14526 (sliding window stride=64, post int6+zstd roundtrip)

Key Innovation: Mixed Int5/Int6 Quantization

Use int5 [-16,15] for MLP weights and int6 [-32,31] for attention weights. Stored one value per byte, int5 leaves the top 3 bits of every byte zero, so zstd-22 compresses it at 1.88x vs int6's 1.51x, saving 1.86MB of artifact. This funds a 10th transformer layer while staying under the 16MB limit.
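The packing trick above can be sketched as follows. This is a minimal numpy sketch, not the repo's `train_gpt.py`; per-tensor symmetric scaling and the function names are assumptions:

```python
import numpy as np

def quantize_int5(w: np.ndarray):
    """Symmetric per-tensor int5 quantization into [-16, 15], one value per byte."""
    max_abs = np.abs(w).max()
    scale = max_abs / 15.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    # Shift to [0, 31]: every stored byte uses only its low 5 bits,
    # leaving the top 3 bits zero -- the pattern zstd-22 exploits.
    packed = (q.astype(np.int16) + 16).astype(np.uint8)
    return packed, scale

def dequantize_int5(packed: np.ndarray, scale: float) -> np.ndarray:
    """Invert the shift and rescale back to float32."""
    return (packed.astype(np.int16) - 16).astype(np.float32) * scale
```

Int6 works the same way with range [-32, 31] and a shift of 32, using the low 6 bits of each byte.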

Technique Stack

  1. Mixed int5 MLP / int6 attention — 1.86MB artifact savings
  2. 10 layers (vs 9 baseline) — extra depth funded by int5 savings
  3. MuonWD = 0.04 (swept 0.01–0.05) — improves quant friendliness
  4. SWA every 50 steps (~29 checkpoint average, swept 25–200)
  5. SmearGate + BigramHash + OrthoInit (from PR #162: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA, mean val_bpb=1.1483)
  6. zstd-22 compression, fp16 tied embedding passthrough

Metrics

| Metric | Value |
| --- | --- |
| val_bpb (sliding window) | 1.14526 |
| Artifact | 15,521,020 bytes (15.52 MB) |
| Model params | 24,730,705 |
| Steps | 6,694 in 600 s (89.5 ms/step) |
| SWA checkpoints | 29 |

Run Command

NUM_LAYERS=10 bash eval/eval.sh

Ablation

| Config | BPB | Delta |
| --- | --- | --- |
| 9L int6 (PR #162) | 1.14847 | baseline |
| + int5 MLP (9L) | 1.15663 | +0.008 (quant cost) |
| + 10th layer | 1.14803 | -0.0005 |
| + WD=0.04 + SWA/50 | 1.14526 | -0.003 |

Built on PR #162 by @unnir.

Copilot AI review requested due to automatic review settings March 20, 2026 06:55

Copilot AI left a comment


Pull request overview

Adds a new 10-minute / 16MB track record entry under records/track_10min_16mb/ documenting a 10-layer model that uses mixed int5 (MLP) + int6 (attention) post-training quantization, along with Muon weight decay tuning and SWA.

Changes:

  • Add the record’s training script (train_gpt.py) implementing mixed int5/int6 quantization and SWA settings used for the run.
  • Add run artifacts/documentation: training log and README describing the approach and results.
  • Add a submission.json metadata file for record tracking.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/train_gpt.py | Record training/export script including mixed int5/int6 quantization + SWA logic |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/train_seed1337.log | Training log for the reported run/score |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/README.md | Record write-up and reproduction guidance |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/submission.json | Record metadata for downstream indexing/leaderboard tooling |



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab9ecb2f69


…/submission.json

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
kellyvv added a commit to kellyvv/parameter-golf that referenced this pull request Mar 20, 2026
- MUON_WD: decoupled weight decay for Muon optimizer (0.04 = SOTA)
  p.data.mul_(1 - lr * wd) before gradient update
- SWA_EVERY: Stochastic Weight Averaging every N steps (50 = SOTA)
  Accumulates running average of model weights, applies at end
- Both controlled via env vars, disabled by default (0)
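A minimal sketch of the two mechanisms this commit message describes, with illustrative names rather than the repo's actual code: decoupled weight decay applied before the gradient step, and a running average of weights taken every N steps:

```python
def decoupled_wd_step(param, grad, lr, wd):
    """Decoupled weight decay: shrink the weight first, then apply the gradient.
    Mirrors `p.data.mul_(1 - lr * wd)` before the update."""
    param = param * (1 - lr * wd)
    return param - lr * grad

class SWA:
    """Running average of model weights, updated every `every` steps."""
    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step, weights):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

At the end of training the accumulated `avg` replaces the live weights, as the commit message notes.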
kellyvv added a commit to kellyvv/parameter-golf that referenced this pull request Mar 20, 2026
…hnique)

- MLP weights use int5 [-16,15]: 3 zero high bits per byte → zstd 1.88x
- Attention weights keep int6 [-32,31]: zstd 1.51x
- Saves ~1.86MB artifact → funds 10th transformer layer
- Dequantize auto-detects scheme via qmeta (int5/int6/int8)
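The qmeta-based dispatch might look like this. This is a hypothetical sketch; the metadata layout, dict keys, and function name are assumptions, not the repo's API:

```python
# Unsigned-byte offset for each supported scheme (int5 -> [0,31], etc.).
_OFFSETS = {"int5": 16, "int6": 32, "int8": 128}

def dequantize(raw_bytes, qmeta):
    """Map stored unsigned bytes back to floats using the recorded scheme."""
    off = _OFFSETS[qmeta["scheme"]]
    return [(b - off) * qmeta["scale"] for b in raw_bytes]
```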
@thwu1
Contributor Author

thwu1 commented Mar 20, 2026

Updated Results: val_bpb = 1.1428 (mean of 3 seeds)

Significant improvements over the original submission (1.1453):

Key Changes

  • bigram_vocab_size: 4096 → 10240 — More hash buckets reduce token-pair collisions (+0.001 bpb)
  • SWA_start_frac: 0.5 → 0.4 — Fewer but more converged SWA checkpoints (+0.0006 bpb)
  • warmdown_iters: 4000 → 3000 — More full-LR training steps (+0.0002 bpb)
  • weight_decay: 0.01 → 0.04 (global) — Better quantization robustness (+0.0002 bpb)
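Why more buckets help can be illustrated with a toy hash; the multiplicative hash below is an assumption for illustration, not the repo's implementation:

```python
def bigram_bucket(prev_tok, tok, vocab_size):
    """Hash a token pair into one of `vocab_size` buckets."""
    return (prev_tok * 1000003 + tok) % vocab_size

def collision_rate(pairs, vocab_size):
    """Fraction of (distinct) token pairs sharing a bucket with another pair."""
    buckets = {bigram_bucket(a, b, vocab_size) for a, b in pairs}
    return 1 - len(buckets) / len(pairs)
```

With a fixed set of observed bigrams, growing `vocab_size` from 4096 to 10240 can only reduce (never increase the expected) collision rate, at the cost of a larger embedding table.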

3-Seed Reproduction (eval_stride=64)

| Seed | val_bpb | artifact_bytes |
| --- | --- | --- |
| 42 | 1.14271 | 15,965,978 |
| 1337 | 1.14298 | 15,830,186 |
| 2024 | 1.14260 | ~15.8M |
| Mean | 1.14276 | |
| Std | 0.00016 | |

How to run

bash prepare.sh
bash eval/eval.sh > run.log 2>&1
grep "^val_bpb:\|^valid:" run.log

To run with a specific seed:

SEED=42 bash eval/eval.sh > run.log 2>&1

Full Recipe

  • 10 layers, model_dim=512, 8 heads, 4 KV heads (GQA)
  • Int5 MLP + Int6 attention quantization + zstd-22 compression
  • BigramHash(10240, dim=128) + SmearGate + U-Net skip connections
  • SWA: start_frac=0.4, every=50 steps (24 checkpoint average)
  • Muon optimizer: matrix_lr=0.02, WD=0.04, momentum=0.99
  • AdamW for embeddings/scalars: WD=0.04
  • warmdown=3000 iters, warmup=20 steps, grad_clip=0.3
  • seq_len=2048, batch=786K tokens, seed=42
  • 3% magnitude pruning, sliding window eval stride=64
  • Orthogonal init + muP scaling, tied embeddings in FP16 passthrough
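The 3% magnitude-pruning step in the recipe can be sketched as follows (illustrative numpy, not the repo's code): zero the smallest-magnitude weights before quantization so they become long runs of identical bytes for zstd.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, frac: float = 0.03) -> np.ndarray:
    """Zero the smallest-magnitude `frac` of entries in w."""
    k = int(frac * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value is the pruning threshold.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```

Note that with tied magnitudes this zeros every entry at the threshold, so slightly more than `frac` may be pruned.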

Updating train_gpt.py in the PR now.

…0.04

Key improvements over original 1.1453:
- bigram_vocab_size: 4096 → 10240 (fewer hash collisions)
- SWA_start_frac: 0.5 → 0.4 (more converged checkpoints)
- warmdown: 4000 → 3000 (more full-LR training)
- weight_decay: 0.04 global (both Muon and AdamW)

3-seed results: 1.14271, 1.14298, 1.14260 (mean=1.14276, std=0.00016)
All params set as defaults in train_gpt.py. Run: bash eval/eval.sh
@thwu1 thwu1 changed the title Record: 10L Int5-MLP + MuonWD=0.04 + SWA/50 (val_bpb=1.1453) Record: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean 3 seeds) Mar 20, 2026
@cocohearts cocohearts merged commit ee82226 into openai:main Mar 20, 2026
newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
…q ramp

Adapted from PR #180 SOTA (1.1428 BPB):
- INT5 quantization for MLP weights (int6 for attention) — saves ~1.86MB
- zstd-22 compression instead of zlib — better ratio on sparse int5 data
- 3% magnitude pruning before quantization — zeros compress well
- Sequence length ramp: start at 256, ramp to full at 25% of training
- QAT updated to fake-quantize int5 for MLP, int6 for rest

New env vars: INT5_MLP, USE_ZSTD, ZSTD_LEVEL, PRUNE_PCT,
              SEQ_RAMP_START, SEQ_RAMP_FRAC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
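The sequence-length ramp this commit describes might be sketched as below. This is hypothetical; the env-var wiring via SEQ_RAMP_START/SEQ_RAMP_FRAC, and the linear schedule itself, are assumptions:

```python
def seq_len_at(step, total_steps, start=256, full=2048, ramp_frac=0.25):
    """Linear ramp: `start` tokens at step 0, `full` from ramp_frac of training on."""
    ramp_steps = max(1, int(total_steps * ramp_frac))
    if step >= ramp_steps:
        return full
    return start + int((full - start) * step / ramp_steps)
```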
romainsantoli-web pushed a commit to romainsantoli-web/parameter-golf that referenced this pull request Mar 21, 2026
…its)

Combines techniques from PR openai#162, openai#180, openai#267, openai#281:
- 11-layer GPT with U-Net skip connections, GQA
- SmearGate + BigramHash(10240)
- Mixed int5/int6 quantization + 3% magnitude pruning
- Causal TTT at eval time
- SWA(frac=0.4), WD=0.042, Z-loss
- Target: sub-1.135 val_bpb

Awaiting RunPod 8xH100 credits for 3-seed validation.
newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
PR #180 base (10L, BigramHash 10240, SWA, WD 0.04, 1.1428 BPB)
with progressive MLP: skinny-early fat-late layer allocation.
Includes 1-GPU A/B test script (uniform 3.0x vs progressive 1.5-4.5x).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
Switch from SDPA (B,H,T,D) to Flash Attention 3 (B,T,H,D):
- Import flash_attn_interface
- Rotary cache shape [None,:,None,:] for BTHD layout
- CausalSelfAttention: drop transpose, use flash_attn_3_func
- q_gain broadcast adjusted for BTHD

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
Three-way 1xGPU signal test:
  A: Uniform MLP 3.0× (exact #180 control)
  B: Progressive MLP 1.5→4.5× (same total params)
  C: Progressive MLP 1.5→4.5× + BigramHash 16384/192

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Combined @thwu1's SOTA (PR openai#180): 10L Int5-MLP BigramHash(10240) SWA
with our novel contributions: adaptive softcap (base=20), RoPE 500k, QK 3.0.
Result: val_bpb=1.1511, 15.9MB. Would rank ~3rd on official leaderboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Record: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean 3 seeds)
Fraser-Greenlee pushed a commit to Fraser-Greenlee/parameter-golf that referenced this pull request Mar 22, 2026
… BPB)

Replaces openai#180 base with openai#315's full stack:
- Partial RoPE (16/64 head dims)
- LN Scale (1/sqrt(layer+1) depth dampening)
- XSA on last N layers (efficient GQA-aware, no repeat_interleave)
- EMA weight averaging (decay=0.997)
- FA3 (flash_attn_3_func)
- Late QAT support (note: inactive under torch.compile)
- NTK-aware RoPE for length extrapolation

Attention reuse (ATTN_REUSE=1) carried forward:
- Decoder half layers drop Q,K projections, reuse from last
  encoder layer. Saves ~1.3MB compressed for extra capacity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
…ude pruning

PR openai#180 is the current merged SOTA at 1.1428 BPB (no TTT needed). Key
innovations: mixed int5/int6 quantization saves ~1.86MB for extra layer,
BigramHash(10240) for bigram context, 3% magnitude pruning, orthogonal
init with muP, Muon WD=0.04, and zstd-22 compression.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
parinzee added a commit to parinzee/parameter-golf that referenced this pull request Mar 23, 2026
…1309)

3-seed validation results:
- Seed 42:   val_bpb=1.13109, artifact=15,764,564 bytes
- Seed 1337: val_bpb=1.13085, artifact=15,626,741 bytes
- Seed 2024: val_bpb=1.13067, artifact=15,923,256 bytes
- Mean: 1.13087 (std: 0.00017)

Key techniques: 11 layers, GQA (8H/4KV), XSA on last 4 layers,
LeakyReLU(0.5)², Partial RoPE (16/64), EMA (0.997), int6 quantization,
zstd-22 compression, BigramHash(2048,128), warmdown_iters=4500.

Built on baseline by @thwu1 (PR openai#180).
SelfAnush added a commit to SelfAnush/parameter-golf that referenced this pull request Mar 23, 2026
… slow TRSM on H100

Non-record: MUD optimizer (arxiv:2603.17970)
Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram
preconditioning. Single seed (42) on 8xH100 SXM.
Results:
- val_bpb: 1.1989 (sliding window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118ms (4.5x slower than Muon's ~26ms on H100)
Key finding: Strong convergence (within 0.056 BPB of SOTA with 4x
fewer steps) but TRSM overhead on H100 CUDA negates the 12x FLOP
savings reported in the paper (tested on A100/MI250/GH200).
Built on SOTA by @thwu1 (PR openai#180).
Paper: https://arxiv.org/abs/2603.17970
joey00072 added a commit to joey00072/parameter-golf that referenced this pull request Mar 23, 2026
…0.04

Reproduce openai/parameter-golf PR openai#180 (val_bpb 1.14276, 3-seed mean).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
joey00072 added a commit to joey00072/parameter-golf that referenced this pull request Mar 23, 2026
Replace every other MLP (layers 0,2,4,6,8) with BigLU — an MLP where the
hidden state is gated by a per-layer bigram embedding (vocab=2048, dim=hidden,
expansion scale=1). Reduce mlp_mult 3.0→1.5 (hidden 1536→768) so total MLP
params stay identical to PR openai#180 (15.73M).

- Muon for up/down weights; AdamW for bigram embed tables (like main bigram)
- bigram.embed excluded from matrix_params to avoid Muon

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DivijChawla pushed a commit to DivijChawla/parameter-golf that referenced this pull request Mar 24, 2026
rt_bpb: 1.14594585 (seed 1337, 8xH200, 10-min wall-clock, 7306 steps)

Built on thwu1's PR openai#180. Only 2 lines changed: BigramHash defaults
updated from vocab=10240 dim=128 to vocab=16384 dim=64.

Same ~1MB embedding budget, wider vocabulary captures more distinct
bigram hash buckets. Proxy sweeps confirmed wider vocab > higher dim
at equal parameter cost. Artifact: 15,923,771 bytes zstd-22.

Single seed, additional seeds pending.
alexanderaperry-arch pushed a commit to alexanderaperry-arch/parameter-golf that referenced this pull request Mar 27, 2026
Non-record research submission. 2x2 factorial ablation of QAT x SWA
interaction on PR openai#180 stack (10L/512d/MLP3x).

Key finding: SWA and QAT are antagonistic. QAT alone (1.14018, 3-seed
mean) beats SWA alone (1.14382) by 3.64 mBPB. Combining them is worse
than either alone. This explains why prior QAT entries underperformed
non-QAT submissions in the competition.

3-seed validation (seeds 42, 1337, 2024), artifact under 16MB limit.
