Record: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean 3 seeds) #180
Conversation
Pull request overview
Adds a new 10-minute / 16MB track record entry under records/track_10min_16mb/ documenting a 10-layer model that uses mixed int5 (MLP) + int6 (attention) post-training quantization, along with Muon weight decay tuning and SWA.
Changes:
- Add the record's training script (train_gpt.py) implementing the mixed int5/int6 quantization and SWA settings used for the run.
- Add run artifacts/documentation: training log and README describing the approach and results.
- Add a submission.json metadata file for record tracking.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/train_gpt.py | Record training/export script including mixed int5/int6 quantization + SWA logic |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/train_seed1337.log | Training log for the reported run/score |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/README.md | Record write-up and reproduction guidance |
| records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/submission.json | Record metadata for downstream indexing/leaderboard tooling |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab9ecb2f69
…/submission.json
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- MUON_WD: decoupled weight decay for the Muon optimizer (0.04 = SOTA); applies p.data.mul_(1 - lr * wd) before the gradient update
- SWA_EVERY: Stochastic Weight Averaging every N steps (50 = SOTA); accumulates a running average of model weights and applies it at the end
- Both controlled via env vars, disabled by default (0)
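A minimal sketch of the two mechanisms, assuming the obvious implementation (a plain scalar stands in for a parameter tensor; the real script does this on torch tensors inside the Muon step):

```python
import os

# Both techniques are env-var-controlled and off by default, as in the commit.
MUON_WD = float(os.environ.get("MUON_WD", "0"))    # 0.04 in the record
SWA_EVERY = int(os.environ.get("SWA_EVERY", "0"))  # 50 in the record

def muon_step(w, grad, lr, wd=MUON_WD):
    """Decoupled weight decay: shrink the weight BEFORE the gradient update."""
    w = w * (1 - lr * wd)   # p.data.mul_(1 - lr * wd)
    return w - lr * grad    # then the (orthogonalized) gradient update

class SWA:
    """Running average of model weights, applied once at the end of training."""
    def __init__(self):
        self.avg, self.n = 0.0, 0

    def maybe_accumulate(self, step, w, every=SWA_EVERY):
        if every and step % every == 0:
            self.n += 1
            self.avg += (w - self.avg) / self.n  # incremental mean
```

With the record settings (MUON_WD=0.04, SWA_EVERY=50), each step shrinks weights slightly before the update, and every 50th checkpoint feeds the running average that replaces the weights at the end.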
…hnique)
- MLP weights use int5 [-16,15]: 3 zero high bits per byte → zstd 1.88x
- Attention weights keep int6 [-32,31]: zstd 1.51x
- Saves ~1.86MB of artifact → funds the 10th transformer layer
- Dequantize auto-detects the scheme via qmeta (int5/int6/int8)
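A toy sketch of the scheme on plain Python lists (the real script quantizes torch weight tensors and stores qmeta next to each one; the helper names here are illustrative):

```python
def quantize(vals, bits):
    """Symmetric quantization to signed `bits`-wide integers.

    int5 -> [-16, 15], int6 -> [-32, 31]. Returns (q, qmeta); qmeta records
    the scheme so dequantize can auto-detect int5/int6/int8 per tensor.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in vals) / qmax) or 1.0  # avoid /0 on all-zero
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in vals]
    return q, {"bits": bits, "scale": scale}

def dequantize(q, qmeta):
    # The scheme is implied by qmeta["bits"]; no per-tensor flags needed.
    return [v * qmeta["scale"] for v in q]
```

Values quantized to int5 occupy only the low 5 bits of each stored byte, which is what gives zstd the extra compression headroom over int6.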
Updated results: val_bpb = 1.1428 (mean of 3 seeds), a significant improvement over the original submission (1.1453).
Key Changes
3-Seed Reproduction (eval_stride=64)
How to run:
bash prepare.sh
bash eval/eval.sh > run.log 2>&1
grep "^val_bpb:\|^valid:" run.log
To run with a specific seed: SEED=42 bash eval/eval.sh > run.log 2>&1
Full Recipe
…0.04
Key improvements over the original 1.1453:
- bigram_vocab_size: 4096 → 10240 (fewer hash collisions)
- SWA_start_frac: 0.5 → 0.4 (more converged checkpoints)
- warmdown: 4000 → 3000 (more full-LR training)
- weight_decay: 0.04 global (both Muon and AdamW)
3-seed results: 1.14271, 1.14298, 1.14260 (mean=1.14276, std=0.00016)
All params set as defaults in train_gpt.py. Run: bash eval/eval.sh
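The bigram_vocab_size change is just the bucket count of a hashed bigram embedding table. A sketch of the idea (the mixing constants are made up; only the bucketing behavior matches the record):

```python
def bigram_bucket(prev_tok, cur_tok, vocab=10240):
    """Hash the (previous, current) token pair into one of `vocab` buckets
    of a small embedding table. A larger `vocab` means fewer distinct
    bigrams colliding in the same bucket."""
    h = (prev_tok * 0x9E3779B1 ^ cur_tok * 0x85EBCA77) & 0xFFFFFFFF
    return h % vocab
```

Going from 4096 to 10240 buckets trades a bigger embedding table for fewer collisions, which is where the quality gain comes from.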
…q ramp
Adapted from PR #180 SOTA (1.1428 BPB):
- INT5 quantization for MLP weights (int6 for attention) — saves ~1.86MB
- zstd-22 compression instead of zlib — better ratio on sparse int5 data
- 3% magnitude pruning before quantization — zeros compress well
- Sequence length ramp: start at 256, ramp to full at 25% of training
- QAT updated to fake-quantize int5 for MLP, int6 for rest
New env vars: INT5_MLP, USE_ZSTD, ZSTD_LEVEL, PRUNE_PCT,
SEQ_RAMP_START, SEQ_RAMP_FRAC
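A sketch of the sequence-length ramp under stated assumptions (the commit fixes only start=256 and the 25% ramp fraction; the full length of 2048 and the multiple-of-128 rounding are illustrative):

```python
def seq_len_at(step, total_steps, start=256, full=2048, ramp_frac=0.25):
    """Linearly ramp the training sequence length from `start` to `full`,
    reaching full length at ramp_frac of training."""
    frac = min(1.0, step / (ramp_frac * total_steps))
    length = int(start + frac * (full - start))
    return max(start, length - length % 128)  # keep block-friendly lengths
```

Short early sequences make early steps cheap, so more optimizer steps fit into the fixed 10-minute budget.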
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…its) Combines techniques from PR openai#162, openai#180, openai#267, openai#281:
- 11-layer GPT with U-Net skip connections, GQA
- SmearGate + BigramHash(10240)
- Mixed int5/int6 quantization + 3% magnitude pruning
- Causal TTT at eval time
- SWA(frac=0.4), WD=0.042, Z-loss
Target: sub-1.135 val_bpb. Awaiting RunPod 8xH100 credits for 3-seed validation.
PR #180 base (10L, BigramHash 10240, SWA, WD 0.04, 1.1428 BPB) with progressive MLP: skinny-early fat-late layer allocation. Includes 1-GPU A/B test script (uniform 3.0x vs progressive 1.5-4.5x). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from SDPA (B,H,T,D) to Flash Attention 3 (B,T,H,D):
- Import flash_attn_interface
- Rotary cache shape [None,:,None,:] for the BTHD layout
- CausalSelfAttention: drop the transpose, use flash_attn_3_func
- q_gain broadcast adjusted for BTHD
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three-way 1xGPU signal test:
- A: Uniform MLP 3.0× (exact #180 control)
- B: Progressive MLP 1.5→4.5× (same total params)
- C: Progressive MLP 1.5→4.5× + BigramHash 16384/192
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
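Variants B and C allocate MLP width progressively across depth. A sketch, assuming linear interpolation (the commit states only the 1.5→4.5× endpoints and that total params match the uniform 3.0× control):

```python
def mlp_mults(n_layers=10, lo=1.5, hi=4.5):
    """Per-layer MLP expansion factors: skinny early, fat late.
    Linear interpolation keeps the mean at 3.0x, matching variant A's
    parameter budget."""
    return [lo + (hi - lo) * i / (n_layers - 1) for i in range(n_layers)]
```

Equal total params is the point of the A/B design: any delta isolates the effect of where the width goes, not how much there is.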
Combined @thwu1's SOTA (PR openai#180): 10L Int5-MLP BigramHash(10240) SWA with our novel contributions: adaptive softcap (base=20), RoPE 500k, QK 3.0. Result: val_bpb=1.1511, 15.9MB. Would rank ~3rd on official leaderboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… BPB) Replaces the openai#180 base with openai#315's full stack:
- Partial RoPE (16/64 head dims)
- LN Scale (1/sqrt(layer+1) depth dampening)
- XSA on last N layers (efficient GQA-aware, no repeat_interleave)
- EMA weight averaging (decay=0.997)
- FA3 (flash_attn_3_func)
- Late QAT support (note: inactive under torch.compile)
- NTK-aware RoPE for length extrapolation
Attention reuse (ATTN_REUSE=1) carried forward: decoder half layers drop their Q,K projections and reuse them from the last encoder layer, saving ~1.3MB compressed for extra capacity.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ude pruning PR openai#180 is the current merged SOTA at 1.1428 BPB (no TTT needed). Key innovations: mixed int5/int6 quantization saves ~1.86MB for extra layer, BigramHash(10240) for bigram context, 3% magnitude pruning, orthogonal init with muP, Muon WD=0.04, and zstd-22 compression. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
… 3 seeds) AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
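The per-layer lr groups can be sketched as plain optimizer param-group dicts (the substring matches like `mlp.c_proj` / `mlp.c_fc` are assumed names, not taken from the submission):

```python
def ttt_param_groups(named_params, base_lr):
    """Build per-layer lr groups for TTT: 3x lr for MLP output projections,
    0.5x for MLP input projections, 1x for everything else."""
    mults = {"mlp.c_proj": 3.0, "mlp.c_fc": 0.5}
    groups = {}
    for name, p in named_params:
        m = next((v for k, v in mults.items() if k in name), 1.0)
        groups.setdefault(m, []).append(p)
    # Same shape of dicts that torch.optim.AdamW accepts as param groups.
    return [{"params": ps, "lr": base_lr * m} for m, ps in groups.items()]
```

The returned list would be passed straight to the optimizer constructor in place of a flat parameter list.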
…1309) 3-seed validation results:
- Seed 42: val_bpb=1.13109, artifact=15,764,564 bytes
- Seed 1337: val_bpb=1.13085, artifact=15,626,741 bytes
- Seed 2024: val_bpb=1.13067, artifact=15,923,256 bytes
- Mean: 1.13087 (std: 0.00017)
Key techniques: 11 layers, GQA (8H/4KV), XSA on last 4 layers, LeakyReLU(0.5)², Partial RoPE (16/64), EMA (0.997), int6 quantization, zstd-22 compression, BigramHash(2048,128), warmdown_iters=4500.
Built on the baseline by @thwu1 (PR openai#180).
… slow TRSM on H100
Non-record: MUD optimizer (arxiv:2603.17970). Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram preconditioning. Single seed (42) on 8xH100 SXM.
Results:
- val_bpb: 1.1989 (sliding window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118ms (4.5x slower than Muon's ~26ms on H100)
Key finding: strong convergence (within 0.056 BPB of SOTA with 4x fewer steps), but TRSM overhead on H100 CUDA negates the 12x FLOP savings reported in the paper (tested on A100/MI250/GH200).
Built on SOTA by @thwu1 (PR openai#180). Paper: https://arxiv.org/abs/2603.17970
…0.04 Reproduce openai/parameter-golf PR openai#180 (val_bpb 1.14276, 3-seed mean). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace every other MLP (layers 0,2,4,6,8) with BigLU: an MLP whose hidden state is gated by a per-layer bigram embedding (vocab=2048, dim=hidden, expansion scale=1). Reduce mlp_mult 3.0→1.5 (hidden 1536→768) so total MLP params stay identical to PR openai#180 (15.73M).
- Muon for up/down weights; AdamW for bigram embed tables (like the main bigram)
- bigram.embed excluded from matrix_params to avoid Muon
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rt_bpb: 1.14594585 (seed 1337, 8xH200, 10-min wall-clock, 7306 steps) Built on thwu1's PR openai#180. Only 2 lines changed: BigramHash defaults updated from vocab=10240 dim=128 to vocab=16384 dim=64. Same ~1MB embedding budget, wider vocabulary captures more distinct bigram hash buckets. Proxy sweeps confirmed wider vocab > higher dim at equal parameter cost. Artifact: 15,923,771 bytes zstd-22. Single seed, additional seeds pending.
Non-record research submission. 2x2 factorial ablation of QAT x SWA interaction on PR openai#180 stack (10L/512d/MLP3x). Key finding: SWA and QAT are antagonistic. QAT alone (1.14018, 3-seed mean) beats SWA alone (1.14382) by 3.64 mBPB. Combining them is worse than either alone. This explains why prior QAT entries underperformed non-QAT submissions in the competition. 3-seed validation (seeds 42, 1337, 2024), artifact under 16MB limit.
10L Int5-MLP + MuonWD=0.04 + SWA/50 + SmearGate + BigramHash
val_bpb: 1.14526 (sliding window stride=64, post int6+zstd roundtrip)
Key Innovation: Mixed Int5/Int6 Quantization
Use int5 [-16,15] for MLP weights and int6 [-32,31] for attention. Stored one value per byte, int5 leaves the top 3 bits zero, so zstd-22 compresses it at 1.88x vs int6's 1.51x, saving ~1.86MB. This funds the 10th transformer layer while staying under 16MB.
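The compressibility claim can be sanity-checked with stdlib zlib as a stand-in for zstd-22, assuming offset-binary storage (one quantized value per byte, so int5 touches only the low 5 bits). Uniform random data understates the real ratios, since trained weights cluster near zero:

```python
import random
import zlib

random.seed(0)
n = 1 << 16  # 64 KiB of fake quantized weights
int6 = bytes(random.randrange(64) for _ in range(n))  # full 6-bit range
int5 = bytes(random.randrange(32) for _ in range(n))  # top 3 bits always 0

ratio6 = n / len(zlib.compress(int6, 9))
ratio5 = n / len(zlib.compress(int5, 9))
# Fewer used bits per byte -> better ratio: the effect the record exploits.
assert ratio5 > ratio6 > 1.0
```

On the record's peaked weight distributions the gap widens further, which is where the reported 1.88x vs 1.51x comes from.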
Technique Stack
Metrics
Run Command
Ablation
Built on PR #162 by @unnir.