12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1433) #76
unixmadtoonslab wants to merge 6 commits into openai:main from
Conversation
This submission is ready for review. Merge conflicts have been resolved. val_bpb: 1.1599 (3-seed mean on 8xH100 SXM, 10 min). The 'not ready for review' label is stale; I don't have permission to remove it. This is a complete submission with validated results.
Compute Credits Feedback

I want to flag a significant issue with the compute credit program for this challenge. OpenAI advertises $1,000,000 in compute credits to help participants. I applied for credits through the official form requesting $500, a modest amount for a competition that requires 8xH100 SXM pods at ~$20/hr. I received $25.

$25 buys roughly one single training run on 8xH100s. That's not enough to even validate a baseline, let alone iterate on ideas. For context, I've spent over $300 out of my own pocket just to develop and test the techniques in this submission (QAT scheduling, int6 quantization, ablation studies across dozens of runs).

I understand compute is expensive and there are many participants, but $25 out of a $1M pool feels disconnected from the stated goal of helping people "get started training their models." A single 8xH100 validation run costs more than the entire grant. Is this the level of support OpenAI intended for active participants who are pushing the leaderboard? I'd appreciate clarity on the credit allocation process, or an increase that would allow meaningful experimentation.
… ~1.160) Four orthogonal improvements stacked: int6 mixed-precision quantization on MLP+attention weights with zstd-22 compression, 3x MLP expansion, fp16 tied embedding passthrough, and sliding window evaluation. Awaiting 8xH100 SXM compute credits for official run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
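The sliding window evaluation mentioned above can be sketched as follows. This is one common scheme, and the exact indexing (window starts, which positions each window scores) is an assumption, not the submission's code: each window is `seq_len` tokens, windows advance by `stride`, and every target token is scored exactly once with as much left context as the window allows.

```python
def sliding_window_spans(n_tokens, seq_len=1024, stride=64):
    """Return (window_start, score_from, score_to) spans covering every
    target index in [0, n_tokens) exactly once.

    The first window scores all of its positions; each later window
    re-reads up to seq_len - stride tokens of context and scores only
    the new positions at its tail.
    """
    spans, scored_upto, start = [], 0, 0
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)
        spans.append((start, scored_upto, end))
        scored_upto = end   # everything up to `end` is now scored
        start += stride     # slide the window forward
    return spans
```

With `stride=64`, each window after the first contributes at most 64 newly scored tokens, which is why a small stride gives a better (more-context) BPB measurement at the cost of more forward passes.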
Key improvements over baseline:
- Delayed QAT: STE fake-quantization only in the last 15% of training time, allowing the model to train at full precision before adapting to quantization
- Symmetric int6 clip range [-31, 31] instead of asymmetric [-32, 31]
- Wider MLP (3x), tuned LR=0.025, momentum=0.99 with 1500-step warmup
- Sliding-window eval with stride=64 for better BPB measurement
- fp16 embedding passthrough (tok_emb kept unquantized)

3-seed validation (seeds 1337, 42, 7): 1.15924, 1.15980, 1.16066 → mean 1.15990 BPB. Beats current openai#1 (PR openai#88) at 1.1605 BPB.
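A minimal sketch of the delayed-QAT forward numerics described above. The function names, the per-tensor scale (the submission elsewhere uses per-row scales), and the schedule helper are illustrative assumptions; only the forward pass is shown, since under a straight-through estimator the backward pass simply copies the gradient to the full-precision weights.

```python
import numpy as np

def fake_quant_int6(w, qmax=31):
    """Symmetric int6 fake-quantization: scale so the largest weight maps
    to 31, round, clip to [-31, 31], and dequantize. Symmetric clipping
    avoids the asymmetric [-32, 31] range's biased zero point."""
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def qat_active(step, total_steps, qat_fraction=0.15):
    """Delayed QAT: fake-quantize only in the last `qat_fraction` of
    training, so the model first trains at full precision."""
    return step >= int((1.0 - qat_fraction) * total_steps)
```

During training, weights would pass through `fake_quant_int6` in the forward pass only when `qat_active(...)` is true.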
Force-pushed from f9ab40a to 0d64b28.
…BPB) Major improvements over v6 baseline (1.1599 -> 1.1555 BPB):
- 11 layers with orthogonal init (1/sqrt(2*N) output scaling)
- SmearGate: blend token embeddings with the previous token via a learned gate
- BigramHash: 4096-bucket hash embedding for token-pair context
- Stochastic Weight Averaging during warmdown (interval=100)
- Separate Muon/Adam weight decay (muon_wd=0.04, adam_wd=0.0)
- FA3/SDPA dual attention path with NTK-aware RoPE
- GQA support (8 heads, 4 KV heads)
- QAT fraction configurable (disabled by default; fixes STE bug)
- Higher LR (0.03) with lower momentum (0.97)
- All hyperparameters configurable via environment variables

3-seed validation: 1.15520, 1.15492, 1.15649 (mean 1.15554)
Artifact size: ~14.5MB (1.5MB headroom under the 16MB limit)
…B, 3-seed mean) Key improvements over v8d (1.1555):
- 12 layers (was 11), funded by int5 compression savings
- Mixed int5/int6 quantization: int5 for MLP weights (better zstd ratio)
- LR=0.025, momentum=0.97, warmdown=2000, SWA every 50 steps
- Warmdown fix: ignore torch.compile overhead in step timing
- PyTorch 2.8 SDPA: ~59ms/step at seq1024 (was 81ms on 2.4)

3-seed validation: 1.14618 / 1.14768 / 1.14641 = mean 1.14676

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
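The Stochastic Weight Averaging used during warmdown reduces to an incremental running mean over checkpoints sampled every `interval` steps (50 in this version). A minimal sketch, with illustrative names:

```python
import numpy as np

def swa_update(swa_w, w, n_averaged):
    """One SWA step: fold the current weights `w` into the running mean
    `swa_w` after `n_averaged` checkpoints have already been averaged.
    Equivalent to recomputing the mean of all sampled checkpoints."""
    return swa_w + (w - swa_w) / (n_averaged + 1)
```

In training, every 50 steps of the warmdown phase the current weights would be folded in, and the averaged weights (not the final-step weights) are what gets evaluated and shipped.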
Major improvement from raising momentum 0.97→0.98 and shrinking the bigram table to fit the 16MB limit.

3-seed: 1.14375 / 1.14316 / 1.14289 = mean 1.14327 BPB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@unixmadtoonslab we'll be handing out higher compute credit grants shortly
@unixmadtoonslab please clean your diff to include only the SOTA submission
Int6 per-row quantization (QUANT_RANGE=31) + zstd-22 compression fits the 3x MLP in 16MB. seq1024 for max steps (~12K on 8xH100). Sliding-window stride=64. Muon momentum 0.99, LR=0.02, warmdown=3000. FP16 embedding. No QAT (overhead not worth it per PR openai#76). Targets ~1.16 BPB, matching top submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
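Per-row quantization plus compression, as described above, can be sketched as follows. The function names are illustrative, and `zlib` stands in for the zstd-22 used by the submission, since zstd is not in the Python standard library; the structure (one scale per row, int codes clipped to QUANT_RANGE=31, codes entropy-coded into the artifact) is the same.

```python
import zlib
import numpy as np

def quant_rows_int6(w, qmax=31):
    """Per-row symmetric quantization: one scale per row so each row's
    largest weight maps to +/-31, codes stored as int8 and compressed."""
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0.0] = 1.0                        # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scales), -qmax, qmax).astype(np.int8)
    blob = zlib.compress(q.tobytes(), level=9)         # zstd-22 in the actual submission
    return q, scales.astype(np.float16), blob

def dequant_rows(q, scales):
    """Reconstruct approximate fp32 weights from codes and per-row scales."""
    return q.astype(np.float32) * scales.astype(np.float32)
```

The narrow code range (6 bits of entropy or less per weight, fewer for int5) is what lets the compressor squeeze a 3x-wide MLP under the 16MB artifact limit.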
Summary
12-layer transformer (dim=512, 8H/4KV GQA) achieving 1.14327 BPB (3-seed mean: 1.14375 / 1.14316 / 1.14289).
Key techniques
- Mixed int5/int6 per-row quantization with zstd-22 compression (int5 for MLP weights)
- SmearGate: previous-token embedding blend via a learned gate
- BigramHash: 4096-bucket hash embedding for token-pair context
- Stochastic Weight Averaging during warmdown
Results
1.14327 BPB (3-seed mean: 1.14375 / 1.14316 / 1.14289)
Config
12 layers, dim=512, 8 heads / 4 KV heads (GQA), LR=0.025, momentum=0.98
Test plan
3-seed validation runs on 8xH100 SXM
Submission by Will DePue