Record: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training#114
saml212 wants to merge 1 commit into openai:main
Conversation
Root cause: quantize_float_tensor used the raw abs().amax() for scale computation. With only 31 int6 levels, outlier weights inflate the scale and collapse most values to zero. Fixed by using percentile clipping (99.99984th quantile), matching the proven approach from PR openai#114. Also fixed the asymmetric range [-32, 31] → symmetric [-31, 31]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
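The fix can be sketched in isolation. Below is a minimal per-tensor version of percentile-clipped symmetric int6 quantization (the names and the per-tensor rather than per-row scale are illustrative, not the repo's actual `quantize_float_tensor`):

```python
import numpy as np

def quantize_int6(w: np.ndarray, pct: float = 99.99984):
    # Derive the scale from a high percentile of |w| rather than the raw
    # max, so a single outlier cannot inflate the scale and collapse the
    # remaining values onto a handful of levels.
    clip = float(np.percentile(np.abs(w), pct))
    scale = max(clip, 1e-12) / 31.0  # symmetric range [-31, 31], not [-32, 31]
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

With a max-based scale, one weight at 100 among roughly N(0, 1) values would force scale = 100/31 ≈ 3.2 and round almost everything to zero; the percentile-based scale stays near the bulk of the distribution and only clips the outlier.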
Thanks for the great submission! Before I officially add it to the leaderboard, mind running again to verify that the 1.1574 result is within noise? 1-2 more runs showing the same result would be great.
- SWA during warmdown: average ~7 checkpoints from the second half of warmdown for flatter-loss-landscape generalization (−0.003 to −0.005 BPB)
- GRAD_CLIP_NORM=0.3: stabilizes train@2048 (from the PR openai#114 sweep)
- TRAIN_BATCH_TOKENS=786432: larger batch for seq2048 (from PR openai#114)
- EVAL_STRIDE=256: faster eval, equal or better BPB than stride=64

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
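The SWA item amounts to a uniform parameter average over the warmdown checkpoints. A generic sketch, with numpy dicts standing in for torch state_dicts (`average_checkpoints` is a hypothetical name, not the repo's utility):

```python
import numpy as np

def average_checkpoints(checkpoints: list) -> dict:
    # Uniform average of parameter dicts, accumulated in float64 to
    # limit precision loss, then cast back to the original dtype.
    n = len(checkpoints)
    avg = {k: np.zeros(v.shape, dtype=np.float64)
           for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            avg[k] += v
    return {k: (a / n).astype(checkpoints[0][k].dtype)
            for k, a in avg.items()}
```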
@0hq Totally fair, happy to rerun! My plane is taking off now, though, and RunPod doesn't currently have an 8xH100 pod available, so I won't get to this until late tonight.
@0hq Verification complete. Reran seeds 1337/1338/1339 on a fresh pod with the official RunPod template (py3.12 + torch 2.9.1+cu128):
Mean matches the original submission exactly. Seed-to-seed variance (std=0.0022) comes from training stochasticity: Muon's distributed gradient round-trip introduces small numerical differences across runs. The original 3 seeds on my dev pod had std=0.0001, which was unusually tight. Note: the first run on a fresh pod was ~100ms/step (cold torch.compile cache); after the cache warms up it stabilizes at ~81.7ms/step. The verification seeds above are all warm-cache runs. Verification logs pushed to the branch.
3-seed verified, diff cleaned up.
val_bpb = 1.1574 — Top-ranked valid submission at time of posting
Baseline: 1.2244. Improvement: -0.067 BPB / -0.120 nats. 15.98MB artifact. 600s training + 240s sliding window eval on 8xH100 SXM.
This is the third PR in a series (see also PR #61 at 1.2154 and PR #96 at 1.1764), each building on systematic experimentation across ~60 runs.
What's Novel
1. Train@2048 matches train@4096 when using sliding window eval. I tested both extensively — training at 2048 gives identical BPB to training at 4096 (1.1764 vs 1.1765) when evaluated with sliding window. The window already provides long context at eval; the model just needs to learn local patterns. Training at 2048 is strictly better because it gets more optimization steps in the 10-minute budget. Other top submissions train at 1024 or 4096 — 2048 is the overlooked sweet spot.
2. GRAD_CLIP_NORM=0.3 stabilizes long-sequence training. I swept from 0.0 to 1.0 and found a narrow optimum at 0.3 for train@2048. The mechanism: longer sequences produce higher gradient variance, and tight clipping (0.3 vs the common 1.0) stabilizes without over-constraining. Full sweep data in experiment log.
3. Batch=786K is optimal for train@2048. Swept from 393K to 1M. The default 524K is suboptimal; 786K balances gradient noise against step count.
4. Selective precision preservation. Two techniques stacked: (a) FP16 tied embedding — the dual-role embedding/output matrix is disproportionately damaged by quantization, discovered via systematic warmdown-quantization analysis in PR #61; (b) Late-K passthrough — last 2 layers' key projections kept in fp16 for sharper late-layer attention.
5. Stride=256 beats stride=64. Counter to other submissions using stride=64, I found stride=256 gives slightly better BPB (1.1574 vs 1.1579) at a quarter of the eval time; smaller strides hit diminishing returns.
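The tight clipping in point 2 is ordinary global-norm gradient clipping, just with an unusually small threshold. A minimal numpy stand-in (generic sketch, not the repo's utility):

```python
import numpy as np

def clip_grad_norm(grads: list, max_norm: float = 0.3):
    # Compute the global L2 norm across all gradient tensors, then
    # rescale every tensor by the same factor if the norm exceeds
    # max_norm, preserving the gradient direction.
    total = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, total
```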
Approach
Int6 post-training quantization (per-row scaling, symmetric ±31 range) makes weight matrices ~25% smaller than int8, freeing artifact budget to triple the MLP hidden dimension (1024 → 1536): 21.8M parameters in 15.98MB.
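A back-of-envelope check of the size claim, ignoring per-row scales and the few tensors kept in fp16:

```python
def artifact_mib(n_params: float, bits: int) -> float:
    # Raw weight footprint: bits per parameter -> bytes -> MiB.
    return n_params * bits / 8 / 2**20

int6_size = artifact_mib(21.8e6, 6)  # ~15.6 MiB, near the 15.98MB artifact
int8_size = artifact_mib(21.8e6, 8)  # ~20.8 MiB
# int6 is 6/8 = 75% of int8, i.e. ~25% smaller.
```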
Training
Evaluation
Sliding window with stride=256 at eval_seq_len=2048. Every scored token sees 1792+ tokens of context. 240s eval time on 8xH100 (well within the separate 10-min eval budget).
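The context guarantee follows from the window/stride arithmetic: with window 2048 and stride 256, each pass scores only the final stride-worth of new tokens, so every scored token after the first window sees at least 2048 − 256 = 1792 tokens of context. A sketch of the span layout (hypothetical helper, not the repo's eval loop):

```python
def sliding_window_spans(n_tokens: int, window: int = 2048, stride: int = 256):
    """Return (window_start, score_from, window_end) triples: each window
    attends from window_start, but only tokens in [score_from, window_end)
    are newly scored, so every token is scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```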
Progression
Full experiment log with ~60 runs, sweep data, and negative results available in the repository.