warmdown-quantization val_bpb = 1.2154 #61
Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate NTK-RoPE extrapolation (eval@1408). Full warmdown sweep across 10 values and detailed analysis in README.
- Add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- Promote long-context training plus matching long-context eval as a first-class clean branch, based on PR openai#61 and PR openai#63
- Refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- Update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
- Fix the PR-audit notes to attribute the long-context branch to PR openai#65 rather than PR openai#61
- Record PR openai#61 as schedule-side evidence that long warmdown reduces quantization damage
- Keep the ideas backlog aligned with the actual GitHub PR content before using it for next-step decisions
Force-pushed from 50225c5 to 4a65f69
Confused: your title says 1.1574, but your header says "val_bpb = 1.2154 (baseline: 1.2244, improvement: 0.009 BPB / 0.017 nats)".
0hq left a comment
Tentatively approving, just to keep the leaderboard up to date for others.
Before I officially add this to the leaderboard, mind running again to verify that the 1.1574 result is within noise? 1-2 more runs showing the same result would be great.
Ah sorry, I see that this is no longer the latest PR; moving there.
Hey Will, sorry about the confusion in the documentation here. I saw people updating their PRs and thought that would be a good idea, so I went back to update this one, then changed my mind and reverted it, but didn't update the title and comment since I figured nobody would be looking this far back. I missed the title; it's updated now.
* Warmdown-quantization co-optimization, val_bpb=1.2154

  Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate NTK-RoPE extrapolation (eval@1408). Full warmdown sweep across 10 values and detailed analysis in README.

* breakthrough: 1.1574 BPB via int6 + MLP 3x + sliding window stride=256

---------

Co-authored-by: Sam Larson <saml212@users.noreply.github.com>
Score
val_bpb = 1.2154 (baseline: 1.2244, improvement: 0.009 BPB / 0.017 nats)
Key Finding
Post-training int8 quantization is the dominant BPB bottleneck on 8xH100. The quantization penalty alone (0.014 BPB at default settings) is larger than most hyperparameter improvements combined. We reduce it roughly 3x, to 0.005 BPB, via an always-decaying LR schedule.
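The quantization penalty above can be estimated by round-tripping the weights through int8 and re-running eval. A minimal sketch, assuming symmetric per-tensor scaling (the actual exporter may use per-channel scales or a different clipping rule):

```python
import torch

def int8_roundtrip(w: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int8 quantize -> dequantize. Illustrative stand-in
    # for the export path, not the repo's exact exporter.
    scale = w.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127)
    return q * scale

# The penalty is then measured as:
#   penalty_bpb = eval_bpb(model with round-tripped weights) - eval_bpb(fp model)
```

The per-element round-trip error is bounded by half a quantization step (scale/2), which is why smoother, smaller-magnitude weights late in training quantize with less BPB damage.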
Novel Contributions
1. Always-decaying LR schedule (WARMDOWN_ITERS=20000): With ~12,200 actual steps, the LR decays linearly from 61% of peak at step 0 to near-zero at the final step. Post-quant penalty drops from 0.014 to 0.005 BPB. Full curve mapped across 10 warmdown values (2400-30000).
2. FP16 tied embeddings: Keep tok_emb.weight in fp16 during int8 export. Costs ~500KB of artifact budget, offset by setting MLP_HIDDEN=992.
3. Optimal NTK-RoPE extrapolation: eval@1408 (1.375x training length) beats eval@2048 on well-trained 8xH100 models. Full curve from 1024 to 2048.
4. Optimizer-warmdown interaction: MUON_BACKEND_STEPS=5 beats 7 at high warmdown (a reversal from low warmdown). Once warmdown has already smoothed the weights, cheaper, less exact orthogonalization wins out over extra backend steps.
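Contribution 1 can be sketched concretely. With a linear warmdown over the final WARMDOWN_ITERS steps of the schedule and WARMDOWN_ITERS (20000) exceeding the actual step count (~12,200), the decay segment covers the whole run: LR starts at 12200/20000 = 61% of peak and hits zero at the final step, matching the numbers above. Function and argument names here are illustrative, not the repo's API:

```python
def warmdown_lr(step, peak_lr, num_iters=12200, warmdown_iters=20000):
    """Linear warmdown over the final `warmdown_iters` steps of the schedule.

    When warmdown_iters > num_iters, the decay spans the entire run: the LR
    starts at (num_iters / warmdown_iters) * peak and reaches zero at the end.
    """
    decay_start = num_iters - warmdown_iters  # negative in this configuration
    if step < decay_start:
        return peak_lr  # constant phase (never reached when warmdown > run)
    return peak_lr * (num_iters - step) / warmdown_iters
```

For example, `warmdown_lr(0, 1.0)` gives 0.61 and `warmdown_lr(12200, 1.0)` gives 0.0, i.e. the "61% of peak at step 0 to near-zero at the final step" behavior described above.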
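Contribution 3 relies on NTK-aware RoPE scaling. A hedged sketch of the standard NTK-aware base inflation, assuming the common formulation (the PR's exact variant and head_dim are not shown here): the rotary base is scaled so the lowest frequencies stretch to the longer eval context while the highest frequencies stay near the trained regime.

```python
import torch

def ntk_rope_inv_freq(head_dim=64, base=10000.0, train_len=1024, eval_len=1408):
    # NTK-aware RoPE: inflate the base by scale^(d/(d-2)) so low-frequency
    # dimensions cover eval_len while high-frequency dimensions are nearly
    # unchanged. Defaults are illustrative, not the repo's config.
    scale = max(eval_len / train_len, 1.0)  # 1.375 for eval@1408 on 1024-trained
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim)
```

The eval@1408 sweet spot in the sweep corresponds to scale=1.375: enough extra context to help BPB, mild enough that the high-frequency rotations stay in-distribution.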
Configuration
15.91MB artifact. 8xH100 SXM (RunPod). See README.md for full warmdown sweep data.