warmdown-quantization val_bpb = 1.2154 #61
Conversation
Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate NTK-RoPE extrapolation (eval@1408). Full warmdown sweep across 10 values and detailed analysis in README.
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
- fix the PR-audit notes to attribute the long-context branch to PR openai#65 rather than PR openai#61
- record PR openai#61 as schedule-side evidence about long warmdown reducing quantization damage
- keep the ideas backlog aligned with the actual GitHub PR content before using it for next-step decisions
50225c5 to 4a65f69
Confused: your title says 1.1574, but your header says "val_bpb = 1.2154 (baseline: 1.2244, improvement: 0.009 BPB / 0.017 nats)"
0hq left a comment
Tentatively approving, just to keep the leaderboard up to date for others.
Before I officially add this to the leaderboard, mind running again to verify that the 1.1574 result is within noise? 1-2 more runs that show the same result would be great.
Ah sorry, I see that this is no longer the latest PR; moving there.
Hey Will, sorry about the confusion on the documentation here. I saw people updating their PRs and thought that would be a good idea, so I went back to update this, then changed my mind and reverted it, but didn't update the title and comment since I thought nobody would be looking this far back. Missed the title; updated now.
* Warmdown-quantization co-optimization, val_bpb=1.2154

  Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate NTK-RoPE extrapolation (eval@1408). Full warmdown sweep across 10 values and detailed analysis in README.

* breakthrough: 1.1574 BPB via int6 + MLP 3x + sliding window stride=256

---------

Co-authored-by: Sam Larson <saml212@users.noreply.github.com>
After PR openai#61 (byte-disjoint corpus split + assert_train_val_disjoint guard) shipped, fix-verify-s43 ran end-to-end on the post-fix pipeline and produced BPB 1.5492 at step=12000, well below the Gate-2 threshold of 1.85 (margin +0.30).

## What this commit changes

- README.md: leads with the honest Gate-2 pass; revised 5-way taxonomy
- LEAK_INVESTIGATION.md: retraction header explaining the 216-row overcount
- trios-igla-1/README.md + config.yaml: updated to point at fix-verify-s43
- ledger_2026-04-30.sql.gz: refreshed snapshot with new last_error markers

## 5-way reclassification (Neon last_error column)

| category | count |
|---|--:|
| post-openai#61 honest Gate-2 pass | 1 |
| post-openai#61 early-stopped < step 9000 | 4 |
| pre-openai#61 W-6 numerical collapse | 46 |
| **pre-openai#61 leak (real)** | 42 |
| **warmup artifact (NOT a leak)** | 179 |

The 179 'warmup artifact' rows are early-stopped runs whose printed val_bpb stayed at 0.0000 for steps 1-8000 due to a trainer-side eval-loop bug (filed as trios-trainer-igla#62). On the post-openai#61 image, fix-verify-s43 escaped warmup at step=9000 and converged to 1.5492 by step=12000, proving the artifact is trainer-side, not data-side.

## Pipeline as flown for fix-verify-s43

- trios-trainer-igla: commit 9517980d (post-openai#61 byte-disjoint corpus)
- trios-railway: commit 69c3467 (no --ctx flag) + openai#56 --ctx accept on trainer + openai#58 smoke_train + stdout.flush() + openai#59 panic hook + startup diagnostic

## Refs

- trios-trainer-igla#56, openai#58, openai#59, openai#60, openai#61, openai#62 (all merged or filed)
- trios-railway@69c3467
- trios-railway#100, openai#101, openai#105 (Scarabaeus Engine track)

R5-honest. We retract the 216-row mass leak flag and submit fix-verify-s43 as our first honest Gate-2 pass candidate. Anchor: phi^2 + phi^-2 = 3.
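A guard in the spirit of the assert_train_val_disjoint check referenced above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the chunked-hash approach, the chunk size, and all names besides assert_train_val_disjoint are assumptions.

```python
import hashlib

def assert_train_val_disjoint(train: bytes, val: bytes, chunk: int = 64) -> None:
    """Fail loudly if any fixed-size byte chunk of val also appears in train.

    Illustrative sketch: hashes non-overlapping `chunk`-byte windows of the
    training bytes, then checks every aligned val chunk against that set.
    """
    train_hashes = {
        hashlib.sha256(train[i:i + chunk]).digest()
        for i in range(0, len(train) - chunk + 1, chunk)
    }
    for i in range(0, len(val) - chunk + 1, chunk):
        h = hashlib.sha256(val[i:i + chunk]).digest()
        assert h not in train_hashes, f"val chunk at byte {i} also appears in train"
```

A byte-disjoint split plus a guard like this is what makes the 1.5492 BPB figure credible as an honest evaluation rather than a leak artifact.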
Score
val_bpb = 1.2154 (baseline: 1.2244, improvement: 0.009 BPB / 0.017 nats)
Key Finding
Post-training int8 quantization is the dominant BPB bottleneck on 8xH100. The quantization penalty alone (0.014 BPB at default settings) is larger than most hyperparameter improvements combined. We reduce it 3x via an always-decaying LR schedule.
Novel Contributions
1. Always-decaying LR schedule (WARMDOWN_ITERS=20000): With ~12,200 actual steps, the LR decays linearly from 61% of peak at step 0 to near-zero at the final step. Post-quant penalty drops from 0.014 to 0.005 BPB. Full curve mapped across 10 warmdown values (2400-30000).
2. FP16 tied embeddings: Keep tok_emb.weight in fp16 during int8 export. Costs ~500KB, offset by MLP_HIDDEN=992.
3. Optimal NTK-RoPE extrapolation: eval@1408 (1.375x training length) beats eval@2048 on well-trained 8xH100 models. Full curve from 1024 to 2048.
4. Optimizer-warmdown interaction: MUON_BACKEND_STEPS=5 beats 7 at high warmdown (a reversal from the low-warmdown regime). When warmdown has already smoothed the weights, fewer backend steps beat more precise orthogonalization.
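The always-decaying schedule in item 1 can be sketched as below. The function name `lr_scale` and the exact placement of the warmdown window are illustrative assumptions; WARMDOWN_ITERS=20000 and the ~12,200-step run length are taken from the text.

```python
WARMDOWN_ITERS = 20000  # warmdown window, longer than the whole run
TOTAL_STEPS = 12200     # approximate actual training steps

def lr_scale(step: int) -> float:
    """Fraction of peak LR at a given step under a linear warmdown.

    Because WARMDOWN_ITERS > TOTAL_STEPS, every step of the run falls
    inside the warmdown window, so the LR decays from step 0 onward.
    """
    steps_remaining = TOTAL_STEPS - step
    return max(0.0, min(1.0, steps_remaining / WARMDOWN_ITERS))

print(round(lr_scale(0), 2))      # 0.61 -- matches the "61% of peak" in item 1
print(round(lr_scale(12200), 2))  # 0.0
```

The schedule never plateaus at peak LR: it starts at 12200/20000 = 61% of peak and decays linearly toward zero at the final step.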
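A minimal sketch of the mixed-precision export in item 2, assuming a simple symmetric per-tensor int8 scheme; `export_weights`, its signature, and the (quantized array, scale) layout are hypothetical, not the repo's actual exporter.

```python
import numpy as np

def export_weights(state: dict, keep_fp16: set) -> dict:
    """Quantize weights to int8 except names in keep_fp16, kept as fp16.

    Illustrative: symmetric per-tensor scale, clipped to [-127, 127].
    """
    out = {}
    for name, w in state.items():
        if name in keep_fp16:
            out[name] = w.astype(np.float16)  # ~2 bytes/param for the embedding
        else:
            scale = max(float(np.abs(w).max()) / 127.0, 1e-8)
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            out[name] = (q, scale)            # 1 byte/param plus one fp scale
    return out
```

Keeping only the tied tok_emb.weight in fp16 doubles its footprint relative to int8, which is the ~500KB cost the text offsets with MLP_HIDDEN=992.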
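Item 3's NTK-RoPE extrapolation is commonly implemented by rescaling the RoPE base. The sketch below uses the widely used NTK-aware rule base' = base * scale^(d/(d-2)); whether this repo uses exactly this rule is an assumption, and the function name is illustrative.

```python
def ntk_rope_freqs(dim: int, base: float = 10000.0,
                   train_len: int = 1024, eval_len: int = 1408) -> list:
    """Per-pair rotary frequencies with NTK-aware base rescaling.

    scale = eval_len / train_len (1.375x here, matching eval@1408 in item 3);
    enlarging the base slows the rotation of high-frequency dimensions so the
    model tolerates positions beyond its training length.
    """
    scale = eval_len / train_len
    ntk_base = base * scale ** (dim / (dim - 2))
    return [ntk_base ** (-2 * i / dim) for i in range(dim // 2)]
```

The text's finding is that a moderate scale (1.375x) evaluates better than pushing to 2x (eval@2048) on well-trained models.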
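For item 4: Muon-style optimizers orthogonalize the update matrix with a Newton-Schulz iteration, and the step count plays the role of MUON_BACKEND_STEPS. The sketch below uses the plain cubic iteration as an assumption; the repo's tuned coefficients may differ.

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Push the singular values of G toward 1 via cubic Newton-Schulz.

    Normalizing by the Frobenius norm puts all singular values in [0, 1],
    where each iteration X <- 1.5 X - 0.5 X X^T X increases them toward 1.
    More steps means a more precise orthogonalization; the text's finding
    is that 5 steps beat 7 once a long warmdown has smoothed the weights.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X
```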
Configuration
15.91MB artifact. 8xH100 SXM (RunPod). See README.md for full warmdown sweep data.