Non-record: Systematic Hyperparameter Search (val_bpb=1.2075) #141
Open
nglain wants to merge 1 commit into openai:main from
Methodical search through 33 experiments across A40, 1xH100, and 8xH100. Fixed-seed paired comparison (SEED=1337) for reliable delta measurement.

Key findings:
- Muon optimizer (lr=0.02, momentum=0.99, warmdown=3000): -0.005 BPB
- ROPE_BASE=200000: -0.003 BPB
- seq_len=4096: -0.006 BPB
- int6 STE conflicts with Muon optimizer (+0.007 worse)
- Hyperparameter transfer across compute scales is unreliable

val_bpb: 1.2075 (post-quant roundtrip)
Artifact: ~15.2 MB (under 16 MB cap)
Trained on 8xH100 SXM, 600s wallclock, 7390 steps
Summary
Approach
Methodical hyperparameter search through 33 experiments across three GPU tiers (A40 → 1×H100 → 8×H100), using fixed-seed paired comparison (SEED=1337) for reliable delta measurement (±0.001 BPB).
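A minimal sketch of how a fixed-seed paired delta might be read against the stated ±0.001 BPB resolution. The deltas are the ones reported in this PR; the function names, dictionary keys, and the three-way labelling are illustrative assumptions, not part of the submitted code.

```python
# Sketch only: helper names and labels are hypothetical, not from the PR's code.
NOISE_FLOOR_BPB = 0.001  # paired-comparison resolution stated in the PR


def paired_delta(baseline_bpb: float, variant_bpb: float) -> float:
    """Delta between two runs that share the same seed; negative is better."""
    return variant_bpb - baseline_bpb


def classify(delta: float, noise: float = NOISE_FLOOR_BPB) -> str:
    """Label a paired delta relative to the measurement noise floor."""
    if delta < -noise:
        return "improves"
    if delta > noise:
        return "regresses"
    return "within noise"


# Deltas as reported in this PR (variant minus baseline val_bpb).
reported_deltas = {
    "muon_lr0.02_mom0.99_warmdown3000": -0.005,
    "rope_base_200000": -0.003,
    "seq_len_4096": -0.006,
    "int6_ste_with_muon": +0.007,
}

verdicts = {name: classify(d) for name, d in reported_deltas.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {reported_deltas[name]:+.3f} BPB -> {verdict}")
```

Because both runs in a pair share SEED=1337, the difference isolates the hyperparameter change from seed-to-seed variation, which is why a delta only a few times the noise floor can still be treated as meaningful.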
What works
What doesn't work
Key insight
Optimal hyperparameters differ dramatically across compute budgets. The optimal LR on A40/2min (0.10) is 5× the optimal on 8×H100/10min (0.02). Parameters must be re-validated at target compute scale.
Changes from baseline
Only hyperparameters: MATRIX_LR=0.02, MUON_MOMENTUM=0.99, WARMDOWN_ITERS=3000, ROPE_BASE=200000, TRAIN_SEQ_LEN=4096. No architectural changes.
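As a rough illustration, the overrides above could be collected in one place and passed to a training launch as environment variables. This assumes the training script reads these names from the environment, which the upper-case naming suggests but this description does not confirm; the helper names are hypothetical and no launch command is shown because the script name is not given here.

```python
import os

# Hyperparameter overrides listed in this PR, kept as strings because
# environment variables are always strings.
OVERRIDES = {
    "MATRIX_LR": "0.02",
    "MUON_MOMENTUM": "0.99",
    "WARMDOWN_ITERS": "3000",
    "ROPE_BASE": "200000",
    "TRAIN_SEQ_LEN": "4096",
}


def build_env(overrides, base=None):
    """Return a copy of the base environment with the overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def as_shell_prefix(overrides):
    """Render overrides as a KEY=VALUE prefix for a shell command line."""
    return " ".join(f"{key}={value}" for key, value in overrides.items())


# Assumption: the training entry point reads these names via os.environ.
env = build_env(OVERRIDES, base={})
print(as_shell_prefix(OVERRIDES))
```

The resulting `env` dictionary can be handed to `subprocess.run(..., env=env)` for whatever launch command the repository uses.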
Test plan