
Non-record: Systematic Hyperparameter Search (val_bpb=1.2075)#141

Open
nglain wants to merge 1 commit into openai:main from nglain:submission/systematic-search

Conversation


nglain commented Mar 20, 2026

Summary

| Metric | Value |
| --- | --- |
| Post-quant val_bpb | 1.2075 |
| Pre-quant val_bpb | 1.2008 |
| Compressed artifact | ~15.2 MB |
| Training steps | 7,390 |
| Training time | 600s (8×H100 SXM) |

Approach

A methodical hyperparameter search spanning 33 experiments across three GPU tiers (A40 → 1×H100 → 8×H100), using fixed-seed paired comparisons (SEED=1337) so that val_bpb deltas down to ±0.001 BPB are reliably attributable to the change under test.
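A minimal sketch of the paired-comparison idea (helper name and baseline numbers are illustrative, not from the PR): two runs share SEED=1337 and differ only in the hyperparameter under test, so a val_bpb delta is attributed to that change only once it clears the ±0.001 BPB noise floor.

```python
# Sketch of fixed-seed paired comparison (hypothetical helper; illustrative numbers).
NOISE_FLOOR_BPB = 0.001  # run-to-run noise at a fixed seed

def compare(baseline_bpb: float, candidate_bpb: float) -> str:
    """Classify a candidate run against its seed-matched baseline."""
    delta = candidate_bpb - baseline_bpb
    if abs(delta) <= NOISE_FLOOR_BPB:
        return f"inconclusive (delta {delta:+.4f} within noise)"
    verdict = "better" if delta < 0 else "worse"  # lower BPB is better
    return f"{verdict} by {abs(delta):.4f} BPB"

# A 0.005 BPB improvement clears the noise floor; a 0.0003 shift does not.
print(compare(1.2125, 1.2075))  # -> better by 0.0050 BPB
print(compare(1.2075, 1.2078))  # -> inconclusive (delta +0.0003 within noise)
```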

What works

  • Muon optimizer (lr=0.02, momentum=0.99, warmdown=3000): -0.005 BPB
  • ROPE_BASE=200000: -0.003 BPB
  • seq_len=4096: -0.006 BPB
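The warmdown=3000 setting suggests the learning rate decays over the final 3,000 of the 7,390 steps. A sketch of a constant-then-linear-warmdown schedule (the exact shape used by train_gpt.py is an assumption):

```python
# Assumed constant-then-linear-warmdown LR schedule; the actual
# train_gpt.py schedule may differ in shape or final value.
MATRIX_LR = 0.02
WARMDOWN_ITERS = 3000
TOTAL_ITERS = 7390  # steps reported in this PR

def lr_at(step: int) -> float:
    warmdown_start = TOTAL_ITERS - WARMDOWN_ITERS
    if step < warmdown_start:
        return MATRIX_LR  # constant phase
    # linear decay to zero over the final WARMDOWN_ITERS steps
    frac = (TOTAL_ITERS - step) / WARMDOWN_ITERS
    return MATRIX_LR * frac
```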

What doesn't work

  • int6 STE + Muon: the two interact badly (+0.007 BPB worse)
  • 12 layers: slower per step, so fewer steps fit the time budget
  • Larger batch (786K): the drop in step count outweighs the per-step quality gain
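For context on the int6 STE finding, a sketch of symmetric per-tensor int6 fake quantization, forward pass only (the PR's actual quantizer and clipping range are assumptions): under a straight-through estimator, gradients pass through the rounding unchanged, and it is that interaction with Muon's updates which the PR reports as harmful.

```python
import numpy as np

# Sketch of symmetric per-tensor int6 fake quantization (forward only).
# Range [-31, 31] is an assumption; an STE would backprop through
# this rounding as if it were the identity.
QMAX = 31

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    scale = float(np.abs(w).max()) / QMAX
    if scale == 0.0:
        return w  # all-zero tensor quantizes to itself
    q = np.clip(np.round(w / scale), -QMAX, QMAX)
    return q * scale
```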

Key insight

Optimal hyperparameters differ dramatically across compute budgets: the best LR found on A40/2min (0.10) is 5× the best on 8×H100/10min (0.02). Every parameter must be re-validated at the target compute scale.

Changes from baseline

Only hyperparameters: MATRIX_LR=0.02, MUON_MOMENTUM=0.99, WARMDOWN_ITERS=3000, ROPE_BASE=200000, TRAIN_SEQ_LEN=4096. No architectural changes.
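The full set of overrides, written out as environment variables for reference (whether train_gpt.py reads them from the environment or as in-script constants is an assumption):

```shell
# Hyperparameter overrides from this PR; consumption mechanism assumed.
export MATRIX_LR=0.02
export MUON_MOMENTUM=0.99
export WARMDOWN_ITERS=3000
export ROPE_BASE=200000
export TRAIN_SEQ_LEN=4096
```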

Test plan

  • Trained on 8×H100 SXM, 600s wallclock
  • final_int8_zlib_roundtrip val_bpb: 1.2075
  • Artifact under 16,000,000 bytes
  • train_gpt.py compiles and runs from records folder
  • train.log included
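A sketch of the int8 + zlib size check behind the 16,000,000-byte cap (helper names are hypothetical; the record's actual final_int8_zlib_roundtrip harness may serialize differently):

```python
import io
import zlib
import numpy as np

# Hypothetical size check: quantize each tensor to symmetric per-tensor
# int8, concatenate the raw bytes, and measure the zlib-compressed size.
SIZE_CAP_BYTES = 16_000_000

def compressed_size(weights: dict) -> int:
    buf = io.BytesIO()
    for name, w in sorted(weights.items()):  # deterministic order
        scale = float(np.abs(w).max()) / 127
        if scale == 0.0:
            scale = 1.0  # avoid division by zero for all-zero tensors
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        buf.write(q.tobytes())
    return len(zlib.compress(buf.getvalue(), level=9))

# Usage: assert compressed_size(state_dict_as_numpy) < SIZE_CAP_BYTES
```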

