
Update: 11L MLP3x + WD=0.04 + zstd-22 (val_bpb 1.1502)#86

Merged
cocohearts merged 3 commits into openai:main from
aruniyer:submission/10L-lowlr-fp16embed-int6
Mar 20, 2026

Conversation

@aruniyer

The README content covers everything: 5-seed results, t-stat, methodology

… 1.2129)

10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129
across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 bpb
(t=34.12, p<<0.001).

Key changes:
- 10 layers (vs 9 baseline)
- Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- FP16 tied embedding export (reduces quant gap)
- Int6 quantization for middle layers 2-7 (fits under 16MB)

Mean artifact size: 15.36MB (under 16MB cap).
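The int6 export above can be sketched as a symmetric per-tensor quantize/dequantize pair. This is a minimal illustration, not the PR's actual export code; the function names and the choice of a symmetric [-31, 31] grid are assumptions:

```python
import numpy as np

def quantize_int6(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int6: map weights onto the 63 levels in [-31, 31]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    # stored in int8 containers; only 6 bits of range are used
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)
# round-to-nearest error is bounded by half a quantization step
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Keeping the embedding in FP16 while the middle blocks go to int6 trades a few bits on the layers most sensitive to rounding, which is consistent with the "reduces quant gap" note above.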

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB).

Key changes:
- 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params)
- QAT: STE fake-quantize simulates int6 during training
- Int6 quantization on all block weights (layers 0-8)
- Sliding window eval (stride=64) for ~0.033 BPB free gain
- FP16 tied embedding + lower LRs (carried over)
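The STE fake-quantize in the QAT bullet can be sketched as a quantize-dequantize applied in the forward pass only. This is an illustrative stand-alone version (the PR's training code is not shown); in an autograd framework the straight-through estimator treats the rounding as identity on the backward pass, e.g. via the usual `w + (fake_quant(w) - w).detach()` trick:

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Forward: round onto the int6 grid and scale back to float.
    Backward (in an autograd framework): straight-through, i.e.
    d(fake_quant)/dw ~= 1, so full-precision weights still get gradients."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -31, 31) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128))
w_q = fake_quant_int6(w)
# fake-quant is idempotent: values already on the grid stay put
assert np.allclose(fake_quant_int6(w_q), w_q)
```

Training against the quantized weights this way means the int6 export at the end introduces no distribution shift the model has not already seen.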

5-seed results on 8xH100 SXM:
  Mean slide_bpb: 1.1652 (std=0.0017)
  Mean rt_bpb:    1.1985
  t-statistic:    78.93 (p << 0.001)
  All artifacts under 16MB (mean: 15.64MB)
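The stride-64 sliding-window eval can be sketched as follows (an assumed implementation; the PR's eval code is not shown). Each window scores only its last `stride` tokens, so every token after the first window is predicted with near-full left context instead of the short context a disjoint-chunk eval gives tokens near chunk starts — which is where the "free" BPB gain comes from:

```python
def window_spans(n_tokens: int, ctx: int, stride: int) -> list[tuple[int, int, int]]:
    """Return (window_start, window_end, score_from) triples.

    Tokens [score_from, window_end) are scored in each window; the model
    conditions on everything from window_start, so scored tokens see up to
    `ctx` tokens of context rather than resetting every `ctx` tokens.
    """
    spans = []
    scored = 0
    while scored < n_tokens:
        end = min(ctx, n_tokens) if not spans else min(scored + stride, n_tokens)
        start = max(0, end - ctx)
        spans.append((start, end, scored))
        scored = end
    return spans

spans = window_spans(n_tokens=300, ctx=256, stride=64)
# every token is scored exactly once
covered = [t for (_, end, frm) in spans for t in range(frm, end)]
assert covered == list(range(300))
```

Smaller strides give more context per scored token at the cost of proportionally more forward passes.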

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aruniyer aruniyer changed the title 10L Mixed Precision: val_bpb=1.2129 (lower LR + fp16 embed + int6 middle) Update: MLP 3x + QAT + Int6 + Sliding Window (val_bpb 1.1652) Mar 20, 2026
Major upgrade: 11 layers + decoupled weight decay + zstd-22 compression.

Key changes:
- 11 layers (was 9) — more depth, funded by int6+zstd compression
- Weight decay 0.04 on Muon + AdamW — quantization-friendly weights
- zstd-22 compression — saves 1.5MB vs zlib, critical for 11L fit
- Higher Muon momentum (0.99) + warmup tuning
- SWA attempted but dropped (hurts with QAT)
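"Decoupled" weight decay here is in the AdamW sense: the decay is applied directly to the weights rather than folded into the gradient, so it is not rescaled by the adaptive step. A minimal sketch (hyperparameters other than wd=0.04 are illustrative, not the PR's values):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.04):
    """One AdamW update with decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w * (1 - lr * wd)                        # decay acts on weights directly...
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # ...independent of the Adam step
    return w, m, v

w = np.ones(4)
w, m, v = adamw_step(w, g=np.zeros(4), m=np.zeros(4), v=np.zeros(4), t=1)
# with zero gradient only the decay acts: w shrinks by exactly (1 - lr*wd)
assert np.allclose(w, 1 - 3e-4 * 0.04)
```

A plausible reading of "quantization-friendly" above: decay shrinks the weights' dynamic range, and a smaller max-abs means a finer int6 step for the same 6 bits.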

3-seed results on 8xH100 SXM:
  Mean slide_bpb: 1.1502 (std=0.0004)
  t-statistic: 313.20 (p << 0.001)
  All artifacts under 16MB (mean 15.4MB)
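The reported t-statistic is presumably a one-sample t of the per-seed BPBs against the baseline; with only 3 seeds, the very small std (0.0004) is what produces such a large t. A sketch with illustrative numbers (the baseline value and per-seed results used for this comparison are not listed here):

```python
import math

def one_sample_t(vals: list[float], baseline: float) -> float:
    """t = (baseline - mean) / (s / sqrt(n)), with sample std (n-1 denominator)."""
    n = len(vals)
    mean = sum(vals) / n
    var = sum((x - mean) ** 2 for x in vals) / (n - 1)
    return (baseline - mean) / math.sqrt(var / n)

# illustrative per-seed values, not the PR's actual measurements
t = one_sample_t([1.0, 1.1, 1.2], baseline=1.5)
```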

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aruniyer aruniyer changed the title Update: MLP 3x + QAT + Int6 + Sliding Window (val_bpb 1.1652) Update: 11L MLP3x + WD=0.04 + zstd-22 (val_bpb 1.1502) Mar 20, 2026
@cocohearts cocohearts merged commit b774930 into openai:main Mar 20, 2026
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
…mbed-int6

Update: 11L MLP3x + WD=0.04 + zstd-22 (val_bpb 1.1502)
