Record: 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers) by nanlliu · Pull Request #39 · openai/parameter-golf

nanlliu · 2026-03-19T00:41:35Z

Summary

Two submissions:

1. 10L Mixed Precision (val_bpb=1.2139 mean across 5 seeds)

10 transformer layers (vs baseline 9) with mixed int8/int6 compression
Full int8 for first/last 3 layers, int6 (step=4 rounding) for middle layers 3-6
Lower LR: MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03
Improves post-quant roundtrip validation metrics on the same val set from 1.2244 → 1.2139 bpb and 2.0727 → 2.0496 nats
Artifact: ~15.93MB (under 16MB)

2. Lower LR (val_bpb=1.2230)

Same 9-layer architecture, only LR change

Multi-seed results (5 seeds, p < 0.001)

Seed	val_loss (nats)	val_bpb	Artifact
1337	2.0502	1.2142	15.92MB
42	2.0509	1.2146	15.94MB
123	2.0493	1.2137	15.93MB
7	2.0481	1.2130	15.93MB
2024	2.0493	1.2137	15.94MB
Mean	2.0496	1.2139	15.93MB
Std	0.0010	0.0006

Mean improvement: 0.0231 nats (4.6x the 0.005 threshold)
t-statistic: 39.29 (df=4, one-sided t critical for p<0.001 = 7.17)
p < 0.001
All 5 runs individually beat SOTA by >0.005 nats
All 5 artifacts under 16MB

How mixed precision compression works

10L model has 18.9M params → 17.6MB with standard int8+zlib (over 16MB). By reducing middle layers to int6, compressed size drops to ~15.9MB:

Layer Group	Precision	Reason
Layers 0-2 (early)	int8 (256 levels)	Critical for input processing
Layers 3-6 (middle)	int6 (64 levels)	Less sensitive, saves ~1.6MB
Layers 7-9 (late)	int8 (256 levels)	Critical for output quality

Note on hardware

The baseline run was on 8xH100 and these runs on 8xH200, which reached slightly more training steps (~13.1k vs ~13.0k). This is directionally positive but not a perfectly compute-matched comparison. An H100 verification run is recommended.

Test plan

Post-quant val_bpb verified via int8/int6+zlib roundtrip
Artifact under 16MB (all 5 seeds)
Runs within 600s wallclock
train_gpt.py compiles and runs from records folder
Multiple seed runs: p < 0.001 (5 seeds, t=39.29)
Verify on 8xH100 (pending compute)

Systematic LR sweep showed default Muon/Adam learning rates (0.04) were too high. MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 gives consistent improvement. Same 9L/512d architecture, no other changes.

10 transformer layers (vs baseline 9) with mixed int8/int6 compression: - Full int8 for first/last 3 layers (precision-sensitive) - Int6 (step=4 rounding) for middle layers 3-6 (compression-friendly) - Lower LR: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 - Artifact: 15,928,974 bytes (under 16MB cap) - Improvement: 0.0097 bpb / 0.0217 nats over baseline (1.2244) Also adds PRUNE_RATIO and INT4_LAYERS/INT4_STEP support to train_gpt.py for mixed-precision post-training quantization.

The root script should remain the baseline. Submission-specific modifications (PRUNE_RATIO, INT4_LAYERS, INT4_STEP) only belong in the records/ folder copy.

0hq · 2026-03-19T22:23:54Z

Reversing this, given others have made int6 work here I'll give you credit.

@nanlliu

Added int6 per-row quantization (credit @nanlliu PR openai#39). Switched compression from zlib-9 to zstd-22 (saves ~90KB). Int6 function ready but not yet used in serialization pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@nanlliu

Int6 per-row quantization (credit @nanlliu PR openai#39) saves 25% model size. zstd-22 compression (credit community consensus). MLP 3x fits with int6 (19.2M params in 12.9MB). On 1 GPU, int6 degrades BPB by ~0.016 (not worth it with limited steps). On 8xH100, int6+MLP3x would be the winning formula. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…layers) (openai#39) * Add Lower LR submission: val_bpb=1.2230 (MATRIX_LR=0.02) Systematic LR sweep showed default Muon/Adam learning rates (0.04) were too high. MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 gives consistent improvement. Same 9L/512d architecture, no other changes. * Add 10L Mixed Precision submission: val_bpb=1.2147 10 transformer layers (vs baseline 9) with mixed int8/int6 compression: - Full int8 for first/last 3 layers (precision-sensitive) - Int6 (step=4 rounding) for middle layers 3-6 (compression-friendly) - Lower LR: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 - Artifact: 15,928,974 bytes (under 16MB cap) - Improvement: 0.0097 bpb / 0.0217 nats over baseline (1.2244) Also adds PRUNE_RATIO and INT4_LAYERS/INT4_STEP support to train_gpt.py for mixed-precision post-training quantization. * Revert root train_gpt.py to upstream baseline The root script should remain the baseline. Submission-specific modifications (PRUNE_RATIO, INT4_LAYERS, INT4_STEP) only belong in the records/ folder copy.

@nanlliu

Added int6 per-row quantization (credit @nanlliu PR openai#39). Switched compression from zlib-9 to zstd-22 (saves ~90KB). Int6 function ready but not yet used in serialization pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@nanlliu

Int6 per-row quantization (credit @nanlliu PR openai#39) saves 25% model size. zstd-22 compression (credit community consensus). MLP 3x fits with int6 (19.2M params in 12.9MB). On 1 GPU, int6 degrades BPB by ~0.016 (not worth it with limited steps). On 8xH100, int6+MLP3x would be the winning formula. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@nanlliu

Added int6 per-row quantization (credit @nanlliu PR openai#39). Switched compression from zlib-9 to zstd-22 (saves ~90KB). Int6 function ready but not yet used in serialization pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@nanlliu

Int6 per-row quantization (credit @nanlliu PR openai#39) saves 25% model size. zstd-22 compression (credit community consensus). MLP 3x fits with int6 (19.2M params in 12.9MB). On 1 GPU, int6 degrades BPB by ~0.016 (not worth it with limited steps). On 8xH100, int6+MLP3x would be the winning formula. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…layers) (openai#39) * Add Lower LR submission: val_bpb=1.2230 (MATRIX_LR=0.02) Systematic LR sweep showed default Muon/Adam learning rates (0.04) were too high. MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 gives consistent improvement. Same 9L/512d architecture, no other changes. * Add 10L Mixed Precision submission: val_bpb=1.2147 10 transformer layers (vs baseline 9) with mixed int8/int6 compression: - Full int8 for first/last 3 layers (precision-sensitive) - Int6 (step=4 rounding) for middle layers 3-6 (compression-friendly) - Lower LR: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 - Artifact: 15,928,974 bytes (under 16MB cap) - Improvement: 0.0097 bpb / 0.0217 nats over baseline (1.2244) Also adds PRUNE_RATIO and INT4_LAYERS/INT4_STEP support to train_gpt.py for mixed-precision post-training quantization. * Revert root train_gpt.py to upstream baseline The root script should remain the baseline. Submission-specific modifications (PRUNE_RATIO, INT4_LAYERS, INT4_STEP) only belong in the records/ folder copy.

nanlliu added 2 commits March 19, 2026 00:41

Add Lower LR submission: val_bpb=1.2230 (MATRIX_LR=0.02)

6c0c583

Systematic LR sweep showed default Muon/Adam learning rates (0.04) were too high. MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 gives consistent improvement. Same 9L/512d architecture, no other changes.

nanlliu changed the title ~~Lower LR: val_bpb=1.2230 (MATRIX_LR=0.02)~~ 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers) Mar 19, 2026

Revert root train_gpt.py to upstream baseline

13acc85

The root script should remain the baseline. Submission-specific modifications (PRUNE_RATIO, INT4_LAYERS, INT4_STEP) only belong in the records/ folder copy.

jordankzf mentioned this pull request Mar 19, 2026

Unofficial Leaderboard #83

Closed

nanlliu changed the title ~~10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers)~~ Record: 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers) Mar 19, 2026

0hq closed this Mar 19, 2026

0hq reopened this Mar 19, 2026

0hq merged commit 9ac12c2 into openai:main Mar 19, 2026

notapplica mentioned this pull request Mar 19, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

MatoTeziTanka mentioned this pull request Mar 21, 2026

PROTEUS v4 — non-record submission (val_bpb: 1.2037) #368

Open

mtybadger mentioned this pull request Mar 26, 2026

[non-record] Masked Diffusion Language Model (val_var_bpb=1.625) #820

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers)#39

Record: 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers)#39
0hq merged 3 commits intoopenai:mainfrom
nanlliu:lower-lr-submission

nanlliu commented Mar 19, 2026 •

edited

Loading

Uh oh!

0hq commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nanlliu commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. 10L Mixed Precision (val_bpb=1.2139 mean across 5 seeds)

2. Lower LR (val_bpb=1.2230)

Multi-seed results (5 seeds, p < 0.001)

How mixed precision compression works

Note on hardware

Test plan

Uh oh!

0hq commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nanlliu commented Mar 19, 2026 •

edited

Loading