warmdown-quantization val_bpb = 1.2154 #61

Merged
0hq merged 2 commits into openai:main from saml212:sam/warmdown-quantization
Mar 19, 2026

Conversation


@saml212 saml212 commented Mar 19, 2026

Score

val_bpb = 1.2154 (baseline: 1.2244, improvement: 0.009 BPB / 0.017 nats)

Key Finding

Post-training int8 quantization is the dominant BPB bottleneck on 8xH100. The quantization penalty alone (0.014 BPB at default settings) is larger than most hyperparameter improvements combined. We reduce it roughly 3x (0.014 to 0.005 BPB) via an always-decaying LR schedule.

Novel Contributions

1. Always-decaying LR schedule (WARMDOWN_ITERS=20000): With ~12,200 actual steps, the LR decays linearly from 61% of peak at step 0 to near-zero at the final step. Post-quant penalty drops from 0.014 to 0.005 BPB. Full curve mapped across 10 warmdown values (2400-30000).

2. FP16 tied embeddings: Keep tok_emb.weight in fp16 during int8 export. Costs ~500KB of artifact size, offset by shrinking MLP_HIDDEN to 992.

3. Optimal NTK-RoPE extrapolation: eval@1408 (1.375x training length) beats eval@2048 on well-trained 8xH100 models. Full curve from 1024 to 2048.

4. Optimizer-warmdown interaction: MUON_BACKEND_STEPS=5 beats 7 at high warmdown (a reversal from low warmdown). When warmdown already smooths the weights, cheaper steps (and thus more of them) beat better orthogonalization.
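The PR doesn't include the schedule code, but a minimal sketch consistent with the stated numbers (WARMDOWN_ITERS=20000, ~12,200 actual steps, LR at 61% of peak at step 0 and near-zero at the final step) would look like this; the function name and signature are illustrative:

```python
def warmdown_lr(step, peak_lr, num_steps=12200, warmdown_iters=20000):
    # Linear warmdown counted back from the final step. Because
    # warmdown_iters exceeds num_steps, the decay covers the whole run:
    # step 0 already starts at num_steps / warmdown_iters = 61% of peak,
    # and the final step lands at zero.
    frac = min(1.0, (num_steps - step) / warmdown_iters)
    return peak_lr * frac
```

With WARMDOWN_ITERS larger than the actual step count, the schedule never plateaus at peak LR, which is the "always-decaying" property credited with shrinking the quantization penalty.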

Configuration

WARMDOWN_ITERS=20000 MATRIX_LR=0.06 TIED_EMBED_LR=0.07 SCALAR_LR=0.06
GRAD_CLIP_NORM=1.0 MUON_BACKEND_STEPS=5 EVAL_SEQ_LEN=1408
+ FP16 tied embedding + MLP_HIDDEN=992

15.91MB artifact. 8xH100 SXM (RunPod). See README.md for full warmdown sweep data.
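The export itself isn't shown in this PR; the following is a sketch of the mixed-precision idea described above, assuming symmetric per-tensor int8 quantization (the actual scheme may differ) with the tied embedding held back in fp16:

```python
import numpy as np

def export_weights(state_dict, keep_fp16=("tok_emb.weight",)):
    # Quantize every tensor to int8 with a symmetric per-tensor scale,
    # except those listed in keep_fp16 (here, the tied embedding),
    # which are stored as fp16 instead.
    out = {}
    for name, w in state_dict.items():
        if name in keep_fp16:
            out[name] = w.astype(np.float16)
        else:
            scale = np.abs(w).max() / 127.0 if w.size else 1.0
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            out[name] = (q, np.float32(scale))
    return out
```

The trade described in the PR is then direct: the fp16 embedding costs one extra byte per element versus int8, which the smaller MLP_HIDDEN=992 pays back.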

Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization
penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate
NTK-RoPE extrapolation (eval@1408).

Full warmdown sweep across 10 values and detailed analysis in README.
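The eval-time extrapolation follows the standard NTK-aware RoPE rescaling; a sketch, where head_dim and the base of 10000 are assumptions (only the 1.375x ratio, i.e. eval@1408 over a 1024-token training length, comes from the PR):

```python
def ntk_rope_freqs(head_dim, eval_len, train_len=1024, base=10000.0):
    # NTK-aware RoPE: when evaluating past the training context, rescale
    # the frequency base so low frequencies stretch to cover the longer
    # window. scale = eval_len / train_len (1.375 for eval@1408).
    scale = max(1.0, eval_len / train_len)
    base = base * scale ** (head_dim / (head_dim - 2))
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

At eval_len <= train_len this reduces to plain RoPE, so the same code serves both the 1024 and 1408 eval points in the sweep.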
@saml212 saml212 changed the title Warmdown-Quantization: val_bpb=1.2154 Long-context sliding window: val_bpb=1.1793 Mar 19, 2026
@saml212 saml212 changed the title Long-context sliding window: val_bpb=1.1793 Long-context sliding window: val_bpb=1.1780 Mar 19, 2026
@saml212 saml212 changed the title Long-context sliding window: val_bpb=1.1780 Long-context sliding window: val_bpb=1.1769 Mar 19, 2026
@saml212 saml212 changed the title Long-context sliding window: val_bpb=1.1769 Long-context sliding window: val_bpb=1.1764 Mar 19, 2026
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- fix the PR-audit notes to attribute the long-context branch to PR openai#65 rather than PR openai#61
- record PR openai#61 as schedule-side evidence about long warmdown reducing quantization damage
- keep the ideas backlog aligned with the actual GitHub PR content before using it for next-step decisions
@jordankzf jordankzf mentioned this pull request Mar 19, 2026
@saml212 saml212 force-pushed the sam/warmdown-quantization branch from 50225c5 to 4a65f69 on March 19, 2026 at 15:51
@saml212 saml212 changed the title Long-context sliding window: val_bpb=1.1764 Warmdown-Quantization: val_bpb=1.2154 (novel LR-quantization co-optimization) Mar 19, 2026
@saml212 saml212 changed the title Warmdown-Quantization: val_bpb=1.2154 (novel LR-quantization co-optimization) Int6 + MLP 3x + sliding window: val_bpb=1.1574 Mar 19, 2026

0hq commented Mar 19, 2026

Confused, your title says 1.1574 but your header says "val_bpb = 1.2154 (baseline: 1.2244, improvement: 0.009 BPB / 0.017 nats)"


@0hq 0hq left a comment


Tentatively approving, just to keep leaderboard up to date for others.

@0hq 0hq merged commit 555669e into openai:main Mar 19, 2026

0hq commented Mar 19, 2026

Before I officially add to leaderboard, mind running again to verify that the 1.1574 result is within noise? 1-2 more runs that show the same result would be great.


0hq commented Mar 19, 2026

Ah sorry I see that this is no longer the latest PR, moving there.

@saml212 saml212 changed the title Int6 + MLP 3x + sliding window: val_bpb=1.1574 warmdown-quantization val_bpb = 1.2154 Mar 20, 2026

saml212 commented Mar 20, 2026

Hey Will, sorry about the confusion on the documentation here. I saw people updating their PRs and thought that would be a good idea, so I went back to update this one, then changed my mind and reverted it, but didn't update the title and comment since I thought nobody would be looking this far back. Missed the title; it's updated now.

scottspace pushed a commit to scottspace/parameter-golf that referenced this pull request Mar 21, 2026
* Warmdown-quantization co-optimization, val_bpb=1.2154

Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization
penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate
NTK-RoPE extrapolation (eval@1408).

Full warmdown sweep across 10 values and detailed analysis in README.

* breakthrough: 1.1574 BPB via int6 + MLP 3x + sliding window stride=256

---------

Co-authored-by: Sam Larson <saml212@users.noreply.github.com>
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
