Record: 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598) #63
Merged
cocohearts merged 4 commits into openai:main on Mar 20, 2026
Conversation
South-33 added a commit to South-33/parameter-golf that referenced this pull request on Mar 19, 2026
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
Author
Updated submission to val_bpb 1.1598 (3-seed mean, sliding window, stride=64). Key techniques: 10 layers, STE int6 QAT (zero quant gap), full int6 + zstd-22 export, MLP width 1344, fp16 tied embedding, Muon momentum 0.99, seq len 2048, grad clip 0.3. All constraints met (15.56 MB artifact, 600 s training, 370 s eval). Ready for review.
3-seed validation: mean 1.2067 BPB (std 0.00044), improvement 0.0353 nats over baseline, t=-70.69 (p << 0.01). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
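The significance claim above comes from a small-sample t-test over per-seed validation losses. A minimal sketch of that computation, using hypothetical per-seed values (the PR reports only the mean and std, not the individual seeds, and the baseline value below is a placeholder):

```python
import math
import statistics

def seed_t_stat(seed_losses, baseline_loss):
    """One-sample t statistic: is the mean seed loss below the baseline?

    Negative t means the runs beat the baseline. With n seeds there are
    n - 1 degrees of freedom (n = 3 here, so df = 2).
    """
    n = len(seed_losses)
    mean = statistics.fmean(seed_losses)
    std = statistics.stdev(seed_losses)  # sample std (ddof=1)
    return (mean - baseline_loss) / (std / math.sqrt(n))

# Hypothetical per-seed losses and baseline, chosen to roughly match the
# reported mean 1.2067 and std 0.00044; not the PR's actual numbers.
losses = [1.2063, 1.2067, 1.2071]
t = seed_t_stat(losses, baseline_loss=1.2420)
```

With only 3 seeds the t statistic is large in magnitude whenever the seed-to-seed std is tiny relative to the gap from the baseline, which is why the reported |t| values are in the double or triple digits.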
V3: Added 10th layer with mixed int8/int6 quantization (middle layers), plus sliding window evaluation (stride=64). 3-seed mean 1.1793 BPB, improvement 0.0815 nats over baseline, t=-137 (p << 0.01). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
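Sliding-window evaluation advances a fixed-size context window by a small stride and counts loss only on the newest tokens of each window, so every token after the first window is scored with near-full left context. A minimal index-bookkeeping sketch (window and stride sizes assumed from the PR text; not the repo's actual eval code):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) triples for sliding-window eval.

    Each window covers tokens [start, end); loss is counted only on
    [score_from, end), so every token is scored exactly once but sees up
    to `window - stride` tokens of extra left context.
    """
    spans = []
    start = 0
    while start + window <= n_tokens:
        end = start + window
        # The first window must score all its tokens; later windows
        # score only the `stride` tokens that are new.
        score_from = start if start == 0 else end - stride
        spans.append((start, end, score_from))
        start += stride
    return spans

spans = sliding_window_spans(4096)
```

A small stride costs proportionally more forward passes (one per 64 scored tokens instead of one per 2048), which is why the eval-time budget (370 s) matters for this technique.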
V4b: Full int6 quantization [-31,31] + zstd-22 compression enables MLP expansion to 1344 (2.6x). Muon momentum 0.99, LR 0.02, grad clip 0.3. 3-seed mean 1.1632 BPB (sliding window), 0.1087 nats over baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
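The export path described here quantizes weights to the symmetric int6 range [-31, 31] and then byte-compresses the result, spending the saved bytes on a wider MLP. An illustrative sketch (per-tensor scaling and one-byte storage assumed; a real exporter would bit-pack 6-bit values). It uses zstd level 22 via the `zstandard` package when available, falling back to stdlib `zlib` purely so the sketch runs anywhere:

```python
import zlib

try:
    import zstandard  # third-party; `pip install zstandard`

    def compress(raw: bytes) -> bytes:
        return zstandard.ZstdCompressor(level=22).compress(raw)
except ImportError:
    def compress(raw: bytes) -> bytes:  # illustration-only fallback
        return zlib.compress(raw, level=9)

def quantize_int6(weights, qmax=31):
    """Symmetric per-tensor int6 quantization to the range [-31, 31]."""
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def export(weights):
    q, scale = quantize_int6(weights)
    # One unsigned byte per value (offset by 32) keeps the sketch simple.
    raw = bytes((v + 32) & 0xFF for v in q)
    return compress(raw), scale
```

Entropy coding is what makes the trade work: quantized weight distributions are heavily peaked around zero, so the compressed artifact is much smaller than 6 bits per weight, funding the 2.6x MLP expansion within the 15.56 MB budget.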
V6b: Added straight-through estimator fake int6 quantization during training. Completely eliminates quantization gap (pre-quant = post-quant). 3-seed mean 1.1598 BPB (sliding window), beating previous leader (1.1605). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
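Straight-through-estimator (STE) fake quantization rounds weights to the int6 grid in the forward pass while the backward pass treats the rounding as identity, so gradients flow to the full-precision weights. Because training always sees exactly the values the export will produce, the post-quantization gap goes to zero. A dependency-free sketch of the forward math (scale handling assumed; in an autograd framework the STE is typically written as `w + (fq - w).detach()`):

```python
def fake_quant_int6(w, scale, qmax=31):
    """Forward pass of STE fake quantization: quantize then dequantize.

    Rounds w to the nearest point on the int6 grid {-31..31} * scale.
    The backward pass in an autograd framework would ignore the rounding
    and pass the gradient straight through to w.
    """
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale

w = 0.1234
fq = fake_quant_int6(w, scale=0.01)   # snaps to the 0.01-spaced grid
fq2 = fake_quant_int6(fq, scale=0.01) # grid points are fixed points
```

Idempotence is the property behind "pre-quant = post-quant": once training has converged onto the grid, the export-time quantization changes nothing.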
d360e10 to 510e3f6
Collaborator
Do you mind creating a new PR when you edit things? See the chronological credit note in the FAQ. Otherwise, keep it as [WIP] or a draft PR until it's fully ready and locked.
Author
Yes! Would you like me to remove the additional commits past the ready-for-review mark? This is ready now, so I won't make any more changes here.
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request on Mar 21, 2026
Record: 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598)
Summary
11 techniques stacked on the Naive Baseline, achieving mean val_bpb 1.1598 (3 seeds).
Results
Mean val_bpb (sliding): 1.1598 (std: 0.00120)
Quant gap: 0.0000 — STE QAT completely eliminated the quantization loss.
Statistical significance vs baseline (2.0727 val_loss).
Hardware: 8xH100 80GB HBM3, PyTorch 2.8.0+cu128, ~72ms/step.
Requires: `pip install zstandard`
Test plan