
Record: 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598)#63

Merged
cocohearts merged 4 commits into openai:main from yahya010:submission/seq2048-fp16emb
Mar 20, 2026
Conversation

yahya010 commented Mar 19, 2026

Summary

11 techniques stacked on the Naive Baseline, achieving mean val_bpb 1.1598 (3 seeds):

  1. 10 transformer layers (from 9)
  2. STE int6 QAT — fake quantization during training eliminates quant gap entirely (pre-quant = post-quant)
  3. Full int6 quantization [-31,31] + zstd-22 compression
  4. MLP hidden 1344 (2.625x model_dim) — enabled by int6+zstd savings
  5. FP16 tied embedding passthrough
  6. Sequence length 2048
  7. Muon momentum 0.99 (warmup from 0.92 over 1500 steps)
  8. Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02
  9. Gradient clipping 0.3
  10. Warmdown 3600
  11. Sliding window evaluation stride=64

Results

| Seed | Steps  | val_bpb (standard) | val_bpb (sliding) | Artifact size (bytes) |
|------|--------|--------------------|-------------------|-----------------------|
| 1337 | 8,319  | 1.1821             | 1.1610            | 15,558,319            |
| 42   | ~8,300 | ~1.1815            | 1.1598            | ~15,558,000           |
| 3    | ~8,300 | ~1.1810            | 1.1586            | ~15,558,000           |

Mean val_bpb (sliding): 1.1598 (std: 0.00120)
Quant gap: 0.0000 — STE QAT completely eliminated quantization loss.

Statistical significance vs baseline (2.0727 val_loss):

  • Improvement: 0.1144 nats, t=-93.6, p << 0.01

Hardware: 8xH100 80GB HBM3, PyTorch 2.8.0+cu128, ~72ms/step.
Requires: pip install zstandard

Test plan

  • 3 seeds on 8xH100, all under 600s wallclock
  • All artifacts under 16MB (15.56MB)
  • Sliding window eval under 600s (~370s)
  • Statistical significance p << 0.01
  • Post-quant roundtrip validation matches
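The sliding-window eval above can be sketched as follows. The model interface (returning `(batch, T, vocab)` logits), byte-level tokens, and the nats-to-bits conversion are assumptions; `stride=64` matches the submission:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor,
                       seq_len: int = 2048, stride: int = 64) -> float:
    """Score every target token exactly once, each with up to seq_len of left
    context. Windows advance by `stride`; only targets not yet scored by an
    earlier window contribute, so most tokens see near-full context.
    Assumes byte-level tokens, so nats / ln(2) = bits per byte."""
    n = tokens.size(0)
    nll, scored = 0.0, 0  # total negative log-likelihood (nats), #targets scored
    for start in range(0, n - 1, stride):
        end = min(start + seq_len, n - 1)
        x, y = tokens[start:end], tokens[start + 1:end + 1]
        logits = model(x.unsqueeze(0)).squeeze(0)           # assumed (T, vocab)
        loss = F.cross_entropy(logits, y, reduction="none")
        new = end - scored                                  # targets not yet counted
        nll += loss[-new:].sum().item()
        scored = end
        if end == n - 1:
            break
    return nll / scored / math.log(2)
```

Compared with chunked evaluation at stride = seq_len, this trades roughly seq_len/stride times more forward passes for better context on every token, which is where the standard-vs-sliding gap in the results table comes from.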

South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
@yahya010 yahya010 changed the title Seq2048 + FP16 Tied Embedding + Tuned LR (val_bpb 1.2067) 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598) Mar 19, 2026
yahya010 (Author) commented

Updated submission to val_bpb 1.1598 (3-seed mean, sliding window stride=64). Key techniques: 10L, STE int6 QAT (zero quant gap), full int6+zstd-22, MLP 1344, fp16 tied embedding, Muon 0.99, seq2048, grad clip 0.3. All constraints met (15.56MB artifact, 600s training, 370s eval). Ready for review.

yahya010 and others added 4 commits March 19, 2026 22:11
3-seed validation: mean 1.2067 BPB (std 0.00044), improvement 0.0353 nats
over baseline, t=-70.69 (p << 0.01).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
V3: Added 10th layer with mixed int8/int6 quantization (middle layers),
plus sliding window evaluation (stride=64). 3-seed mean 1.1793 BPB,
improvement 0.0815 nats over baseline, t=-137 (p << 0.01).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
V4b: Full int6 quantization [-31,31] + zstd-22 compression enables
MLP expansion to 1344 (2.6x). Muon momentum 0.99, LR 0.02, grad clip 0.3.
3-seed mean 1.1632 BPB (sliding window), 0.1087 nats over baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
V6b: Added straight-through estimator fake int6 quantization during
training. Completely eliminates quantization gap (pre-quant = post-quant).
3-seed mean 1.1598 BPB (sliding window), beating previous leader (1.1605).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yahya010 yahya010 force-pushed the submission/seq2048-fp16emb branch from d360e10 to 510e3f6 Compare March 19, 2026 22:11
@yahya010 yahya010 changed the title 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598) Record: 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598) Mar 19, 2026
0hq (Collaborator) commented Mar 19, 2026

Do you mind creating a new PR when you edit things? Note the chronological credit note in the FAQ. Otherwise, keep it as [WIP] or a Draft PR until it's fully ready and locked.

yahya010 (Author) commented

Yes! Would you like me to remove the additional commits past the ready-for-review mark? This is ready as-is, so I won't make any more changes here.

@cocohearts cocohearts merged commit 80f7a21 into openai:main Mar 20, 2026
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Record: 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598)