Record: 11L Int6 QAT + SmearGate + WD 0.038 (val_bpb=1.1502) #192

Open
baudrillardsgh0st wants to merge 2 commits into openai:main from baudrillardsgh0st:submit/11L-qat-smeargate-wd038

Conversation

@baudrillardsgh0st

Summary

  • val_bpb: 1.1502 (single seed 1337), 15.50MB artifact
  • 11-layer GPT with int6 QAT (STE), SmearGate, decoupled Muon WD=0.038
  • Int6-in-int8 containers + zstd-22 compression
  • Sliding window eval (stride=64, batch=32)
  • 7,723 steps at 77ms/step on 8×H100 SXM
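
The sliding window evaluation listed above can be sketched as follows. This is a minimal illustration, not the PR's evaluation code: the function name `sliding_windows` and the `context_len` parameter are hypothetical (the PR states only `stride=64` and `batch=32`), and the sketch assumes each window after the first scores only its final `stride` tokens so that every token is scored exactly once with as much left context as possible.

```python
def sliding_windows(n_tokens, context_len, stride):
    """Return (start, end, score_from) triples covering n_tokens.

    Each window feeds tokens [start, end) to the model, but only the
    tokens in [score_from, end) contribute to the reported loss.
    context_len is an assumed parameter; the PR does not state it.
    """
    end = min(context_len, n_tokens)
    windows = [(0, end, 0)]  # the first window scores everything it sees
    scored_upto = end
    while scored_upto < n_tokens:
        new_end = min(scored_upto + stride, n_tokens)
        start = max(0, new_end - context_len)  # keep as much left context as fits
        windows.append((start, new_end, scored_upto))
        scored_upto = new_end
    return windows


# Small illustrative sizes; stride=64 matches the PR, the rest is made up.
windows = sliding_windows(n_tokens=200, context_len=128, stride=64)
for start, end, score_from in windows:
    print(f"feed [{start}, {end}), score [{score_from}, {end})")
```

Under this reading, `batch=32` would mean that 32 such windows are evaluated per forward pass. A smaller stride gives each scored token more context at the cost of more forward passes.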

Key Techniques

  1. 11 layers (512 dim, 8 heads, 4 KV heads, MLP 3x) — more depth enabled by int6 compression
  2. Int6 QAT: STE fake quantization during the forward pass, which nearly eliminates post-quantization degradation
  3. SmearGate: Learned gate blending current + previous token embedding (~513 params)
  4. Decoupled Muon WD=0.038: Keeps weights small for better int6 quantization
  5. Int6-in-int8 + zstd-22: 26.5M params compressed to 15.5MB
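
The int6 QAT item can be illustrated with a small, framework-free sketch. Everything here is an assumption made for illustration: the per-row symmetric scale (`max_abs / 31`), the helper names `fake_quant_int6`, `ste_backward`, and `qat_step`, and the toy squared-error loss. The PR does not show its quantizer, so this only demonstrates the general idea: the forward pass uses weights rounded to 64 levels, while the backward pass treats the rounding as the identity and updates the full-precision weights.

```python
def fake_quant_int6(row):
    """Quantize a row to signed 6-bit integers and dequantize it again.

    Returns (dequantized_row, integer_codes, scale). The symmetric
    per-row scale is an assumption, not taken from the PR.
    """
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    codes = [max(-32, min(31, round(w / scale))) for w in row]
    return [c * scale for c in codes], codes, scale


def ste_backward(grad_wrt_quantized):
    """Straight-through estimator: pass the gradient through unchanged,
    as if the rounding step had derivative 1."""
    return list(grad_wrt_quantized)


def qat_step(weights, x, target, lr=0.1):
    """One SGD step on a toy loss (w_q . x - target)^2 with fake quantization."""
    w_q, _, _ = fake_quant_int6(weights)            # forward uses quantized weights
    err = sum(wq * xi for wq, xi in zip(w_q, x)) - target
    grad_wq = [2 * err * xi for xi in x]            # gradient w.r.t. quantized weights
    grad_w = ste_backward(grad_wq)                  # reused for the latent weights
    return [w - lr * g for w, g in zip(weights, grad_w)]


weights = [0.5, -0.25, 0.1]
dequant, codes, scale = fake_quant_int6(weights)
updated = qat_step(weights, x=[1.0, 2.0, 3.0], target=1.0)
print(codes, scale, updated)
```

Because the model is trained against the rounded weights, the final export to int6 changes little about what the network computes, which is the effect the list item describes.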

Test plan

  • Single seed run (1337) completed
  • Additional seeds for statistical significance (not yet run; the reported result uses seed 1337 only)
  • Artifact under 16MB (15,495,792 bytes)
  • Training under 10 minutes (600s wall clock)
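
The two budget items can be checked with simple arithmetic from the figures reported in this PR. This is a rough sanity check, not part of the submission: 77 ms is a rounded average step time, the 16 MB limit is assumed here to mean 16,000,000 bytes, and the bits-per-parameter figure assumes the whole artifact is weights, which slightly overstates it if the artifact also contains code.

```python
STEPS = 7_723                 # reported training steps
MS_PER_STEP = 77              # reported average step time (rounded)
ARTIFACT_BYTES = 15_495_792   # reported artifact size
PARAMS = 26.5e6               # reported parameter count (approximate)

train_seconds = STEPS * MS_PER_STEP / 1000
artifact_mb = ARTIFACT_BYTES / 1_000_000          # decimal megabytes
bits_per_param = ARTIFACT_BYTES * 8 / PARAMS      # effective storage cost

print(f"estimated training time: {train_seconds:.1f} s (limit 600 s)")
print(f"artifact size: {artifact_mb:.2f} MB (limit 16 MB)")
print(f"effective bits per parameter: {bits_per_param:.2f}")
```

The estimate lands just under the 600 s wall clock, and the effective cost per parameter comes out below 6 bits, consistent with the compressor removing redundancy beyond the int6 quantization itself.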

🤖 Generated with Claude Code

baudrillardsgh0st and others added 2 commits March 19, 2026 23:59
9L 512dim int6 QAT with STE, SmearGate, Muon weight decay 0.01,
int6-in-int8 zstd22 compression. 14.77MB artifact, 9706 steps @ 61.8ms/step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

11-layer GPT with int6 QAT, SmearGate, and decoupled Muon weight decay 0.038.
Artifact: 15.50MB (int6+zstd-22). Single seed, 7723 steps at 77ms/step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
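
The "int6-in-int8 containers" storage mentioned in this PR can be sketched as below. This is a hedged illustration under stated assumptions: the helper names are hypothetical, the values are uniformly random rather than real weights, and `zlib` from the Python standard library is swapped in for zstd-22 so that the example runs without third-party packages. It is not the PR's export code.

```python
import random
import struct
import zlib


def pack_int6_in_int8(values):
    """Store each signed 6-bit value in its own signed byte."""
    if any(v < -32 or v > 31 for v in values):
        raise ValueError("value outside the signed 6-bit range [-32, 31]")
    return struct.pack(f"{len(values)}b", *values)


def unpack_int8(blob):
    """Inverse of pack_int6_in_int8."""
    return list(struct.unpack(f"{len(blob)}b", blob))


rng = random.Random(0)
values = [rng.randint(-32, 31) for _ in range(4096)]

raw = pack_int6_in_int8(values)        # one byte per value, 2 bits of slack each
compressed = zlib.compress(raw, 9)     # stand-in for zstd at level 22
restored = unpack_int8(zlib.decompress(compressed))

print(len(raw), len(compressed), restored == values)
```

Each byte only ever takes one of 64 values, so a general-purpose entropy coder can recover the unused bits without any custom bit packing. Real weight distributions are not uniform, so the ratio seen here says nothing about the ratio achieved in the PR.
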
baudrillardsgh0st pushed a commit to baudrillardsgh0st/parameter-golf that referenced this pull request Mar 20, 2026
Key improvements over prior submission (openai#192, 1.1502):
- Per-dimension SmearGate (sigmoid(Parameter(dim))) vs scalar gate
- Stochastic Weight Averaging every 50 steps over last 50% of training
- Result: 1.1453 BPB, beating current SOTA (1.1458)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
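
The per-dimension SmearGate named in the commit message above can be sketched in plain Python. Only `sigmoid(Parameter(dim))` comes from the commit message. The blend formula `(1 - g) * x[t] + g * x[t-1]`, the pass-through handling of the first token, and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def smear_gate(embeddings, gate_logits):
    """Blend each token embedding with the previous token's embedding.

    embeddings:  list of T vectors, each of length D
    gate_logits: D learnable logits, one per embedding dimension
                 (a scalar gate corresponds to D copies of one logit)
    """
    gates = [sigmoid(p) for p in gate_logits]
    out = []
    prev = None
    for x in embeddings:
        if prev is None:
            out.append(list(x))  # assumption: first token has nothing to blend with
        else:
            out.append([(1 - g) * xi + g * pi
                        for g, xi, pi in zip(gates, x, prev)])
        prev = x
    return out


tokens = [[2.0, 4.0], [6.0, 8.0]]
half = smear_gate(tokens, gate_logits=[0.0, 0.0])       # gate = 0.5 in every dimension
closed = smear_gate(tokens, gate_logits=[-50.0, -50.0]) # gate ~ 0, near pass-through
print(half, closed)
```

With one logit per dimension, each embedding channel can learn its own mix of current and previous token, whereas a scalar gate applies the same mix to all channels.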