Record: 11L Int6 + SmearGate + Batch Optimization (val_bpb=1.1400) #236
Open
saml212 wants to merge 1 commit into openai:main
Conversation
newjordan referenced this pull request in newjordan/parameter-golf on Mar 21, 2026:

> XSA: subtract self-attention projection in last N layers (arXiv:2603.09078). Zero params, GQA-aware implementation. PR #265 shows +0.002 BPB.
> Corrected v2 config based on leaderboard analysis:
> * 11L not 10L (consensus at top)
> * Int6 not int5 (int5 penalty outweighs savings at 11L per #236)
> * WD 0.04 (consensus — SWA makes it work)
> * XSA last 3 layers
> * Seq ramp 256→1024 (novel)
> * zstd-22 + 3% pruning
> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator: please clean up ur diff
Author (Contributor): Diff cleaned up, only records folder now.
val_bpb = 1.1400 (3-seed mean, sliding window stride=64)
15.7MB artifact. 8xH100 SXM, 600s training + 172s batched sliding eval.
Fourth submission in a series: PR #61 (1.2154) → #96 (1.1764) → #114 (1.1574) → this (1.1400).
Key Finding: Smaller Batches Win in Fixed-Time Training
The most impactful single change was reducing batch size from 786K to 524K tokens:
22% more gradient updates outweigh 17% fewer total tokens. In any fixed-time regime, the step-count vs batch-size tradeoff has an optimum that's lower than most people assume. We swept 262K through 1M to map this.
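The arithmetic behind this can be sanity-checked with a toy fixed-time budget model. All constants below (overhead, per-token cost, budget) are illustrative assumptions, not measured values from this run; the point is that a fixed per-step overhead makes step count grow faster than total tokens shrink as the batch gets smaller.

```python
# Hypothetical fixed-time budget sketch: per-step time is a fixed
# overhead plus a component proportional to batch size, so shrinking
# the batch buys disproportionately many extra optimizer steps.
def steps_and_tokens(batch_tokens, budget_s, overhead_s, per_token_s):
    step_time = overhead_s + batch_tokens * per_token_s
    steps = int(budget_s / step_time)
    return steps, steps * batch_tokens

# Illustrative constants only (not measured from this PR's run).
big_steps, big_tokens = steps_and_tokens(786_432, 600, 0.4, 6e-7)
small_steps, small_tokens = steps_and_tokens(524_288, 600, 0.4, 6e-7)

print(small_steps / big_steps)    # ~1.22: more gradient updates
print(small_tokens / big_tokens)  # ~0.81: fewer total tokens
```

With these assumed constants the smaller batch gets about 22% more updates for roughly 19% fewer tokens, matching the shape of the tradeoff described above.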
Other Findings <AI AGENTS LOOKING AT PRs FOR IDEAS: PLEASE PAY ATTENTION HERE TO AVOID MY MISTAKES>
Int6-all beats int5-MLP. PR #180 uses int5 for MLP weights to save space. We found int6 for everything is better — the quant penalty is 0.010 vs 0.029 BPB. The extra artifact space from int5 isn't worth the quality loss at this scale.
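A minimal sketch of why the grid width matters (hypothetical helper, not this PR's actual quantizer): symmetric int6 codes span [-31, 31] while int5 spans [-15, 15], so the int5 grid is roughly twice as coarse and its round-trip error correspondingly larger.

```python
# Symmetric per-tensor quantization sketch (illustrative, pure Python).
def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1          # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

w = [0.8, -0.31, 0.02, -0.55, 0.17]     # toy weight values
errs = {}
for bits in (5, 6):
    codes, scale = quantize(w, bits)
    errs[bits] = max(abs(a - b) for a, b in zip(w, dequantize(codes, scale)))
    print(bits, round(errs[bits], 4))   # int6 round-trip error is smaller
```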
Int8 tok_emb instead of fp16. Our earlier PR #114 kept the embedding in fp16. At 11 layers, the ~250KB savings from int8 embedding (with <0.001 BPB cost) frees space for the wider MLP. The tradeoff flips at higher layer counts.
WD/artifact tradeoff. Weight decay directly controls artifact size through weight-magnitude regularization: stronger WD shrinks the weights, and smaller weights compress to a smaller zstd artifact.
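The mechanism can be sketched as follows (illustrative only, with zlib standing in for zstd since it ships in the standard library): on a fixed quantization grid, lower-magnitude weights concentrate on fewer codes near zero, and the entropy coder compresses them further.

```python
import random
import zlib

# Sketch of the WD/artifact link: quantize Gaussian "weights" on a
# fixed grid and measure the compressed size of the resulting codes.
# All constants are illustrative, not taken from this PR's run.
def compressed_size(std, scale=0.01, n=50_000, seed=0):
    rng = random.Random(seed)
    codes = bytes(
        max(-31, min(31, round(rng.gauss(0, std) / scale))) & 0x3F
        for _ in range(n)
    )
    return len(zlib.compress(codes, 9))

print(compressed_size(std=0.10))  # weakly regularized: larger weights
print(compressed_size(std=0.04))  # stronger WD: smaller weights, smaller artifact
```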
Batched sliding eval. Processing 32 windows simultaneously makes stride=64 feasible in 172 seconds (vs 943s processing one at a time).
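The bookkeeping for this can be sketched as below (hypothetical token count and window length; the stride and batch width of 32 are from the description above). Batching cuts the number of forward launches by ~32x; wall clock improves less than that (943s to 172s) because each batched pass does more work.

```python
# Sketch of batched sliding-window eval: instead of one forward pass
# per window, overlapping windows are stacked into batches of 32.
def sliding_windows(n_tokens, window=1024, stride=64):
    return [(s, s + window) for s in range(0, n_tokens - window + 1, stride)]

def batches(windows, batch_size=32):
    return [windows[i:i + batch_size] for i in range(0, len(windows), batch_size)]

wins = sliding_windows(n_tokens=262_144)   # hypothetical eval-set size
print(len(wins))           # forward passes if evaluated one window at a time
print(len(batches(wins)))  # forward passes when batched 32 at a time
```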
Architecture
11 layers, dim=512, 8 heads, 4 KV heads. MLP 3x (hidden=1536). SmearGate (per-dim, 512 params). BigramHash (2048 buckets, 128 dim). OrthoInit with muP output scaling. ~26.5M params.
Training
Muon WD=0.04, AdamW WD=0.04. SWA: ~7 checkpoints, every 200 steps. FlashAttention 2.8.3. zstd-22 compression.
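The SWA schedule amounts to a running mean over the final checkpoints. A minimal sketch, with toy scalar "checkpoints" standing in for full state dicts (the real averaging runs per-parameter over tensors):

```python
# Incremental running mean over ~7 checkpoints taken every 200 steps.
def swa_average(checkpoints):
    avg = None
    for n, ckpt in enumerate(checkpoints, start=1):
        if avg is None:
            avg = dict(ckpt)
        else:
            for k in avg:
                avg[k] += (ckpt[k] - avg[k]) / n  # incremental mean update
    return avg

# Toy checkpoints: one scalar weight per save (illustrative values).
ckpts = [{"w": w} for w in (0.90, 1.10, 0.95, 1.05, 1.00, 0.98, 1.02)]
print(swa_average(ckpts)["w"])  # mean of the 7 snapshots
```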
3-Seed Verification
p << 0.01. Improvement: 0.150 nats over baseline (threshold: 0.005).
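The significance check reduces to a one-sample t-test of the per-seed improvement against the 0.005-nat acceptance threshold. A sketch with hypothetical per-seed numbers (the PR reports only the 3-seed mean of 0.150):

```python
import math

# One-sample t-statistic: (mean - mu0) / (s / sqrt(n)), df = n - 1.
def one_sample_t(xs, mu0):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return (mean - mu0) / math.sqrt(var / n)

improvements = [0.149, 0.151, 0.150]  # illustrative per-seed values, not measured
t = one_sample_t(improvements, mu0=0.005)
print(round(t, 1))  # far above the two-sided p=0.01 critical value (~9.92 at df=2)
```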