Int6 MLP3x + Tuned LR + SmearGate + SlidingWindow (val_bpb: 1.1618)#102
Open · unnir wants to merge 2 commits into openai:main
Conversation
5 tasks
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request on Mar 21, 2026:

Credit: @unnir PR openai#102 (SmearGate/BigramHash). Combined with all our innovations for the definitive 8xH100 run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Int6 MLP3x + Tuned LR + SmearGate + SlidingWindow
Summary
Four orthogonal improvements stacked on the baseline 9-layer, 512-dim GPT:
Int6 mixed quantization + zstd-22: Per-row int6 quantization ([-32,31]) on MLP and attention weight matrices, fp16 passthrough for tied embeddings, compressed with zstd level 22. Saves ~4MB vs int8+zlib, enabling wider MLP.
3x MLP expansion: MLP hidden dimension 1536 (3x model_dim), up from baseline 1024 (2x). The freed budget from int6 quantization pays for the extra parameters. Provides ~0.019 BPB improvement.
Tuned optimizer hyperparameters: Halved learning rates (matrix_lr=0.02, scalar_lr=0.02, tied_embed_lr=0.03), higher Muon momentum (0.99, with warmup from 0.92 over 1500 steps), extended warmdown (3000 iterations), and gradient clipping (norm=1.0). These changes improve convergence within the 10-minute training budget.
SmearGate: A learned gate that blends each token's embedding with the previous token's embedding before the first transformer layer, providing bigram-like information at negligible parameter cost (~512 params).
Sliding-window evaluation with stride=64: Each token is scored with nearly full 1024-token context, improving BPB by ~0.03 over non-overlapping evaluation. Eval completes in ~64 seconds on 8xH100.
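The sliding-window evaluation in the last item can be sketched as follows. This is an illustration, not the PR's eval code: the function name `sliding_window_bpb`, the model interface, and the 1:1 token-to-byte ratio are all assumptions.

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=1024, stride=64):
    """Score each token with (nearly) full left context by sliding the
    window in steps of `stride`; only the last `stride` predictions of
    each window (all of them for the first) contribute to the loss.
    Assumes `model(x)` returns logits of shape (1, T, vocab_size)."""
    nll_sum, count = 0.0, 0
    for start in range(0, len(tokens) - 1, stride):
        end = min(start + window, len(tokens))
        x = torch.tensor(tokens[start:end]).unsqueeze(0)  # (1, T)
        logp = torch.log_softmax(model(x), dim=-1)
        # Position t predicts global token start + t + 1; skip positions
        # already scored by the previous window.
        new_from = 0 if start == 0 else window - stride - 1
        for t in range(new_from, x.size(1) - 1):
            nll_sum += -logp[0, t, x[0, t + 1]].item()
            count += 1
        if end == len(tokens):
            break
    # Bits per token; equals bits per byte only under a 1:1 token/byte ratio.
    return nll_sum / count / math.log(2)
```

With stride=64 and window=1024, every scored token outside the first window sees at least 960 tokens of context, at the cost of ~16x more forward passes than non-overlapping evaluation.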
Configuration
Command
Key Metrics
Baseline (pre-quantization): val_loss: 2.0060, val_bpb: 1.1881
After int6 quantization: val_loss: 2.0180, val_bpb: 1.1952 (0.007 BPB quantization degradation)
Sliding-window eval: val_loss: 1.9617, val_bpb: 1.1618 (eval time: 64s)

Approach Details
Quantization Strategy
The key insight: use int6 (6-bit, 64 levels) for the large MLP and attention weight matrices, which are robust to quantization, while keeping the sensitive tied embedding in fp16. This mixed approach gives much better quality than uniform int6 while saving enough space to fit the 3x MLP.
Tied embedding (tok_emb.weight): kept in fp16 (no quantization)
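A rough sketch of per-row int6 quantization as described above. The function names are hypothetical and the exact rounding, bit-packing, and zstd-22 steps live in the PR's code; this only shows the per-row scale idea.

```python
import numpy as np

def quantize_int6_per_row(w):
    """Per-row symmetric quantization to the int6 range [-32, 31].
    Each row gets its own fp16 scale; the resulting codes would then be
    bit-packed (6 bits/value) and compressed with zstd level 22."""
    w = np.asarray(w, dtype=np.float32)
    # One scale per row: map the row's max magnitude to the int6 limit.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int6_per_row(q, scale):
    """Reconstruct approximate fp32 weights from codes and per-row scales."""
    return q.astype(np.float32) * scale.astype(np.float32)
```

Per-row (rather than per-tensor) scales keep the quantization error proportional to each row's magnitude, which is what makes the 64-level grid tolerable for the MLP and attention matrices.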
SmearGate

A simple pre-attention module that blends position t's embedding with position t-1's embedding via a learned sigmoid gate. This provides each position with bigram information before the first attention layer sees it. Cost: 512 learnable parameters.
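A minimal sketch of such a gate, assuming one gate parameter per channel (512 for model_dim=512) and zero padding at position 0; the class name, initialization, and exact blend are guesses, not the PR's implementation.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each token's embedding with the previous token's embedding
    via a per-channel learned sigmoid gate (dim learnable parameters)."""
    def __init__(self, dim=512):
        super().__init__()
        # Negative init so sigmoid(gate) starts small: mostly the current token.
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x):  # x: (batch, seq, dim)
        g = torch.sigmoid(self.gate)
        # Shift embeddings right by one position; position 0 gets zeros.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return (1 - g) * x + g * prev
```

Because the gate is a single vector, it adds essentially no parameter-budget cost while letting the model learn, per channel, how much bigram context to mix in.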
Rejected Approaches