
12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1433)#76

Open
unixmadtoonslab wants to merge 6 commits into openai:main from unixmadtoonslab:submission/int6-wider-mlp-fp16embed-sliding

Conversation


@unixmadtoonslab unixmadtoonslab commented Mar 19, 2026

Summary

12-layer transformer (dim=512, 8H/4KV GQA) achieving 1.14327 BPB (3-seed mean: 1.14375 / 1.14316 / 1.14289).

Key techniques

  • Mixed int5/int6 quantization: int5 per-row for MLP weights (the three always-zero high bits per byte compress better under zstd-22, ~3.8x), int6 per-row for attention weights, fp16 embedding passthrough
  • 12 layers (funded by ~1MB saved from int5 compression)
  • SmearGate: per-dim sigmoid gate blending token with previous token embedding
  • BigramHash: hash embedding for token-pair context (2048 buckets, dim=96)
  • U-Net skip connections: encoder-decoder split with learned per-dim skip weights
  • Orthogonal init with 1/sqrt(2*num_layers) output projection scaling
  • SWA: checkpoint averaging every 50 steps during warmdown
  • Muon optimizer: LR=0.025, momentum=0.98, WD=0.04
  • Warmdown timing fix: ignores torch.compile overhead in step-time estimation
  • PyTorch 2.8 SDPA: ~66ms/step at seq1024 (12 layers)
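The SmearGate described above (a per-dim sigmoid gate blending each token's embedding with its predecessor's) can be sketched roughly as follows; the class name, gate initialization, and shift handling are my assumptions, not the PR's exact code:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Per-dimension sigmoid gate that blends each token's embedding with
    the previous token's embedding (sketch; init value is an assumption)."""
    def __init__(self, dim: int):
        super().__init__()
        # Initialized negative so the gate starts mostly closed (little smearing).
        self.gate = nn.Parameter(torch.full((dim,), -2.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). Shift right by one token; position 0 has no predecessor.
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0, :] = 0.0
        g = torch.sigmoid(self.gate)  # (dim,), broadcast over batch and seq
        return (1.0 - g) * x + g * prev
```

With the gate near zero this reduces to the identity, so the layer can learn to smear only the dimensions where previous-token context helps.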

Results

| Seed | val_bpb | Artifact size |
| --- | --- | --- |
| 1337 | 1.14375 | 15.80 MB |
| 42 | 1.14316 | 15.77 MB |
| 7 | 1.14289 | 16.01 MB |
| Mean | 1.14327 | |

Config

NUM_LAYERS=12, MODEL_DIM=512, NUM_HEADS=8, NUM_KV_HEADS=4, MLP_MULT=3
TRAIN_SEQ_LEN=1024, EVAL_SEQ_LEN=1024, TRAIN_BATCH_TOKENS=524288
MATRIX_LR=0.025, SCALAR_LR=0.025, TIED_EMBED_LR=0.035
MUON_MOMENTUM=0.98, MUON_WD=0.04, ADAM_WD=0.0
WARMDOWN_ITERS=2000, SWA_INTERVAL=50
BIGRAM_VOCAB_SIZE=2048, BIGRAM_DIM=96
MIXED_INT5_ENABLED=1
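The BigramHash component corresponding to `BIGRAM_VOCAB_SIZE=2048` and `BIGRAM_DIM=96` can be sketched as a hashed embedding over (previous token, token) pairs; the multiplicative hash constant and the projection back to model_dim are assumptions for illustration:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash embedding for (prev_token, token) pairs, projected into the
    residual stream (sketch; hash function and projection are assumptions)."""
    def __init__(self, n_buckets: int = 2048, bigram_dim: int = 96,
                 model_dim: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, bigram_dim)
        self.proj = nn.Linear(bigram_dim, model_dim, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) token ids. Pair each token with its predecessor.
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0  # first position has no predecessor
        # Cheap multiplicative hash into a fixed number of buckets.
        h = (prev * 1000003 + ids) % self.n_buckets
        return self.proj(self.emb(h))  # (batch, seq, model_dim)
```

Collisions are tolerated by design: 2048 buckets cannot distinguish all token pairs, but a small learned correction for frequent bigrams is cheap in parameters.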

Test plan

  • 3-seed validation (seeds 1337, 42, 7)
  • All artifacts under the 16 MiB limit
  • Quantized roundtrip validation

Submission by Will DePue

@unixmadtoonslab unixmadtoonslab changed the title [WIP] Int6 + Wider MLP 3x + FP16 Embed + Sliding Window (est. val_bpb ~1.160) Int6 QAT + Wider MLP 3x + FP16 Embed + Sliding Window (val_bpb 1.1599) Mar 20, 2026
@unixmadtoonslab
Author

This submission is ready for review. Merge conflicts have been resolved.

val_bpb: 1.1599 (3-seed mean on 8xH100 SXM, 10 min)

The 'not ready for review' label is stale — I don't have permissions to remove it. This is a complete submission with validated results.

@unixmadtoonslab
Author

Compute Credits Feedback

I want to flag a significant issue with the compute credit program for this challenge.

OpenAI advertises $1,000,000 in compute credits to help participants. I applied for credits through the official form requesting $500 — a modest amount for a competition that requires 8xH100 SXM pods at ~$20/hr. I received $25.

$25 buys roughly one single training run on 8xH100s. That's not enough to even validate a baseline, let alone iterate on ideas. For context, I've spent over $300 out of my own pocket just to develop and test the techniques in this submission (QAT scheduling, int6 quantization, ablation studies across dozens of runs).

I understand compute is expensive and there are many participants, but $25 out of a $1M pool feels disconnected from the stated goal of helping people "get started training their models." A single 8xH100 validation run costs more than the entire grant.

Is this the level of support OpenAI intended for active participants who are pushing the leaderboard? I'd appreciate clarity on the credit allocation process, or an increase that would allow meaningful experimentation.

unixmadtoonslab and others added 3 commits March 20, 2026 10:10
… ~1.160)

Four orthogonal improvements stacked: int6 mixed-precision quantization on
MLP+attention weights with zstd-22 compression, 3x MLP expansion, fp16 tied
embedding passthrough, and sliding window evaluation. Awaiting 8xH100 SXM
compute credits for official run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Key improvements over baseline:
- Delayed QAT: STE fake-quantization only in last 15% of training time,
  allowing model to train at full precision before adapting to quantization
- Symmetric int6 clip range [-31, 31] instead of asymmetric [-32, 31]
- Wider MLP (3x), tuned LR=0.025, momentum=0.99 with 1500-step warmup
- Sliding window eval with stride=64 for better BPB measurement
- fp16 embedding passthrough (tok_emb kept unquantized)

3-seed validation (seeds 1337, 42, 7):
  1.15924, 1.15980, 1.16066 → mean 1.15990 BPB

Beats the current #1 (PR openai#88) at 1.1605 BPB.
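The delayed QAT described in this commit (STE fake-quantization only in the last 15% of training, symmetric int6 clip [-31, 31]) can be sketched as follows; the function names and scheduling helper are assumptions:

```python
import torch

def fake_quant_ste(w: torch.Tensor, qmax: int = 31) -> torch.Tensor:
    """Symmetric fake quantization with a straight-through estimator:
    the forward pass sees quantized weights, the backward pass passes
    gradients through unchanged. qmax=31 gives the symmetric clip [-31, 31]."""
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    wq = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # STE: detach the quantization residual so grad(output) == grad(w).
    return w + (wq - w).detach()

def maybe_quantize(w: torch.Tensor, step: int, total_steps: int,
                   qat_frac: float = 0.15) -> torch.Tensor:
    """Apply fake quantization only in the final qat_frac of training,
    letting the model train at full precision first."""
    if step >= int(total_steps * (1.0 - qat_frac)):
        return fake_quant_ste(w)
    return w
```

The symmetric range avoids the asymmetric [-32, 31] bias of naive two's-complement clipping, so positive and negative weights are treated identically.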
@unixmadtoonslab unixmadtoonslab force-pushed the submission/int6-wider-mlp-fp16embed-sliding branch from f9ab40a to 0d64b28 Compare March 20, 2026 10:11
@unixmadtoonslab unixmadtoonslab changed the title Int6 QAT + Wider MLP 3x + FP16 Embed + Sliding Window (val_bpb 1.1599) Int6 QAT + Wider MLP 3x + FP16 Embed + Sliding Window + SmearGate + BigramHash + SWA + OrthoInit (val_bpb 1.1568) Mar 20, 2026
@unixmadtoonslab unixmadtoonslab changed the title Int6 QAT + Wider MLP 3x + FP16 Embed + Sliding Window + SmearGate + BigramHash + SWA + OrthoInit (val_bpb 1.1568) Int6 11L + SmearGate + BigramHash + SWA + OrthoInit + MuonWD (val_bpb 1.1555) Mar 20, 2026
graalolwest and others added 2 commits March 20, 2026 15:12
…BPB)

Major improvements over v6 baseline (1.1599 -> 1.1555 BPB):
- 11 layers with orthogonal init (1/sqrt(2*N) output scaling)
- SmearGate: blend token embeddings with previous token via learned gate
- BigramHash: 4096-bucket hash embedding for token-pair context
- Stochastic Weight Averaging during warmdown (interval=100)
- Separate Muon/Adam weight decay (muon_wd=0.04, adam_wd=0.0)
- FA3/SDPA dual attention path with NTK-aware RoPE
- GQA support (8 heads, 4 KV heads)
- QAT fraction configurable (disabled by default - fixes STE bug)
- Higher LR (0.03) with lower momentum (0.97)
- All hyperparameters configurable via environment variables

3-seed validation: 1.15520, 1.15492, 1.15649 (mean 1.15554)
Artifact size: ~14.5MB (1.5MB headroom under 16MB limit)
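The stochastic weight averaging during warmdown mentioned above can be sketched as a running equal-weight mean of the parameters, updated every `interval` steps (interval=100 in this commit, 50 in the final config); the class name and update hook are assumptions:

```python
import torch
import torch.nn as nn

class SWAAverager:
    """Running equal-weight average of model parameters, updated every
    `interval` steps during warmdown (sketch of the SWA scheme above)."""
    def __init__(self, model: nn.Module, interval: int = 100):
        self.interval = interval
        self.n = 0
        self.avg = {k: v.detach().clone() for k, v in model.state_dict().items()}

    def maybe_update(self, model: nn.Module, step: int):
        if step % self.interval != 0:
            return
        self.n += 1
        for k, v in model.state_dict().items():
            # Running mean: avg += (v - avg) / n
            self.avg[k] += (v.detach() - self.avg[k]) / self.n
```

At the end of training, `self.avg` is loaded back into the model for evaluation; averaging late checkpoints tends to land in a flatter, better-generalizing region of the loss surface.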
…B, 3-seed mean)

Key improvements over v8d (1.1555):
- 12 layers (was 11) funded by int5 compression savings
- Mixed int5/int6 quantization: int5 for MLP weights (better zstd ratio)
- LR=0.025, momentum=0.97, warmdown=2000, SWA/50
- Warmdown fix: ignore torch.compile overhead in step timing
- PyTorch 2.8 SDPA: ~59ms/step at seq1024 (was 81ms on 2.4)

3-seed validation: 1.14618 / 1.14768 / 1.14641 = mean 1.14676

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
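The warmdown fix (ignore torch.compile overhead in step timing) can be sketched as a step-time estimator that discards the first few steps, whose wall-clock times are inflated by compilation; `skip_steps` and the use of a median are my assumptions:

```python
import time

class StepTimer:
    """Estimates steady-state step time while skipping the first few steps,
    which include torch.compile warmup (sketch of the warmdown timing fix)."""
    def __init__(self, skip_steps: int = 10):
        self.skip_steps = skip_steps
        self.times = []
        self.step = 0
        self.last = None

    def tick(self, now=None):
        """Call once per training step; `now` overrides the clock for testing."""
        if now is None:
            now = time.perf_counter()
        if self.last is not None:
            self.step += 1
            if self.step > self.skip_steps:  # drop compile-inflated steps
                self.times.append(now - self.last)
        self.last = now

    def avg_step_time(self):
        """Median of recorded step times (robust to occasional stragglers)."""
        if not self.times:
            return None
        return sorted(self.times)[len(self.times) // 2]
```

Without this, a multi-second first compiled step skews the estimate and the warmdown schedule kicks in too early, cutting the learning rate before the time budget requires it.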
@unixmadtoonslab unixmadtoonslab changed the title Int6 11L + SmearGate + BigramHash + SWA + OrthoInit + MuonWD (val_bpb 1.1555) 12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1468) Mar 20, 2026
Major improvement from momentum 0.97→0.98 and reduced bigram to fit 16MB.
3-seed: 1.14375 / 1.14316 / 1.14289 = mean 1.14327 BPB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@unixmadtoonslab unixmadtoonslab changed the title 12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1468) 12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1433) Mar 21, 2026
@cocohearts
Collaborator

@unixmadtoonslab we'll be handing out higher compute credit grants shortly

@cocohearts
Collaborator

@unixmadtoonslab pls clean ur diff to only include sota submission

chrislovescoding added a commit to chrislovescoding/parameter-golf that referenced this pull request Mar 25, 2026
Int6 per-row quantization (QUANT_RANGE=31) + zstd-22 compression
fits MLP 3x in 16MB. seq1024 for max steps (~12K on 8xH100).
Sliding window stride=64. Muon 0.99, LR=0.02, warmdown=3000.
FP16 embedding. No QAT (overhead not worth it per PR openai#76).
Targets ~1.16 BPB matching top submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>