
Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483) #162

Merged
cocohearts merged 2 commits into openai:main from
raahilshah:submission/2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA
Mar 20, 2026

Conversation

@raahilshah

Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Muon WD + SWA

Mean val_bpb: 1.1483 (3 seeds: 1.1488, 1.1485, 1.1476)

Trained on 8×H100 SXM in 600 seconds. 15.92MB artifact (int6+zstd-22).

Key Techniques

  1. Per-row int6 quantization ([-32,31]) on MLP + attention weights, fp16 passthrough for tied embeddings and last-layer key projection. zstd level 22 compression.
  2. 3× MLP expansion (hidden=1536) — enabled by int6 byte savings. Single largest improvement source.
  3. SmearGate — learned gate blending each token embedding with the previous token's (~512 params).
  4. BigramHash embedding — 4096-bucket hash table (dim=128→512) for token-pair context (~524K params).
  5. Orthogonal init + muP scaling — orthogonal weight init, output projections scaled by 1/√(2·num_layers).
  6. Muon WD=0.02 with momentum warmup 0.92→0.99 over 1500 steps. AdamW WD=0.01 for embeddings/scalars.
  7. SWA over last 50% of training (every 200 steps) — smoother weights, better quantization.
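The per-row int6 scheme in technique 1 can be sketched in a few lines. This is a minimal NumPy illustration of symmetric per-row quantization to [-32, 31], not the PR's actual code; the helper names are made up:

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Per-row symmetric quantization to the int6 range [-32, 31].

    Each row gets its own scale, so a single outlier row does not blow up
    the quantization error of every other row in the matrix.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # avoid div-by-zero for all-zero rows
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)
```

The int6 codes would then be bit-packed and zstd-compressed for the artifact; per-row dequantization error is bounded by half a quantization step.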

Hyperparameters

9 layers, 512 dim, MLP 3×, seq_len=2048, batch=786K, warmdown=3000, matrix_lr=0.02, grad_clip=0.3, muon_momentum=0.99.

Metrics

| Seed | val_loss | val_bpb |
|------|----------|---------|
| 1337 | 1.93978  | 1.14885 |
| 42   | 1.93923  | 1.14852 |
| 7    | 1.93762  | 1.14757 |
| Mean | 1.93888  | 1.14831 |
  • Pre-quant val_bpb: 1.1640
  • Steps: 7,373 in 600s (81.4 ms/step)
  • Artifact: 15.92MB (int6+zstd-22)
  • Improvement over current SOTA (1.1748): -0.0265 bpb / -0.046 nats
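Note that bpb here is not simply val_loss/ln 2 (the tokenizer packs more than one byte per token), so the bpb improvement converts to nats via the run's own loss-to-bpb ratio. A quick check using the numbers above:

```python
mean_loss, mean_bpb = 1.93888, 1.14831   # from the metrics table
nats_per_bpb = mean_loss / mean_bpb      # ~1.69 nats of loss per bit-per-byte
delta_bpb = 1.1748 - 1.1483              # 0.0265, improvement over prior SOTA
delta_nats = delta_bpb * nats_per_bpb    # ~0.045
```

This lands close to the reported -0.046 nats; the small gap depends on which run's loss/bpb ratio is used for the conversion.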


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14cdf6f7a4


```python
).reshape(bsz, seq_len)
for i, ws in enumerate(batch_ws):
    wlen = wlens[i]
    s = 0 if ws == 0 else max(wlen - stride, 0)
```

P2: Avoid double-counting tail tokens in sliding eval

The sliding-window scorer can count the same validation tokens more than once near the end of the corpus. With `s = 0 if ws == 0 else max(wlen - stride, 0)`, any non-first window where `wlen < stride` scores the entire short window, including tokens that were already scored by the previous window (e.g., `seq_len=8`, `stride=4` double-scores the last two tokens). This biases the reported val_loss/val_bpb, which can skew experiment comparisons.
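One way to avoid the double count, sketched here over hypothetical absolute `(start, end)` window spans rather than the PR's `batch_ws`/`wlens` bookkeeping, is to track the first not-yet-scored position and skip everything before it:

```python
def score_start_indices(window_spans):
    """For each (start, end) token span, return the first in-window offset
    whose token has not already been scored by an earlier window."""
    starts = []
    scored_upto = 0  # absolute index of the first unscored token
    for start, end in window_spans:
        starts.append(max(scored_upto - start, 0))
        scored_upto = max(scored_upto, end)
    return starts

# Tail window clamped to the corpus end (seq_len=8, stride=4, 10-token
# corpus): the second window overlaps the first by 6 tokens, so only
# offsets 6..7 (absolute tokens 8..9) are scored, each exactly once.
print(score_start_indices([(0, 8), (2, 10)]))  # [0, 6]
```

This handles both the clamped tail window and the `wlen < stride` case, since the skip is derived from the actual overlap rather than from `wlen - stride`.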


…SWA — improved config (Muon WD=0.04, SWA every 50), mean val_bpb=1.1458
@raahilshah
Author

Updated submission — improved configuration after systematic hyperparameter sweeps:

  • Muon weight decay: 0.02 → 0.04 (swept 0.01–0.05, optimal at 0.04)
  • SWA frequency: every 200 steps → every 50 steps (swept 25–200, optimal at 50; ~30 checkpoint average)

New results (3 seeds):

| Seed | val_loss | val_bpb |
|------|----------|---------|
| 1337 | 1.93492  | 1.14597 |
| 42   | 1.93591  | 1.14656 |
| 7    | 1.93314  | 1.14492 |
| Mean | 1.93466  | 1.14582 |

Previous submission mean was 1.1483 → now 1.1458 (improvement of 0.0025 bpb from tuning WD and SWA frequency alone).
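The SWA schedule described here amounts to an equal-weight running mean of parameter snapshots. A minimal NumPy sketch, where the snapshot cadence, step range, and parameter shapes are illustrative rather than the PR's exact configuration:

```python
import numpy as np

class RunningSWA:
    """Equal-weight running average of parameter snapshots.

    avg_n = avg_{n-1} + (w - avg_{n-1}) / n equals the arithmetic mean of
    all snapshots seen so far, without storing them individually.
    """
    def __init__(self):
        self.n = 0
        self.avg = {}

    def update(self, params):
        self.n += 1
        for name, w in params.items():
            if name not in self.avg:
                self.avg[name] = w.astype(np.float64).copy()
            else:
                self.avg[name] += (w - self.avg[name]) / self.n

# Hypothetical run: snapshot every 50 steps over the second half of training.
swa = RunningSWA()
for step in range(3687, 7374):
    if step % 50 == 0:
        swa.update({"w": np.full(3, float(step))})
```

At eval time the averaged weights replace the live weights before quantization; the smoother averaged solution tends to survive int6 rounding better than any single checkpoint.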

kellyvv added a commit to kellyvv/parameter-golf that referenced this pull request Mar 20, 2026
- SmearGate: learned per-dim gate blending x[t] with x[t-1] (~512 params)
  USE_SMEAR_GATE=1 to enable
- BigramHash: hash(tok[t-1],tok[t]) -> 4096-bucket embed(128) -> proj(512)
  USE_BIGRAM_HASH=1 to enable (~524K params)
- Both disabled by default for backward compatibility
- forward_with_adapter refactored to reuse _forward_body
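The BigramHash path described in this commit (hash the adjacent token pair into 4096 buckets, embed at dim 128, project to the 512-dim residual stream) can be sketched as below. The multiplier constants and initialization are illustrative, not the PR's; the bucket table alone is 4096 × 128 = 524,288 params, matching the ~524K figure:

```python
import numpy as np

BUCKETS, EMB_DIM, MODEL_DIM = 4096, 128, 512
rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, (BUCKETS, EMB_DIM))   # hashed bigram embeddings
bigram_proj = rng.normal(0.0, 0.02, (EMB_DIM, MODEL_DIM))  # up-projection to model dim

def bigram_hash(prev_tok: int, tok: int) -> int:
    # XOR of the two token ids under (illustrative) coprime multipliers,
    # folded into the bucket count.
    return ((prev_tok * 2654435761) ^ (tok * 40503)) % BUCKETS

def bigram_features(tokens):
    """One 512-dim feature per position; position 0 has no bigram, so zeros."""
    out = np.zeros((len(tokens), MODEL_DIM))
    for t in range(1, len(tokens)):
        out[t] = bigram_table[bigram_hash(tokens[t - 1], tokens[t])] @ bigram_proj
    return out

feats = bigram_features([5, 7, 7])
```

In training these features would be added to the token embeddings, giving the model cheap token-pair context without a full bigram table over the vocabulary.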
kellyvv added a commit to kellyvv/parameter-golf that referenced this pull request Mar 20, 2026
- STE QAT: fake quantize->dequantize in CastedLinear forward pass
  Gradients pass through via STE (w + (w_hat - w).detach())
  Activates after STE_QAT_START_FRAC of training (default 25%)
  USE_STE_QAT=1 to enable
- forward_with_adapter refactored to reuse _forward_body
- All Tier 2 features are env-var controlled, disabled by default
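The STE QAT trick in this commit can be sketched without a framework: the forward pass sees quantize-dequantize round-tripped weights, while the backward pass treats the quantizer as identity. A NumPy sketch of the fake-quant forward (the int6 range is assumed to match the base PR; the STE itself needs autograd and is shown only as the identity from the commit message):

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Per-row quantize-dequantize round trip ("fake quantization")."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0
    return np.clip(np.round(w / scale), -32, 31) * scale

# Straight-through estimator from the commit message:
#   w_used = w + stop_gradient(fake_quant_int6(w) - w)
# Forward value equals fake_quant_int6(w); the gradient of w_used w.r.t. w
# is 1, so training feels the quantization error without the zero gradient
# that round() would otherwise produce.

w = np.array([[0.50, -1.00, 0.25]])
w_q = fake_quant_int6(w)
```

Activating this only after a fraction of training (25% by default here) lets the model first converge in full precision, then adapt to the rounding it will face at export.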
@cocohearts merged commit 8b2b17e into openai:main Mar 20, 2026
SkywardSyntax pushed a commit to SkywardSyntax/parameter-golf that referenced this pull request Mar 20, 2026
…1.3260 BPB

Key improvements to train_exp.py:
- BigramHash: XOR hash with coprime multipliers, 128-dim, zero-init, learned scale (matching PR openai#162)
- SmearGate: single gate after embed+RMSNorm (not per-block), fixed gate direction
- SWA early-start bug fix (minimum 100 steps before activation)
- FTLE-lite sensitivity-aware mixed-precision quantization (experimental)
- Eval-time extra recurrence support (not useful for non-shared models)
- Sliding window eval safety: skips if estimated time > 600s

Best A100 results: 1.3260 BPB (9L, sliding window stride=1024, zstd-22)
Previous best was 1.4384 BPB — a 0.112 BPB improvement from bug fixes + eval strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Credit: @notapplica PR openai#60 (Muon WD), @raahilshah PR openai#162 (ortho init).
Weight decay 0.04 regularizes weights for better generalization and
compressibility. Orthogonal init accelerates early convergence.
Grad clip 0.3 stabilizes training.

val_bpb 1.2649, compressed 14.7MB (-0.5MB from weight decay).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Sliding window eval (credit @mattqlf PR openai#50) with configurable stride.
Muon weight decay 0.04 (credit @notapplica PR openai#60).
Orthogonal init with muP scaling (credit @raahilshah PR openai#162).
Gradient clipping at 0.3.

int8 roundtrip val_bpb: 1.2653. Sliding window would add ~0.03 on 8xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 21, 2026
Takes the proven SOTA script exactly (seq2048, MLP 3x, SmearGate,
BigramHash, int6+zstd, SWA, Muon WD 0.02, OrthoInit) and adds
TTT LoRA evaluation. TTT passes base_model directly (compiled).

If TTT works on this architecture: expected ~1.11-1.12 bpb (new record).
If TTT fails (SmearGate/BigramHash incompatibility): 1.1483 baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HyperPotatoNeo added a commit to HyperPotatoNeo/parameter-golf that referenced this pull request Mar 21, 2026
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264),
MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048),
SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer.
Single-seed result (seed=1337), ~8903 steps on 8xH100.
romainsantoli-web pushed a commit to romainsantoli-web/parameter-golf that referenced this pull request Mar 21, 2026
…its)

Combines techniques from PR openai#162, openai#180, openai#267, openai#281:
- 11-layer GPT with U-Net skip connections, GQA
- SmearGate + BigramHash(10240)
- Mixed int5/int6 quantization + 3% magnitude pruning
- Causal TTT at eval time
- SWA(frac=0.4), WD=0.042, Z-loss
- Target: sub-1.135 val_bpb

Awaiting RunPod 8xH100 credits for 3-seed validation.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 21, 2026
Key changes from PR openai#162 base:
- 11 layers (from 9) — enabled by int6 compression headroom
- Full-weight SGD TTT (not LoRA): lr=0.002, momentum=0.9, 3 epochs
  over val data, freeze first 2 blocks for stability
- NTK-RoPE base=50000 (from 10000) for long-context extrapolation
- matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.035
- weight_decay=0.04 (from 0.01)
- BigramHash 2048 buckets (from 4096)
- TTT_ENABLED=1 env var toggle

Target: match FarnsworthEngine's 1.1303 bpb or beat it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
…nt6_MLP3x_SmearGate_BigramHash_MuonWD_SWA

Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 22, 2026
Credit: @notapplica PR openai#60 (Muon WD), @raahilshah PR openai#162 (ortho init).
Weight decay 0.04 regularizes weights for better generalization and
compressibility. Orthogonal init accelerates early convergence.
Grad clip 0.3 stabilizes training.

val_bpb 1.2649, compressed 14.7MB (-0.5MB from weight decay).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 22, 2026
Sliding window eval (credit @mattqlf PR openai#50) with configurable stride.
Muon weight decay 0.04 (credit @notapplica PR openai#60).
Orthogonal init with muP scaling (credit @raahilshah PR openai#162).
Gradient clipping at 0.3.

int8 roundtrip val_bpb: 1.2653. Sliding window would add ~0.03 on 8xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
