Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)#162
Conversation
… (mean val_bpb=1.1483, 3 seeds)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 14cdf6f7a4
```python
).reshape(bsz, seq_len)
for i, ws in enumerate(batch_ws):
    wlen = wlens[i]
    s = 0 if ws == 0 else max(wlen - stride, 0)
```
Avoid double-counting tail tokens in sliding eval
The sliding-window scorer can count the same validation tokens more than once near the end of the corpus. With `s = 0 if ws == 0 else max(wlen - stride, 0)`, any non-first window where `wlen < stride` scores the entire short window, including tokens that were already scored by the previous window (e.g., `seq_len=8`, `stride=4` double-scores the last two tokens). This biases the reported val_loss/val_bpb, which can skew experiment comparisons.
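A minimal sketch of one way to close the overlap, assuming the window layout implied by the snippet (the end-aligned final window and the `prev_end` bookkeeping are illustrative, not the script's actual code): track the absolute end of the already-scored region and start each window's scoring there.

```python
# Score each validation token exactly once: start at the first
# not-yet-scored position instead of the stride-based offset.
seq_len, stride, total = 8, 4, 10
starts = list(range(0, max(total - seq_len, 0) + 1, stride))
if starts[-1] + seq_len < total:     # end-aligned final (possibly short) window
    starts.append(total - seq_len)

scored, prev_end = [], 0
for ws in starts:
    wlen = min(seq_len, total - ws)
    # replaces: s = 0 if ws == 0 else max(wlen - stride, 0)
    s = max(prev_end - ws, 0)
    scored.extend(range(ws + s, ws + wlen))
    prev_end = ws + wlen

assert scored == list(range(total))  # no token double-counted
```

With the buggy rule the second window would rescore tokens 6 and 7; the `prev_end`-based offset makes the scored set exactly one copy of the corpus regardless of how `stride` divides it.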
…SWA — improved config (Muon WD=0.04, SWA every 50), mean val_bpb=1.1458
Updated submission — improved configuration after systematic hyperparameter sweeps:
New results (3 seeds):
Previous submission mean was 1.1483 → now 1.1458, an improvement of 0.0025 bpb from tuning WD and SWA frequency alone.
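As a rough sketch of the "SWA every 50" setting (the class name and interface below are hypothetical, not the script's actual code), stochastic weight averaging keeps a running equal-weight mean of parameter snapshots taken every N steps:

```python
import torch


class SWAAverager:
    """Running equal-weight average of parameter snapshots (hypothetical API)."""

    def __init__(self, every: int = 50):
        self.every, self.n, self.avg = every, 0, None

    @torch.no_grad()
    def maybe_update(self, step: int, params) -> None:
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [p.detach().clone() for p in params]
        else:
            for a, p in zip(self.avg, params):
                a.add_((p - a) / self.n)  # incremental mean update
```

Evaluating with `avg` instead of the live weights smooths late-training noise; the tuning here is the snapshot interval (every 50 steps).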
- SmearGate: learned per-dim gate blending x[t] with x[t-1] (~512 params); `USE_SMEAR_GATE=1` to enable
- BigramHash: hash(tok[t-1], tok[t]) -> 4096-bucket embed(128) -> proj(512); `USE_BIGRAM_HASH=1` to enable (~524K params)
- Both disabled by default for backward compatibility
- forward_with_adapter refactored to reuse _forward_body
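A minimal SmearGate sketch, assuming the blend form `(1 - g) * x[t] + g * x[t-1]` and a near-zero gate initialization (both assumptions; the commit only specifies a learned per-dim gate of ~512 params):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmearGate(nn.Module):
    """Learned per-dim gate blending x[t] with x[t-1] (~dim params)."""

    def __init__(self, dim: int):
        super().__init__()
        # One gate logit per channel; negative init keeps the module
        # close to the identity at the start of training (assumption).
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); previous token, zero-padded at t=0
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
        g = torch.sigmoid(self.gate)
        return (1.0 - g) * x + g * prev
```

At dim=512 this is exactly 512 parameters, matching the stated budget.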
- STE QAT: fake quantize->dequantize in CastedLinear forward pass; gradients pass through via STE (`w + (w_hat - w).detach()`)
- Activates after STE_QAT_START_FRAC of training (default 25%); `USE_STE_QAT=1` to enable
- forward_with_adapter refactored to reuse _forward_body
- All Tier 2 features are env-var controlled, disabled by default
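The STE trick described in this commit can be sketched as follows (symmetric per-tensor scaling is an illustrative assumption; the script's actual quantization scheme may differ):

```python
import torch


def ste_fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Fake quantize->dequantize with a straight-through estimator.

    The forward pass sees the quantized weights, while the backward pass
    treats rounding as the identity: the gradient of
    w + (w_hat - w).detach() with respect to w is exactly 1.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_hat = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_hat - w).detach()
```

Training through the quantizer this way lets the weights adapt to the int6 grid before export, instead of paying the rounding error only at checkpoint time.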
…1.3260 BPB

Key improvements to train_exp.py:
- BigramHash: XOR hash with coprime multipliers, 128-dim, zero-init, learned scale (matching PR openai#162)
- SmearGate: single gate after embed+RMSNorm (not per-block), fixed gate direction
- SWA early-start bug fix (minimum 100 steps before activation)
- FTLE-lite sensitivity-aware mixed-precision quantization (experimental)
- Eval-time extra recurrence support (not useful for non-shared models)
- Sliding-window eval safety: skipped if estimated time > 600s

Best A100 result: 1.3260 BPB (9L, sliding window stride=1024, zstd-22). Previous best was 1.4384 BPB, a 0.112 BPB improvement from bug fixes + eval strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
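A BigramHash sketch matching the commit's description (the specific multiplier constants below are illustrative; the commit only says "XOR hash with coprime multipliers"):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramHash(nn.Module):
    """hash(tok[t-1], tok[t]) -> bucket embedding -> projection to model dim."""

    def __init__(self, n_buckets: int = 4096, e_dim: int = 128, d_model: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.embed = nn.Embedding(n_buckets, e_dim)
        nn.init.zeros_(self.embed.weight)           # zero-init: no-op at start
        self.proj = nn.Linear(e_dim, d_model, bias=False)
        self.scale = nn.Parameter(torch.zeros(()))  # learned scale, zero-init

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64; previous token zero-padded at t=0
        prev = F.pad(tokens, (1, 0))[:, :-1]
        # XOR-mix the two ids with (illustrative) coprime odd multipliers
        h = ((prev * 2654435761) ^ (tokens * 40503)) % self.n_buckets
        return self.scale * self.proj(self.embed(h))
```

The zero-init scale makes the branch an exact no-op at step 0, so enabling it cannot hurt early training; the model learns how much bigram signal to mix in.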
Credit: @notapplica PR openai#60 (Muon WD), @raahilshah PR openai#162 (ortho init). Weight decay 0.04 regularizes weights for better generalization and compressibility. Orthogonal init accelerates early convergence. Grad clip 0.3 stabilizes training. val_bpb 1.2649, compressed 14.7MB (-0.5MB from weight decay). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sliding window eval (credit @mattqlf PR openai#50) with configurable stride. Muon weight decay 0.04 (credit @notapplica PR openai#60). Orthogonal init with muP scaling (credit @raahilshah PR openai#162). Gradient clipping at 0.3. int8 roundtrip val_bpb: 1.2653. Sliding window would add ~0.03 on 8xH100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Takes the proven SOTA script exactly (seq2048, MLP 3x, SmearGate, BigramHash, int6+zstd, SWA, Muon WD 0.02, OrthoInit) and adds TTT LoRA evaluation. TTT passes base_model directly (compiled). If TTT works on this architecture: expected ~1.11-1.12 bpb (new record). If TTT fails (SmearGate/BigramHash incompatibility): 1.1483 baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264), MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048), SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer. Single-seed result (seed=1337), ~8903 steps on 8xH100.
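EMA weight averaging (PR openai#287) keeps a shadow copy of the weights updated as `ema = decay * ema + (1 - decay) * w`; a minimal sketch (the decay value is illustrative):

```python
import torch


@torch.no_grad()
def ema_update(ema_params, model_params, decay: float = 0.999) -> None:
    """In-place exponential moving average of model weights."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```

Calling this once per optimizer step and evaluating the `ema_params` copy trades a small memory overhead for a smoother final checkpoint.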
…its) Combines techniques from PR openai#162, openai#180, openai#267, openai#281: - 11-layer GPT with U-Net skip connections, GQA - SmearGate + BigramHash(10240) - Mixed int5/int6 quantization + 3% magnitude pruning - Causal TTT at eval time - SWA(frac=0.4), WD=0.042, Z-loss - Target: sub-1.135 val_bpb Awaiting RunPod 8xH100 credits for 3-seed validation.
Key changes from PR openai#162 base:
- 11 layers (from 9), enabled by int6 compression headroom
- Full-weight SGD TTT (not LoRA): lr=0.002, momentum=0.9, 3 epochs over val data; first 2 blocks frozen for stability
- NTK-RoPE base=50000 (from 10000) for long-context extrapolation
- matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.035
- weight_decay=0.04 (from 0.01)
- BigramHash 2048 buckets (from 4096)
- `TTT_ENABLED=1` env var toggle

Target: match FarnsworthEngine's 1.1303 bpb or beat it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
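The full-weight SGD TTT described in this commit might look like the following sketch (the `model.blocks` attribute and the batch interface are assumptions, not the actual script):

```python
import torch
import torch.nn.functional as F


def ttt_sgd(model, val_batches, lr=0.002, momentum=0.9, epochs=3, n_freeze=2):
    """Full-weight SGD test-time training sketch (interface assumed).

    Freezes the first n_freeze blocks for stability, then fine-tunes the
    rest with next-token cross-entropy over the validation stream.
    """
    for i, block in enumerate(model.blocks):
        if i < n_freeze:
            for p in block.parameters():
                p.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr, momentum=momentum)
    for _ in range(epochs):
        for x, y in val_batches:
            loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    return model
```

Because TTT optimizes on the same stream it is later scored on, it only makes sense under eval rules that permit test-time adaptation; the frozen early blocks are the commit's stability guard.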
…nt6_MLP3x_SmearGate_BigramHash_MuonWD_SWA Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)
… 3 seeds) AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
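The per-layer lr groups with cosine decay could be set up roughly as follows (the `mlp.c_fc`/`mlp.c_proj` parameter names are assumptions about the model's naming, not confirmed by the source):

```python
import torch


def ttt_param_groups(model, base_lr: float = 1e-3):
    """Per-layer lr groups: 3x base lr for MLP output projections,
    0.5x for input projections, base lr elsewhere (names assumed)."""
    out_p, in_p, rest = [], [], []
    for name, p in model.named_parameters():
        if "mlp.c_proj" in name:
            out_p.append(p)
        elif "mlp.c_fc" in name:
            in_p.append(p)
        else:
            rest.append(p)
    return [
        {"params": out_p, "lr": 3.0 * base_lr},
        {"params": in_p, "lr": 0.5 * base_lr},
        {"params": rest, "lr": base_lr},
    ]


def make_ttt_optimizer(model, base_lr: float = 1e-3, epochs: int = 30):
    opt = torch.optim.AdamW(ttt_param_groups(model, base_lr))
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```

Stepping the scheduler once per epoch anneals every group's lr along the same cosine while preserving the 3x/0.5x/1x ratios between groups.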
Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Muon WD + SWA
Mean val_bpb: 1.1483 (3 seeds: 1.1488, 1.1485, 1.1476)
Trained on 8×H100 SXM in 600 seconds. 15.92MB artifact (int6+zstd-22).
Key Techniques
Hyperparameters
9 layers, 512 dim, MLP 3×, seq2048, batch=786K, warmdown=3000, matrix_lr=0.02, grad_clip=0.3, muon_momentum=0.99.
Metrics