N-gram Cache + Entropy-Adaptive Alpha: 1.0945 BPB #1026
Open
danielxmed wants to merge 14 commits into openai:main from
Conversation
PolyMLP class replaces fixed relu² with Gumbel-Softmax routing among K=4 activations (relu², tanh, SiLU, GELU). Each neuron dynamically selects its activation, converging to near-deterministic choices purely from the language modeling loss.

Key implementation details:
- Additive accumulation (no torch.stack) to avoid 4x memory overhead
- Buffer-based tau for torch.compile(fullgraph=True) compatibility
- Float32 Gumbel noise computation for bf16 numerical stability
- self.training branch: Gumbel at train, plain softmax at eval
- Routing params -> Adam via CONTROL_TENSOR_NAME_PATTERNS
- All routing params <65536 elements -> fp32 quantization passthrough
- Wall-clock-aware tau annealing (1.0 -> 0.1)
- Toggle via POLYGLU_ENABLED=0 for baseline comparison

Addresses all 5 known issues from reference docs:
1. Specific names (routing_alpha/beta, not generic alpha/beta)
2. No Gumbel noise at eval time
3. Correct param count (~158K for 11 layers)
4. No lambdas (explicit activations for torch.compile)
5. Mean pooling per-sequence (deliberate simplification)

Parameter overhead: ~14.4K/layer, ~635KB fp32 for 11 layers (<1% of 16MB)
File: 1197 lines (under 1500 limit), syntax verified
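The routing scheme described above can be sketched as follows. This is a minimal illustration, not the PR's code: the class and attribute names (`PolyMLPSketch`, `routing_logits`), the layer shapes, and the initialization are assumptions; only the mechanics named in the commit message (additive accumulation, buffer-based tau, float32 Gumbel noise, no noise at eval, no lambdas) are taken from it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolyMLPSketch(nn.Module):
    """Sketch of per-neuron Gumbel-Softmax routing among K=4 activations."""

    def __init__(self, dim: int, hidden: int, k: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden, bias=False)
        self.fc2 = nn.Linear(hidden, dim, bias=False)
        # One routing logit per hidden neuron per candidate activation.
        self.routing_logits = nn.Parameter(torch.zeros(hidden, k))
        # Buffer (not a Python float) so torch.compile(fullgraph=True) can trace
        # wall-clock tau annealing without graph breaks.
        self.register_buffer("tau", torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc1(x)
        # Explicit activation calls, no lambdas, for torch.compile compatibility.
        acts = (F.relu(h) ** 2, torch.tanh(h), F.silu(h), F.gelu(h))
        logits = self.routing_logits.float()  # float32 for bf16 stability
        if self.training:
            # Gumbel(0,1) noise computed in float32.
            u = torch.rand_like(logits).clamp_min(1e-9)
            g = -torch.log((-torch.log(u)).clamp_min(1e-9))
            w = F.softmax((logits + g) / self.tau, dim=-1)
        else:
            # No Gumbel noise at eval time: plain tempered softmax.
            w = F.softmax(logits / self.tau, dim=-1)
        w = w.to(h.dtype)
        # Additive accumulation instead of torch.stack avoids materializing a
        # (K, ..., hidden) tensor, i.e. the 4x memory overhead.
        out = torch.zeros_like(h)
        for i, a in enumerate(acts):
            out = out + w[:, i] * a
        return self.fc2(out)
```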
…tion

Replace starter train_gpt.py with the SOTA submission (LeakyReLU² + TTT + Parallel Muon, 1.1194 BPB) and integrate PolyGLU as per-neuron negative slope learning. Each neuron learns its own sigmoid(alpha)-scaled negative slope for LeakyReLU², enabling per-neuron activation specialization with only ~3% compute overhead and 16,896 extra parameters.

Key changes:
- Base: SOTA stack (XSA, Partial RoPE, EMA, int6 GPTQ-lite, Parameter Banking, Legal TTT, sliding window eval)
- PolyMLP: per-neuron learnable negative slope via sigmoid(routing_alpha)
- Routing params exempt from weight decay, use Adam (not Muon)
- POLYGLU_ENABLED=0 recovers exact SOTA baseline behavior
- flash_attn_interface import fallback for compatibility

2xH100 validation (2000 steps):
- Pre-quant val_bpb: 1.2246 (baseline 1.2252, delta -0.0006)
- Step overhead: ~3% (376ms vs 366ms)
- Artifact overhead: ~0.01MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
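A minimal sketch of the per-neuron negative slope idea, assuming the `routing_alpha` name from the commit message. The initialization and the exact squaring convention (how the sign of the negative branch is handled after squaring) are assumptions, not taken from the PR:

```python
import torch
import torch.nn as nn

class PerNeuronLeakyReLU2(nn.Module):
    """Sketch: each hidden neuron learns its own negative slope for LeakyReLU²."""

    def __init__(self, hidden: int):
        super().__init__()
        # sigmoid keeps the slope in (0, 1); init at 0 gives slope 0.5
        # (the PR's actual init is not stated in the commit message).
        self.routing_alpha = nn.Parameter(torch.zeros(hidden))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        slope = torch.sigmoid(self.routing_alpha)   # per-neuron slope in (0, 1)
        leaky = torch.where(h > 0, h, slope * h)    # leaky ReLU, learned slope
        return leaky * leaky                        # squared, relu²-style
```

Since `routing_alpha` has one scalar per hidden neuron, a hidden width of 1536 over 11 layers would give 16,896 parameters, matching the count in the commit message.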
Comprehensive sweep of 100+ configs on 2M tokens:
- Strict backoff beats Kneser-Ney, interpolated, count-weighted
- Alpha=0.50 optimal (linear adaptive: alpha * clamp(nll/2.5, 0.1, 2.0))
- Order 7 sufficient, higher orders add <0.001 BPB
- TTT adds 0 on top of N-gram (-0.00006 vs N-gram's -0.045)
- 2x speedup via uint16 bytes() keys (6.5 min for 62M tokens)

Config: NGRAM_ALPHA=0.50 NGRAM_NLL_THRESHOLD=2.5 NGRAM_MAX_ORDER=7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
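The uint16 bytes() keying trick and strict backoff can be sketched as below. The function names and cache layout are illustrative assumptions; what is taken from the commit is packing each context window into a raw-bytes dict key (cheap to hash compared with tuples of ints) and "strict" backoff, read here as using only the longest matching context with no interpolation across orders:

```python
import numpy as np

def build_ngram_cache(tokens, max_order: int = 7):
    """Count next-token frequencies for every context length up to max_order,
    keying each (order-1)-token context by its raw uint16 bytes."""
    arr = np.asarray(tokens, dtype=np.uint16)
    caches = {o: {} for o in range(2, max_order + 1)}
    for o in range(2, max_order + 1):
        cache = caches[o]
        for i in range(o - 1, len(arr)):
            key = arr[i - o + 1:i].tobytes()   # context as bytes, not a tuple
            nxt = int(arr[i])
            bucket = cache.setdefault(key, {})
            bucket[nxt] = bucket.get(nxt, 0) + 1
    return caches

def predict_strict_backoff(caches, context, max_order: int = 7):
    """Strict backoff: try the longest context first; on a miss, shorten the
    context by one token; return None if no order matches (a cache miss)."""
    ctx = np.asarray(context, dtype=np.uint16)
    for o in range(max_order, 1, -1):
        if len(ctx) >= o - 1:
            bucket = caches[o].get(ctx[len(ctx) - (o - 1):].tobytes())
            if bucket:
                total = sum(bucket.values())
                return {t: c / total for t, c in bucket.items()}
    return None
```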
Full validation set results (1-GPU 954-step model + N-gram):
- Sliding window BPB: 1.2665
- N-gram BPB: 1.2011 (-0.0654)
- Hit rate: 54.6% (33.9M/62M tokens)
- N-gram time: 525s (fits in 10-min eval budget)

On 8xH100 (~1.12 base BPB), expected final BPB: ~1.05-1.06 vs SOTA of 1.1194 — beating by ~0.06 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
N-gram now starts in a background thread immediately after sliding window completes. Since N-gram is CPU-only and remaining evals are GPU-only, they overlap with zero resource contention.

Before: sliding_window(120s) + ngram(525s) = 645s (over budget)
After: sliding_window(120s) + ngram starts → max(120, 525) ≈ 525s (fits)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Order-7 N-gram cache with entropy-adaptive alpha replaces TTT for ~20x more BPB improvement (-0.032 vs -0.0025). 3-seed mean: 1.0945 (std 0.0001), beating SOTA 1.1194 by 0.0249 BPB. All artifacts under 16MB, all eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Preserving the final submission artifact (15.9MB int6+lzma) and all training logs from the 3-seed validation run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Results
Key Innovation
N-gram cache interpolates byte-level N-gram predictions with model logits using entropy-adaptive alpha:
When the model is confident (low NLL), cache contribution is reduced. When uncertain (high NLL), cache contributes more. Runs on CPU in parallel with GPU sliding window eval (~65s).
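The mixing rule above can be sketched as follows, using the adaptive formula from the sweep notes (alpha * clamp(nll/2.5, 0.1, 2.0) with alpha=0.50). How the `nll` signal is obtained (realized per-token NLL versus predictive entropy) is an assumption left as an input here; the function name is hypothetical:

```python
import torch

def mix_with_ngram(model_logits, cache_probs, nll,
                   alpha: float = 0.50, nll_threshold: float = 2.5):
    """Entropy-adaptive interpolation of model and N-gram cache predictions.

    Effective mixing weight per position:
        a = alpha * clamp(nll / nll_threshold, 0.1, 2.0)
    Low nll (confident model)  -> a near 0.05, cache barely contributes.
    High nll (uncertain model) -> a up to 1.0, cache dominates.
    """
    a = (alpha * (nll / nll_threshold).clamp(0.1, 2.0)).clamp(max=1.0)
    a = a.unsqueeze(-1)                         # broadcast over the vocab dim
    model_probs = torch.softmax(model_logits, dim=-1)
    return (1 - a) * model_probs + a * cache_probs
```

With alpha=0.50 the clamp already bounds the weight at 0.5 * 2.0 = 1.0, so the mixture of two probability distributions remains a valid distribution.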
vs TTT
Test plan
🤖 Generated with Claude Code