N-gram Cache + Entropy-Adaptive Alpha: 1.0945 BPB #1026

Open

danielxmed wants to merge 14 commits into openai:main from
danielxmed:submission/ngram-cache-entropy-adaptive-1.0945

Conversation

@danielxmed

Summary

  • 1.0945 BPB (3-seed mean, std 0.0001) — beats current SOTA (1.1194) by 0.0249 BPB
  • Order-7 N-gram cache with entropy-adaptive alpha replaces TTT, yielding a ~13x larger BPB improvement (-0.032 vs -0.0025)
  • All artifacts under 16MB, training ≤10 min, eval ≤10 min on 8×H100 SXM

Results

| Seed | Sliding BPB | N-gram BPB | Artifact (bytes) |
|------|-------------|------------|------------------|
| 1337 | 1.1263 | 1.0944 | 15,863,727 |
| 42 | 1.1268 | 1.0946 | 15,988,183 |
| 2025 | 1.1260 | 1.0945 | 15,974,247 |
| Mean | 1.1264 | 1.0945 (std 0.0001) | |

Key Innovation

The N-gram cache interpolates byte-level N-gram predictions with the model's logits using an entropy-adaptive alpha:

effective_alpha = alpha * clamp(nll / threshold, 0.1, 2.0)

When the model is confident (low NLL), the cache contribution is reduced; when it is uncertain (high NLL), the cache contributes more. The cache runs on CPU in parallel with the GPU sliding-window eval (~65s).
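As a concrete illustration of the adaptive blending rule above, here is a minimal numpy sketch. The function name, shapes, and the use of the model's top log-probability as the NLL confidence proxy are assumptions for illustration, not the submission's actual code:

```python
import numpy as np

def interpolate_with_cache(model_logprobs, ngram_probs, alpha=0.5,
                           nll_threshold=2.5):
    """Blend model and N-gram cache predictions per position.

    model_logprobs: (T, V) model log-probabilities.
    ngram_probs:    (T, V) cache probabilities (all-zero rows = cache miss).
    """
    out = np.empty_like(model_logprobs)
    for t in range(model_logprobs.shape[0]):
        # Low NLL for the model's own argmax -> confident -> shrink alpha;
        # high NLL -> uncertain -> grow it (clamped to [0.1, 2.0] * alpha).
        nll = -model_logprobs[t].max()
        eff_alpha = alpha * np.clip(nll / nll_threshold, 0.1, 2.0)
        model_p = np.exp(model_logprobs[t])
        if ngram_probs[t].sum() > 0:  # cache hit at this position
            blended = (1 - eff_alpha) * model_p + eff_alpha * ngram_probs[t]
        else:
            blended = model_p          # miss: model distribution unchanged
        out[t] = np.log(blended + 1e-12)
    return out
```

Since both inputs are distributions and `eff_alpha` stays in [0, 1] for alpha=0.5, the blend remains a valid probability distribution per position.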

vs TTT

| Method | BPB gain | Eval time |
|--------|----------|-----------|
| TTT (3ep SGD) | -0.0025 | ~410s |
| N-gram cache | -0.0320 | ~65s |

Test plan

  • 3 seeds (1337, 42, 2025) all produce consistent results (std 0.0001)
  • All artifacts under 16,000,000 bytes
  • Training completes within 600s wallclock
  • Total eval (sliding window + N-gram) completes within 600s
  • Logs included for all 3 seeds

🤖 Generated with Claude Code

Daniel Nobrega Medeiros and others added 14 commits March 24, 2026 17:03
PolyMLP class replaces fixed relu² with Gumbel-Softmax routing among
K=4 activations (relu², tanh, SiLU, GELU). Each neuron dynamically
selects its activation, converging to near-deterministic choices purely
from the language modeling loss.

Key implementation details:
- Additive accumulation (no torch.stack) to avoid 4x memory overhead
- Buffer-based tau for torch.compile(fullgraph=True) compatibility
- Float32 Gumbel noise computation for bf16 numerical stability
- self.training branch: Gumbel at train, plain softmax at eval
- Routing params -> Adam via CONTROL_TENSOR_NAME_PATTERNS
- All routing params <65536 elements -> fp32 quantization passthrough
- Wall-clock-aware tau annealing (1.0 -> 0.1)
- Toggle via POLYGLU_ENABLED=0 for baseline comparison

Addresses all 5 known issues from reference docs:
1. Specific names (routing_alpha/beta, not generic alpha/beta)
2. No Gumbel noise at eval time
3. Correct param count (~158K for 11 layers)
4. No lambdas (explicit activations for torch.compile)
5. Mean pooling per-sequence (deliberate simplification)

Parameter overhead: ~14.4K/layer, ~635KB fp32 for 11 layers (<1% of 16MB)
File: 1197 lines (under 1500 limit), syntax verified
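The routing scheme this commit describes can be sketched as follows: a minimal numpy illustration of Gumbel-Softmax selection over K=4 activations with additive accumulation, Gumbel noise only at train time, and a tanh-approximation GELU. Shapes and function names are assumptions, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_route(logits, tau=1.0, training=True):
    """Per-neuron routing weights over K candidate activations.

    logits: (H, K) routing parameters, one row per neuron.
    Returns (H, K) mixing weights summing to 1 per neuron.
    """
    if training:
        # Gumbel(0,1) noise; the commit computes it in float32 for
        # bf16 stability -- here plain float64 numpy.
        u = rng.uniform(1e-9, 1.0, size=logits.shape)
        logits = logits + (-np.log(-np.log(u)))
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)   # stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def poly_act(x, weights):
    """Mix K=4 activations (relu^2, tanh, SiLU, tanh-approx GELU) per
    neuron. x: (H,) pre-activations; weights: (H, K) routing weights."""
    acts = (
        np.maximum(x, 0.0) ** 2,                      # relu^2
        np.tanh(x),                                   # tanh
        x / (1.0 + np.exp(-x)),                       # SiLU
        0.5 * x * (1.0 + np.tanh(0.7978845608         # GELU (tanh approx)
                                 * (x + 0.044715 * x ** 3))),
    )
    out = np.zeros_like(x)
    for k, a in enumerate(acts):
        out += weights[:, k] * a   # additive accumulation, no stack
    return out
```

At eval time (`training=False`) the noise is skipped, matching the commit's "plain softmax at eval" branch; annealing tau toward 0.1 sharpens the weights toward a near-deterministic per-neuron choice.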
…tion

Replace starter train_gpt.py with the SOTA submission (LeakyReLU² + TTT +
Parallel Muon, 1.1194 BPB) and integrate PolyGLU as per-neuron negative
slope learning. Each neuron learns its own sigmoid(alpha)-scaled negative
slope for LeakyReLU², enabling per-neuron activation specialization with
only ~3% compute overhead and 16,896 extra parameters.

Key changes:
- Base: SOTA stack (XSA, Partial RoPE, EMA, int6 GPTQ-lite, Parameter
  Banking, Legal TTT, sliding window eval)
- PolyMLP: per-neuron learnable negative slope via sigmoid(routing_alpha)
- Routing params exempt from weight decay, use Adam (not Muon)
- POLYGLU_ENABLED=0 recovers exact SOTA baseline behavior
- flash_attn_interface import fallback for compatibility

2xH100 validation (2000 steps):
- Pre-quant val_bpb: 1.2246 (baseline 1.2252, delta -0.0006)
- Step overhead: ~3% (376ms vs 366ms)
- Artifact overhead: ~0.01MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
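The per-neuron negative-slope idea from this commit can be sketched in a few lines. The sign-preserving square on the negative branch is an assumption about how "LeakyReLU²" extends relu² below zero; names and shapes are illustrative:

```python
import numpy as np

def leaky_relu_sq(x, routing_alpha):
    """Per-neuron LeakyReLU^2: each neuron scales its negative branch by
    sigmoid(routing_alpha), so slope is learnable and bounded in (0, 1).

    x: (T, H) pre-activations; routing_alpha: (H,) learnable params.
    """
    slope = 1.0 / (1.0 + np.exp(-routing_alpha))  # sigmoid(alpha)
    y = np.maximum(x, 0.0) + slope * np.minimum(x, 0.0)  # per-neuron leaky
    return np.sign(y) * y ** 2  # sign-preserving square (assumption)
```

With `routing_alpha = 0` every neuron starts at slope 0.5; training can push individual neurons toward pure relu²-like (slope → 0) or more linear-negative behavior (slope → 1), giving the per-neuron specialization the commit describes at ~H extra parameters per layer.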
Comprehensive sweep of 100+ configs on 2M tokens:
- Strict backoff beats Kneser-Ney, interpolated, count-weighted
- Alpha=0.50 optimal (linear adaptive: alpha * clamp(nll/2.5, 0.1, 2.0))
- Order 7 sufficient, higher orders add <0.001 BPB
- TTT adds 0 on top of N-gram (-0.00006 vs N-gram's -0.045)
- 2x speedup via uint16 bytes() keys (6.5 min for 62M tokens)

Config: NGRAM_ALPHA=0.50 NGRAM_NLL_THRESHOLD=2.5 NGRAM_MAX_ORDER=7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
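The strict-backoff scheme and the uint16 `bytes()` key trick from this sweep can be sketched as follows. This is an illustrative reimplementation under assumed names, not the submission's code:

```python
import numpy as np

def build_cache(tokens, max_order=7):
    """Count next-byte distributions for contexts of length 1..max_order-1.
    Keys are the raw bytes of a uint16 context slice (fast hashable keys,
    the commit's 2x-speedup trick)."""
    arr = np.asarray(tokens, dtype=np.uint16)
    cache = {}
    for n in range(1, max_order):             # context lengths 1..6
        for i in range(len(arr) - n):
            key = arr[i:i + n].tobytes()
            nxt = int(arr[i + n])
            counts = cache.setdefault(key, {})
            counts[nxt] = counts.get(nxt, 0) + 1
    return cache

def predict(cache, context, max_order=7):
    """Strict backoff: use only the longest matching context; shorter
    orders are consulted only when every longer one misses (no
    interpolation across orders, unlike Kneser-Ney)."""
    ctx = np.asarray(context, dtype=np.uint16)
    for n in range(min(max_order - 1, len(ctx)), 0, -1):
        key = ctx[-n:].tobytes()
        if key in cache:
            counts = cache[key]
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return None  # full miss: fall back to the model alone
```

Cache misses return `None`, which corresponds to the positions where only the model distribution is used (the ~45% of tokens outside the reported 54.6% hit rate).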
Full validation set results (1-GPU 954-step model + N-gram):
- Sliding window BPB: 1.2665
- N-gram BPB: 1.2011 (-0.0654)
- Hit rate: 54.6% (33.9M/62M tokens)
- N-gram time: 525s (fits in 10-min eval budget)

On 8xH100 (~1.12 base BPB), expected final BPB: ~1.05-1.06
vs SOTA of 1.1194 — beating by ~0.06 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
N-gram now starts in a background thread immediately after sliding window
completes. Since N-gram is CPU-only and remaining evals are GPU-only,
they overlap with zero resource contention.

Before: sliding_window(120s) + ngram(525s) = 645s (over budget)
After:  sliding_window(120s) + ngram starts → max(120, 525) ≈ 525s (fits)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
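The overlap described in this commit amounts to launching the CPU-only N-gram pass on a background thread once the sliding-window numbers are in, while the remaining GPU evals proceed on the main thread. A minimal sketch with assumed function names:

```python
import threading

def run_eval(sliding_window_eval, ngram_eval):
    """Run sliding-window eval, then overlap the CPU-only N-gram eval
    (background thread) with any remaining GPU-only work. Callables and
    result keys here are illustrative assumptions."""
    results = {}

    def _ngram():
        results["ngram_bpb"] = ngram_eval()  # pure CPU work

    results["sliding_bpb"] = sliding_window_eval()  # GPU, runs first
    t = threading.Thread(target=_ngram)
    t.start()  # N-gram starts immediately after sliding window completes
    # ... remaining GPU-only evals would run here, overlapped ...
    t.join()
    return results
```

Because the thread touches only CPU and the main thread only GPU, wall-clock time is max(120, 525) ≈ 525s rather than the 645s serial sum, which is what brings the eval back under the 600s budget.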
Order-7 N-gram cache with entropy-adaptive alpha replaces TTT for
a ~13x larger BPB improvement (-0.032 vs -0.0025). 3-seed mean: 1.0945
(std 0.0001), beating SOTA 1.1194 by 0.0249 BPB. All artifacts
under 16MB, all eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Preserving the final submission artifact (15.9MB int6+lzma) and
all training logs from the 3-seed validation run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>