N-gram Cache + Entropy-Adaptive Alpha: 1.0945 BPB #1026

Open

danielxmed wants to merge 14 commits into openai:main from
danielxmed:submission/ngram-cache-entropy-adaptive-1.0945

Conversation

@danielxmed

Summary

  • 1.0945 BPB (3-seed mean, std 0.0001) — beats current SOTA (1.1194) by 0.0249 BPB
  • Order-7 N-gram cache with entropy-adaptive alpha replaces TTT, yielding a ~13x larger BPB improvement (-0.032 vs -0.0025)
  • All artifacts under 16MB, training ≤10 min, eval ≤10 min on 8×H100 SXM

Results

| Seed | Sliding BPB | N-gram BPB | Artifact (bytes) |
|------|-------------|------------|------------------|
| 1337 | 1.1263 | 1.0944 | 15,863,727 |
| 42 | 1.1268 | 1.0946 | 15,988,183 |
| 2025 | 1.1260 | 1.0945 | 15,974,247 |
| Mean | 1.1264 | 1.0945 (std 0.0001) | |

Key Innovation

The N-gram cache interpolates byte-level N-gram predictions with the model's logits using an entropy-adaptive alpha:

effective_alpha = alpha * clamp(nll / threshold, 0.1, 2.0)

When the model is confident (low NLL), the cache contribution is reduced; when it is uncertain (high NLL), the cache contributes more. The cache runs on CPU in parallel with the GPU sliding-window eval (~65s).
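As a concrete illustration of the adaptive blending rule above, here is a minimal numpy sketch. The function name, shapes, and the use of the model's top log-probability as the NLL confidence proxy are assumptions for illustration, not the submission's actual code:

```python
import numpy as np

def interpolate_with_cache(model_logprobs, ngram_probs, alpha=0.5,
                           nll_threshold=2.5):
    """Blend model and N-gram cache predictions per position.

    model_logprobs: (T, V) model log-probabilities.
    ngram_probs:    (T, V) cache probabilities (all-zero rows = cache miss).
    """
    out = np.empty_like(model_logprobs)
    for t in range(model_logprobs.shape[0]):
        # Low NLL for the model's own argmax -> confident -> shrink alpha;
        # high NLL -> uncertain -> grow it (clamped to [0.1, 2.0] * alpha).
        nll = -model_logprobs[t].max()
        eff_alpha = alpha * np.clip(nll / nll_threshold, 0.1, 2.0)
        model_p = np.exp(model_logprobs[t])
        if ngram_probs[t].sum() > 0:  # cache hit at this position
            blended = (1 - eff_alpha) * model_p + eff_alpha * ngram_probs[t]
        else:
            blended = model_p          # miss: model distribution unchanged
        out[t] = np.log(blended + 1e-12)
    return out
```

Since both inputs are distributions and `eff_alpha` stays in [0, 1] for alpha=0.5, the blend remains a valid probability distribution per position.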

vs TTT

| Method | BPB gain | Eval time |
|--------|----------|-----------|
| TTT (3ep SGD) | -0.0025 | ~410s |
| N-gram cache | -0.0320 | ~65s |

Test plan

  • 3 seeds (1337, 42, 2025) all produce consistent results (std 0.0001)
  • All artifacts under 16,000,000 bytes
  • Training completes within 600s wallclock
  • Total eval (sliding window + N-gram) completes within 600s
  • Logs included for all 3 seeds

🤖 Generated with Claude Code

Daniel Nobrega Medeiros and others added 14 commits March 24, 2026 17:03
PolyMLP class replaces fixed relu² with Gumbel-Softmax routing among
K=4 activations (relu², tanh, SiLU, GELU). Each neuron dynamically
selects its activation, converging to near-deterministic choices purely
from the language modeling loss.

Key implementation details:
- Additive accumulation (no torch.stack) to avoid 4x memory overhead
- Buffer-based tau for torch.compile(fullgraph=True) compatibility
- Float32 Gumbel noise computation for bf16 numerical stability
- self.training branch: Gumbel at train, plain softmax at eval
- Routing params -> Adam via CONTROL_TENSOR_NAME_PATTERNS
- All routing params <65536 elements -> fp32 quantization passthrough
- Wall-clock-aware tau annealing (1.0 -> 0.1)
- Toggle via POLYGLU_ENABLED=0 for baseline comparison

Addresses all 5 known issues from reference docs:
1. Specific names (routing_alpha/beta, not generic alpha/beta)
2. No Gumbel noise at eval time
3. Correct param count (~158K for 11 layers)
4. No lambdas (explicit activations for torch.compile)
5. Mean pooling per-sequence (deliberate simplification)

Parameter overhead: ~14.4K/layer, ~635KB fp32 for 11 layers (<1% of 16MB)
File: 1197 lines (under 1500 limit), syntax verified
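The routing scheme this commit describes can be sketched as follows: a minimal numpy illustration of Gumbel-Softmax selection over K=4 activations with additive accumulation, Gumbel noise only at train time, and a tanh-approximation GELU. Shapes and function names are assumptions, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_route(logits, tau=1.0, training=True):
    """Per-neuron routing weights over K candidate activations.

    logits: (H, K) routing parameters, one row per neuron.
    Returns (H, K) mixing weights summing to 1 per neuron.
    """
    if training:
        # Gumbel(0,1) noise; the commit computes it in float32 for
        # bf16 stability -- here plain float64 numpy.
        u = rng.uniform(1e-9, 1.0, size=logits.shape)
        logits = logits + (-np.log(-np.log(u)))
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)   # stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def poly_act(x, weights):
    """Mix K=4 activations (relu^2, tanh, SiLU, tanh-approx GELU) per
    neuron. x: (H,) pre-activations; weights: (H, K) routing weights."""
    acts = (
        np.maximum(x, 0.0) ** 2,                      # relu^2
        np.tanh(x),                                   # tanh
        x / (1.0 + np.exp(-x)),                       # SiLU
        0.5 * x * (1.0 + np.tanh(0.7978845608         # GELU (tanh approx)
                                 * (x + 0.044715 * x ** 3))),
    )
    out = np.zeros_like(x)
    for k, a in enumerate(acts):
        out += weights[:, k] * a   # additive accumulation, no stack
    return out
```

At eval time (`training=False`) the noise is skipped, matching the commit's "plain softmax at eval" branch; annealing tau toward 0.1 sharpens the weights toward a near-deterministic per-neuron choice.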
…tion

Replace starter train_gpt.py with the SOTA submission (LeakyReLU² + TTT +
Parallel Muon, 1.1194 BPB) and integrate PolyGLU as per-neuron negative
slope learning. Each neuron learns its own sigmoid(alpha)-scaled negative
slope for LeakyReLU², enabling per-neuron activation specialization with
only ~3% compute overhead and 16,896 extra parameters.

Key changes:
- Base: SOTA stack (XSA, Partial RoPE, EMA, int6 GPTQ-lite, Parameter
  Banking, Legal TTT, sliding window eval)
- PolyMLP: per-neuron learnable negative slope via sigmoid(routing_alpha)
- Routing params exempt from weight decay, use Adam (not Muon)
- POLYGLU_ENABLED=0 recovers exact SOTA baseline behavior
- flash_attn_interface import fallback for compatibility

2xH100 validation (2000 steps):
- Pre-quant val_bpb: 1.2246 (baseline 1.2252, delta -0.0006)
- Step overhead: ~3% (376ms vs 366ms)
- Artifact overhead: ~0.01MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
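The per-neuron negative-slope idea from this commit can be sketched in a few lines. The sign-preserving square on the negative branch is an assumption about how "LeakyReLU²" extends relu² below zero; names and shapes are illustrative:

```python
import numpy as np

def leaky_relu_sq(x, routing_alpha):
    """Per-neuron LeakyReLU^2: each neuron scales its negative branch by
    sigmoid(routing_alpha), so slope is learnable and bounded in (0, 1).

    x: (T, H) pre-activations; routing_alpha: (H,) learnable params.
    """
    slope = 1.0 / (1.0 + np.exp(-routing_alpha))  # sigmoid(alpha)
    y = np.maximum(x, 0.0) + slope * np.minimum(x, 0.0)  # per-neuron leaky
    return np.sign(y) * y ** 2  # sign-preserving square (assumption)
```

With `routing_alpha = 0` every neuron starts at slope 0.5; training can push individual neurons toward pure relu²-like (slope → 0) or more linear-negative behavior (slope → 1), giving the per-neuron specialization the commit describes at ~H extra parameters per layer.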
Comprehensive sweep of 100+ configs on 2M tokens:
- Strict backoff beats Kneser-Ney, interpolated, count-weighted
- Alpha=0.50 optimal (linear adaptive: alpha * clamp(nll/2.5, 0.1, 2.0))
- Order 7 sufficient, higher orders add <0.001 BPB
- TTT adds 0 on top of N-gram (-0.00006 vs N-gram's -0.045)
- 2x speedup via uint16 bytes() keys (6.5 min for 62M tokens)

Config: NGRAM_ALPHA=0.50 NGRAM_NLL_THRESHOLD=2.5 NGRAM_MAX_ORDER=7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
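The strict-backoff scheme and the uint16 `bytes()` key trick from this sweep can be sketched as follows. This is an illustrative reimplementation under assumed names, not the submission's code:

```python
import numpy as np

def build_cache(tokens, max_order=7):
    """Count next-byte distributions for contexts of length 1..max_order-1.
    Keys are the raw bytes of a uint16 context slice (fast hashable keys,
    the commit's 2x-speedup trick)."""
    arr = np.asarray(tokens, dtype=np.uint16)
    cache = {}
    for n in range(1, max_order):             # context lengths 1..6
        for i in range(len(arr) - n):
            key = arr[i:i + n].tobytes()
            nxt = int(arr[i + n])
            counts = cache.setdefault(key, {})
            counts[nxt] = counts.get(nxt, 0) + 1
    return cache

def predict(cache, context, max_order=7):
    """Strict backoff: use only the longest matching context; shorter
    orders are consulted only when every longer one misses (no
    interpolation across orders, unlike Kneser-Ney)."""
    ctx = np.asarray(context, dtype=np.uint16)
    for n in range(min(max_order - 1, len(ctx)), 0, -1):
        key = ctx[-n:].tobytes()
        if key in cache:
            counts = cache[key]
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return None  # full miss: fall back to the model alone
```

Cache misses return `None`, which corresponds to the positions where only the model distribution is used (the ~45% of tokens outside the reported 54.6% hit rate).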
Full validation set results (1-GPU 954-step model + N-gram):
- Sliding window BPB: 1.2665
- N-gram BPB: 1.2011 (-0.0654)
- Hit rate: 54.6% (33.9M/62M tokens)
- N-gram time: 525s (fits in 10-min eval budget)

On 8xH100 (~1.12 base BPB), expected final BPB: ~1.05-1.06
vs SOTA of 1.1194 — beating by ~0.06 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
N-gram now starts in a background thread immediately after sliding window
completes. Since N-gram is CPU-only and remaining evals are GPU-only,
they overlap with zero resource contention.

Before: sliding_window(120s) + ngram(525s) = 645s (over budget)
After:  sliding_window(120s) + ngram starts → max(120, 525) ≈ 525s (fits)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
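The overlap described in this commit amounts to launching the CPU-only N-gram pass on a background thread once the sliding-window numbers are in, while the remaining GPU evals proceed on the main thread. A minimal sketch with assumed function names:

```python
import threading

def run_eval(sliding_window_eval, ngram_eval):
    """Run sliding-window eval, then overlap the CPU-only N-gram eval
    (background thread) with any remaining GPU-only work. Callables and
    result keys here are illustrative assumptions."""
    results = {}

    def _ngram():
        results["ngram_bpb"] = ngram_eval()  # pure CPU work

    results["sliding_bpb"] = sliding_window_eval()  # GPU, runs first
    t = threading.Thread(target=_ngram)
    t.start()  # N-gram starts immediately after sliding window completes
    # ... remaining GPU-only evals would run here, overlapped ...
    t.join()
    return results
```

Because the thread touches only CPU and the main thread only GPU, wall-clock time is max(120, 525) ≈ 525s rather than the 645s serial sum, which is what brings the eval back under the 600s budget.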
Order-7 N-gram cache with entropy-adaptive alpha replaces TTT for
a ~13x larger BPB improvement (-0.032 vs -0.0025). 3-seed mean: 1.0945
(std 0.0001), beating SOTA 1.1194 by 0.0249 BPB. All artifacts
under 16MB, all eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Preserving the final submission artifact (15.9MB int6+lzma) and
all training logs from the 3-seed validation run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>