Study · March 2026
How to Clean Up All the Parameter Golf Submissions
Parameter Golf is one of the most fun open problems in ML right now — compress language into 16 MB. Recently, n-gram caching pushed reported scores below 0.5 BPB. We dug into the numbers and found something interesting: the probability distribution sums to ~277, not 1. A one-line check in the eval script would catch it. This study presents the math, the experimental evidence, and proposed fixes.
The distribution doesn't sum to 1.
Credit to @Eppie for identifying the probability validity issue, and to Mirco on Discord for the P(cache_bin) formulation.
N-gram caching blends a hash-table ratio with the base model's prediction. The blend is only computed for the correct token. The other 1,023 tokens are never checked. If they were, the distribution would sum to ~277, not 1.0.
The cache stores two hash tables per n-gram order: one counts how often each context appears, one counts how often each (context, token) pair appears. Their ratio — full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] — is meant to approximate P(tok | ctx). But because context-only and context+token hash to independent bucket indices, the ratio doesn't track token frequency. With 1M buckets and 62M tokens, each bucket averages ~62 entries in both tables. The ratio of two similarly-populated buckets approaches 1.0. This is P(cache_bin), not P(tok | ctx).
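A scaled-down simulation (bucket and token counts shrunk ~1000×, preserving the ~62 entries-per-bucket load described above; all names hypothetical) shows why the ratio of two independently hashed, similarly loaded buckets drifts toward 1.0 regardless of the token:

```python
import random

random.seed(0)
B, T = 1_000, 62_000  # scaled down: still ~62 entries per bucket on average
ctx_table = [0] * B   # counts of hash(ctx)
full_table = [0] * B  # counts of hash(ctx, tok) -- an independent bucket index
for _ in range(T):
    ctx_table[random.randrange(B)] += 1
    full_table[random.randrange(B)] += 1

# The "probability" looked up for a random (ctx, tok) pair is the ratio of
# two unrelated, similarly populated buckets.
ratios = [full_table[random.randrange(B)] / ctx_table[random.randrange(B)]
          for _ in range(1_000)]
avg = sum(ratios) / len(ratios)
print(f"average looked-up ratio: {avg:.2f}")  # close to 1.0
```

The token identity never influences the result: both bucket loads are driven by the total insertion count, not by how often that token followed that context.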
The blend (1-α) · p_model + α · P(cache_bin) with P(cache_bin) ≈ 1.0 pushes the correct token's probability up toward p_model + α(1 - p_model). For any token the model predicts better than uniform (p > 1/K), renormalization would strictly decrease its probability. The n-gram contribution doesn't wash out — it actively hurts.
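A toy calculation (K = 1024, α = 0.40, and a hypothetical p_correct = 0.30) makes both claims concrete: the blended vector sums far above 1, and renormalizing it leaves the correct token worse off than the model alone:

```python
K = 1024
alpha = 0.40
p_correct = 0.30                     # better than uniform: p > 1/K
p_other = (1 - p_correct) / (K - 1)  # remaining model mass per other token

# P(cache_bin) ~= 1.0 for EVERY token, not just the correct one
blended_correct = (1 - alpha) * p_correct + alpha * 1.0
blended_other = (1 - alpha) * p_other + alpha * 1.0

total = blended_correct + (K - 1) * blended_other  # = (1 - alpha) + alpha * K
normalized = blended_correct / total

print(f"sum of blended 'distribution': {total:.1f}")        # far above 1.0
print(f"point-eval prob: {blended_correct:.3f} (inflated)")
print(f"renormalized prob: {normalized:.5f} (below p_correct)")
```

Algebraically, the renormalized probability is ((1 − α)p + α) / ((1 − α) + αK), which is below p exactly when p > 1/K, matching the claim above.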
The 1-bucket extreme: P(cache_bin) = T/T = 1.0 for every lookup. With α = 1, the reported BPB is exactly 0: a perfect score, which tells us the metric isn't measuring what we think it is.
The 256M-bucket result (1.1123) is near the float baseline (1.1109), suggesting the genuine contribution of collision-free n-grams is small. The sub-0.5 BPB scores are measurement artifacts from point-evaluating an invalid distribution.
The n-gram-only configuration — hash tables with no neural model — reports 1.0615, below the neural baseline at 1.1109. A frequency table with no learned parameters appears to outcompress a trained language model. This is only possible because the number being reported is not measuring compression.
Direct verification: compute the full distribution.
We ran the n-gram blend for all 1,024 tokens at every scored position and measured the sum. On a fresh 1×H100 base model (1.2711 BPB) with backoff 2-7, α=0.40:
| Metric | Value |
|---|---|
| Baseline BPB (no n-gram) | 1.2711 |
| Reported BPB (point-eval, unnormalized) | 0.5422 |
| Average distribution sum | 277.0 |
| Average NLL after normalization | 6.2906 |
The blended distribution sums to 277, not 1. After normalizing to a valid distribution, the average NLL is 6.29 — far worse than the baseline. The n-gram doesn't help; it actively hurts once you enforce valid probabilities. 62,020,312 positions audited.
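The per-position audit can be sketched like this (function name and shapes hypothetical; the study's actual harness is not shown here). It blends over the full vocabulary, then reports both the sum and the NLL of the renormalized, actually valid distribution:

```python
import torch

def audit_position(p_model, p_ngram, alpha=0.40, correct_token=0):
    """Blend over the FULL vocab, then report the distribution sum and the
    NLL of the renormalized (i.e. valid) distribution."""
    blended = (1 - alpha) * p_model + alpha * p_ngram  # [vocab_size]
    dist_sum = blended.sum().item()
    normalized = blended / blended.sum()
    nll = -torch.log(normalized[correct_token]).item()
    return dist_sum, nll

torch.manual_seed(0)
p_model = torch.softmax(torch.randn(1024), dim=0)  # stand-in model output
p_ngram = torch.ones(1024)                         # P(cache_bin) ~ 1.0 everywhere
dist_sum, nll = audit_position(p_model, p_ngram)
print(f"distribution sum: {dist_sum:.1f}, normalized NLL: {nll:.2f}")
```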
One `assert abs(probs.sum() - 1.0) < 1e-4` in the eval harness catches this instantly. Cost: one `torch.sum` per position, 1–2 seconds for 62M tokens.
How the n-gram cache works.
After each token is scored by the base LM, the token and its preceding context are inserted into hash tables. When a future token's context matches a previously seen n-gram, the cached frequency is mixed with the prediction:
p_mix = (1 − α) · p_model + α · p_ngram

The tables are built from already-scored tokens, so causality is preserved in single-pass implementations. The technique builds 192–256 MB of hash tables during evaluation, none of which counts toward the 16 MB artifact limit.
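A minimal single-pass sketch (structure hypothetical, simplified to one n-gram order and Python's built-in hash) shows where causality comes from: each token is scored before its (context, token) pair is inserted, so lookups only ever see the past:

```python
def single_pass(tokens, p_model_fn, order=3, buckets=1_000_003, alpha=0.40):
    """Score each token, THEN insert it; lookups only ever see the past."""
    ctx_table = [0] * buckets   # counts of contexts
    full_table = [0] * buckets  # counts of (context, token) pairs
    mixed = []
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - order):i])
        h_ctx = hash(ctx) % buckets
        h_full = hash((ctx, tok)) % buckets
        p = p_model_fn(ctx, tok)
        if ctx_table[h_ctx] > 0:  # cache hit: blend in the hash ratio
            p_ngram = full_table[h_full] / ctx_table[h_ctx]
            p = (1 - alpha) * p + alpha * p_ngram
        mixed.append(p)
        ctx_table[h_ctx] += 1     # insert AFTER scoring: causality holds
        full_table[h_full] += 1
    return mixed
```

Two-pass rescoring breaks exactly this ordering: in pass 2, the tables already contain counts from tokens that come after the position being scored.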
Anatomy of the artifact.
The following experiments characterize how the measurement artifact behaves across configurations. The reported BPB numbers are from an invalid distribution — they measure how much P(cache_bin) inflates the correct token's probability, not how well the model compresses language.
Alpha sweep: more weight on the inflated ratio = lower reported BPB.
Higher α gives more weight to P(cache_bin) ≈ 1.0. The reported BPB drops monotonically. This isn't the model deferring to better predictions — it's the blend assigning more weight to the inflated hash ratio.
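The monotonicity is elementary: with P(cache_bin) ≈ 1.0, the point-evaluated bits for the correct token are −log2((1 − α)·p + α), which decreases in α for any p < 1. A quick check with a hypothetical p = 0.30:

```python
import math

p = 0.30  # hypothetical model probability for the correct token
alphas = (0.0, 0.2, 0.4, 0.6, 0.8)
bits = [-math.log2((1 - a) * p + a * 1.0) for a in alphas]  # P(cache_bin) = 1.0
for a, b in zip(alphas, bits):
    print(f"alpha={a:.1f}  reported bits: {b:.3f}")
```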
Order scaling: reported BPB vs max order.
Each additional n-gram order adds another pair of hash tables and changes which order's hash ratio is used for each token. The reported BPB drops with more orders but saturates around 9–12.
Stride decomposition: the artifact magnitude is ~0.62 BPB.
The n-gram “delta” is ~0.62 BPB regardless of sliding window stride. This is the artifact magnitude, not a compression improvement.
8-GPU all-reduce: the artifact fits in the time budget.
With all-reduce sync of hash table deltas, every GPU has the full global cache. Backoff 2-7 at α=0.40 finishes in 401 s, well under the 600 s budget. The inflated scores are not just theoretically possible — they're practically achievable within competition constraints.
A separate question: what should the competition measure?
The distribution issue above is a measurement bug. What follows is a different kind of question: even with valid scores, should the competition constrain what happens at eval time? This is a design conversation, not a mathematical one. Reasonable people can disagree.
The competition gives evaluation 8× H100 and 600 s for a 16 MB model. In deployment, inference is constrained by hardware cost, latency, and concurrent users. These are different environments, and it's interesting to think about whether the competition should reflect deployment realities or stay focused on pure compression research.
With valid distributions and preserved causality, someone could still train a second, larger model during eval via self-distillation, ensemble 8 copies via divergent TTT, or store 63 GB of hidden states as a neural cache. All valid. All causal. All far beyond 16 MB. The eval-time state spectrum shows the scale:
Proposed fixes.
1. Verify the distribution sums to 1.
*Fixing the measurement.* Require the model to produce a full probability vector over all 1,024 tokens at every scored position. The eval script verifies `sum(probs) ≈ 1.0` before scoring. One `torch.sum` per position. Cost: 1–2 seconds for 62M tokens. Negligible.
```python
probs = model.predict(context)        # [vocab_size]
assert abs(probs.sum() - 1.0) < 1e-4  # verify it is a valid distribution
nll = -torch.log(probs[correct_token])
```
Catches every invalid distribution. Passes everything valid: softmax outputs, linear interpolation of valid distributions, Dirichlet-Multinomial, TTT, LoRA, GPTQ. Not n-gram specific.
2. Make causality an explicit rule.
*Aligning with reality.* The FAQ says you can only train on tokens “you've already evaluated your model on.” Two-pass rescoring (PRs #846, #853, #868, #870, #881, #888) violates this: pass 2 rescores token #100 using a cache built from tokens #101 through #62M. Making this a stated rule would clarify the intent.
3. Cap auxiliary eval-time state.
*Aligning with reality.* Constrain auxiliary state: tensors that accumulate during eval and are not derivable from the artifact alone. Not model weights, not KV cache, not activations. A cap of ≤ 32 MB preserves everything currently approved (TTT LoRA at ~2 MB).
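A sketch of what enforcement could look like (meter and cap names hypothetical): sum the bytes of tracked tensors and compare against the proposed cap. The n-gram cache's tables blow past it; a ~2 MB LoRA does not:

```python
import torch

CAP = 32 * 1024 * 1024  # proposed 32 MB cap on auxiliary eval-time state

def auxiliary_state_bytes(tracked):
    # Auxiliary state: tensors accumulated during eval that are not derivable
    # from the 16 MB artifact alone (weights / KV cache / activations exempt).
    return sum(t.numel() * t.element_size() for t in tracked)

ngram_tables = [torch.zeros(64_000_000, dtype=torch.int32)]  # ~256 MB of buckets
ttt_lora = [torch.zeros(500_000)]                            # ~2 MB of adapters
print(auxiliary_state_bytes(ngram_tables) <= CAP)  # n-gram cache: over the cap
print(auxiliary_state_bytes(ttt_lora) <= CAP)      # TTT LoRA: passes
```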
4. Cap per-token overhead.
*Aligning with reality.* Eval-time techniques must not increase per-token latency by more than 50% over the base model forward pass. The base LM on 8× H100 takes 110 s, so a 1.5× cap means 165 s max. The n-gram cache takes 401 s (3.6×).
Full Results
All BPB numbers below are from an invalid distribution. They measure how much P(cache_bin) inflates the correct token's probability, not how well the model compresses language.
Single-GPU configurations.
Base model: PR #728 (~1.12 BPB, 16 MB artifact). Single GPU, stride=64, FineWeb val (62M tokens).
Other observations.
Logistic mixing inflates less than linear
PAQ-style logistic mixing reports 0.75 BPB vs linear mixing's 0.65. The logistic transform compresses the inflated ratio, reducing the artifact magnitude.
Entropy-adaptive alpha reduces the artifact
The sigmoid-gated alpha from PR #727 reports 0.65 BPB vs 0.49 for fixed α=0.40. The entropy gate reduces α when the model is confident, which partially corrects the inflation.
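One plausible form of such a gate (constants and shape hypothetical; PR #727's actual formulation is not shown here): α shrinks when the model's predictive entropy is low, so the inflated ratio gets less weight exactly where the model is already confident:

```python
import torch

def gated_alpha(p_model, alpha_max=0.40, h0=2.0, k=1.5):
    # Shrink alpha when predictive entropy is low (model is confident).
    entropy = -(p_model * torch.log2(p_model.clamp_min(1e-12))).sum()
    return alpha_max * torch.sigmoid(k * (entropy - h0))

confident = torch.full((1024,), 0.01 / 1023)
confident[0] = 0.99                      # low-entropy prediction
uniform = torch.full((1024,), 1 / 1024)  # maximum-entropy prediction
print(f"alpha (confident): {float(gated_alpha(confident)):.3f}")
print(f"alpha (uniform):   {float(gated_alpha(uniform)):.3f}")
```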
Three match categories
The cache matches (a) deterministic BPE subword completion (orders 2-4), (b) common English collocations (orders 4-6), and (c) verbatim document repetition (orders 6+). With valid distributions, the genuine contribution of these matches appears small (256M-bucket result: 1.1123 vs 1.1109 baseline).