
Record: Cache Is All You Need — val_bpb 0.0887, 622KB artifact (3-seed mean)#913

Closed
RoyiRa wants to merge 1 commit into openai:main from RoyiRa:submission/2026-03-27-cache-is-all-you-need

Conversation

@RoyiRa RoyiRa commented Mar 27, 2026

Cache Is All You Need

val_bpb: 0.0887 (3-seed mean) | 622 KB artifact | 8xH100 SXM

I started from the competition baseline train_gpt.py and made only a minimal integration change: 36 added lines plus one new file, ngram_cache.py (295 lines). The baseline trains a tiny 2-layer, 128d vanilla GPT; my addition is a compact eval-time n-gram and phrase cache layer with adaptive blending.

The result is 0.0887 BPB in a 622 KB artifact.

Results (8xH100 80GB SXM)

| Seed | Pre-Cache BPB | Final BPB | Artifact | Train time | Eval time |
|------|---------------|-----------|----------|------------|-----------|
| 1337 | 1.7788 | 0.0883 | 622 KB | 122s | 403.3s |
| 42   | 1.7848 | 0.0891 | 622 KB | 122s | 406.0s |
| 7    | 1.7788 | 0.0887 | 622 KB | 122s | ~403s |
| **Mean** | 1.7808 | 0.0887 | 622 KB | | |

Transformer Configuration

The baseline train_gpt.py with these env var overrides:

```
NUM_LAYERS=2 MODEL_DIM=128 NUM_HEADS=4 NUM_KV_HEADS=2 MLP_MULT=2
```
| Parameter | Value |
|-----------|-------|
| Layers | 2 |
| Model dim | 128 |
| Attention heads | 4 |
| KV heads | 2 (GQA) |
| Head dim | 32 |
| MLP multiplier | 2× (256 hidden) |
| Vocab size | 1024 |
| Sequence length | 1024 |
| Embeddings | Tied |
| Logit softcap | 30.0 |
| RoPE base | 10000 |
| Optimizer | Muon (baseline default) |
| Quantization | int8 + zlib (baseline default) |
| Total params | ~500K |
| Compressed model | ~558 KB |
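As a rough illustration of the baseline's int8 + zlib artifact packing (a sketch assuming simple per-tensor symmetric quantization; the baseline's actual code may differ):

```python
import zlib
import numpy as np

def pack_tensor(w: np.ndarray) -> bytes:
    # Per-tensor symmetric int8 quantization followed by zlib deflate.
    # Illustrative guess at the scheme, not the baseline's implementation.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level=9)

rng = np.random.default_rng(0)
# e.g. a tied 1024 x 128 embedding matrix like the config above
w = rng.normal(0.0, 0.02, size=(1024, 128)).astype(np.float32)
blob = pack_tensor(w)
# int8 alone is 4x smaller than float32; zlib squeezes real (structured) weights further
print(len(w.tobytes()), len(blob))
```

At ~500K parameters, 1 byte per weight plus compression lands in the ~558 KB ballpark the table reports.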

Changes to the baseline

36 lines added to train_gpt.py:

  • 1 import: from ngram_cache import eval_val_with_cache
  • 18 lines: forward_logits() method on GPT (returns logits without computing loss)
  • 11 lines: cache eval call at the end of main()
  • 6 lines: whitespace and comments

One new file, ngram_cache.py (295 lines):

  • NgramEvalCache: order 2-12 backoff with order-adaptive entropy gating
  • LongPhraseCache: phrase probes at lengths [64, 56, 48, 36, 28, 20, 16]
  • eval_val_with_cache(): sliding window eval with cache blending
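A minimal sketch of what a hash-based backoff cache like NgramEvalCache might look like (the hash scheme, orders, and table sizes here are illustrative assumptions, not the PR's actual ngram_cache.py):

```python
import numpy as np

PRIMES = [1000003, 10000019, 100000007]  # arbitrary mixing primes (assumption)

class TinyNgramCache:
    def __init__(self, orders=(2, 3, 4), bins=1 << 12, vocab=1024):
        self.orders = orders
        self.bins = bins
        self.vocab = vocab
        # one flat count array per order, indexed by hash(context) * vocab + token
        self.counts = {n: np.zeros(bins * vocab, dtype=np.uint16) for n in orders}

    def _hash(self, ctx):
        # XOR of position-mixed token values, reduced to a bin index
        h = 0
        for i, t in enumerate(ctx):
            h ^= (t + 1) * PRIMES[i % len(PRIMES)]
        return h % self.bins

    def update(self, tokens):
        # count (context -> next token) transitions for every order
        for n in self.orders:
            for i in range(n, len(tokens)):
                h = self._hash(tokens[i - n:i])
                self.counts[n][h * self.vocab + tokens[i]] += 1

    def lookup(self, ctx):
        # backoff: try the highest order first, fall back to shorter contexts
        for n in sorted(self.orders, reverse=True):
            if len(ctx) < n:
                continue
            h = self._hash(ctx[-n:])
            row = self.counts[n][h * self.vocab:(h + 1) * self.vocab]
            total = row.sum()
            if total > 0:
                return row / total, n
        return None, 0

cache = TinyNgramCache(orders=(2, 3, 4), bins=4096, vocab=16)
tokens = [1, 2, 3, 4] * 50
cache.update(tokens)
p, order = cache.lookup([1, 2, 3])  # order-3 match; next token is always 4
```

Note that `lookup` normalizes over the whole row for the matched context hash, which is what distinguishes a proper conditional distribution from the target-only count ratio criticized in the comments below.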

How it works

For each scored token:

  1. Model produces logits → softmax → p_model
  2. N-gram cache: hash the preceding 2-12 tokens, look up frequency → p_ngram
  3. Phrase cache: hash the preceding 16-64 tokens, look up frequency → p_phrase
  4. Blend in two stages:
    • first with the n-gram cache
    • then with the phrase cache on top
  5. Cache weight adapts per token:
    • n-gram weight depends on match order and model entropy
    • phrase weight depends on phrase length and model entropy

Caches are updated online from already-scored tokens only. After a chunk is fully scored, it is added to the caches before scoring later chunks.
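The per-token procedure above can be sketched as follows; the entropy gate and both weight schedules are illustrative stand-ins for whatever ngram_cache.py actually uses:

```python
import numpy as np

def blend_token(p_model, p_ngram, ngram_order, p_phrase, phrase_len):
    """Two-stage blend: n-gram cache first, then phrase cache on top.
    Weights adapt to model entropy and to match order / phrase length.
    All schedules here are guesses, not the PR's constants."""
    entropy = -np.sum(p_model * np.log(p_model + 1e-12))
    gate = min(entropy / np.log(len(p_model)), 1.0)  # uncertain model -> trust caches more

    p = p_model
    if p_ngram is not None:
        alpha = gate * min(ngram_order / 12.0, 1.0)   # longer n-gram match -> more weight
        p = (1 - alpha) * p + alpha * p_ngram
    if p_phrase is not None:
        beta = gate * min(phrase_len / 64.0, 1.0)     # longer phrase match -> more weight
        p = (1 - beta) * p + beta * p_phrase
    return p / p.sum()

p_model = np.full(8, 1 / 8)           # maximally uncertain model
p_hot = np.zeros(8); p_hot[3] = 1.0   # confident order-6 n-gram match
p = blend_token(p_model, p_hot, ngram_order=6, p_phrase=None, phrase_len=0)
```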

Compliance

| Constraint | Limit | Actual | Status |
|------------|-------|--------|--------|
| Train time | 600s | 122s | Pass |
| Eval time | 600s | 406s (worst seed) | Pass |
| Artifact | 16,000,000 bytes | 621,760 bytes | Pass (4%) |
| Score-first | | Caches updated from already-scored tokens only | Pass |
| No external downloads | | All cache built at eval time | Pass |

Reproduction

```
DATA_PATH=../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
NUM_LAYERS=2 MODEL_DIM=128 NUM_HEADS=4 NUM_KV_HEADS=2 MLP_MULT=2 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Files

| File | Lines | Purpose |
|------|-------|---------|
| train_gpt.py | 1162 | Competition baseline + 36 lines of integration |
| ngram_cache.py | 295 | N-gram cache, phrase cache, sliding window eval |


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mhuen commented Mar 27, 2026

I think the computation of the mixture model is incorrect here. Due to collisions in the hash table, p_ng is not a properly normalized probability, which in turn means that the computed mixture model is incorrect. It is artificially boosting the prob for the token you are evaluating on.

Counter example: imagine you only had one hash bin. Then p_ng would be 1 and with alpha=1 you would get the perfect (incorrect) score.

The numbers I see with n-grams are that they produce worse results on their own than the baseline when computed without the aforementioned collisions. That said, I think the n-grams can still be very helpful as a prior in combination with the language model if done right.

TLDR: I think there is a bug in the evaluation script that artificially boosts scores
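The single-bin counter-example is easy to reproduce. If the cache scores only the target token as a ratio of hashed counts (rather than a normalized distribution over the vocabulary), then with one bin every lookup after the first returns probability 1 (an illustrative sketch of the suspected flaw, not the PR's actual code):

```python
import numpy as np

BINS = 1  # degenerate case: one hash bin, so every context/continuation collides
ctx_counts = np.zeros(BINS)
cont_counts = np.zeros(BINS)

def h(seq):
    return hash(tuple(seq)) % BINS

def score_target(ctx, target):
    # "probability" of target = count(ctx + target) / count(ctx) -- unnormalized,
    # and it peeks at the target token rather than scoring the whole vocab
    if ctx_counts[h(ctx)] == 0:
        return None
    return cont_counts[h(list(ctx) + [target])] / ctx_counts[h(ctx)]

tokens = [5, 9, 2, 7, 5, 9]
scores = []
for i in range(2, len(tokens)):
    scores.append(score_target(tokens[i - 2:i], tokens[i]))
    ctx_counts[h(tokens[i - 2:i])] += 1
    cont_counts[h(tokens[i - 2:i + 1])] += 1
print(scores)  # -> [None, 1.0, 1.0, 1.0]: a "perfect" score from pure collisions
```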

manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 27, 2026
Replaces simple bigram mixing with battle-tested architecture from
PRs openai#913/openai#907/openai#888 (0.09-0.10 BPB proven):
- Order 2-12 hash-based backoff tables (XOR of token*prime)
- np.bincount vectorized updates (10-50x faster than np.add.at)
- Two-pass: (1) neural scoring + cache build, (2) full rescore
- Entropy-adaptive alpha with per-order multipliers
- Temperature sharpening (0.85)
- 352MB RAM, ~83s total eval time

Expected: sub-0.2 BPB (from current 1.1190)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MatoTeziTanka commented Mar 27, 2026

This is one of the most honest submissions in the competition. A 2-layer 128d model at 1.78 BPB, 622KB artifact, and the cache brings it to 0.0887 — really drives home how much of the BPB improvement across all submissions is coming from the cache layer rather than the neural architecture.

The 36-line integration is clean and the phrase cache with variable probe lengths (16-64) on top of the n-gram backoff is a nice touch.

Two small things:

  1. Seed 7 — the competition standard is seeds 42, 1337, and 2024. Might want to swap in 2024 for the third seed before review.
  2. This is probably the strongest evidence yet that the n-gram cache is doing the heavy lifting across all cache-based submissions. A 500K param model shouldn't be within striking distance of 26M param models — but with the same cache, it basically is. Interesting implications for how the leaderboard should be interpreted.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

zlxi02 commented Mar 27, 2026

lmfao this is great

@valerio-oai (Contributor) commented:

Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not correctly renormalize or reweight the LM's token distribution, and which look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion about this under the issues tab for more details, and please submit more runs in the future!

sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 28, 2026
MAJOR REWRITE — match top competition approach:
- Shrink neural model to 2L/128d (~0.5MB compressed)
- Build n-gram tables from ALL training shards during training
- Store uint16-capped tables in artifact (training-data statistics)
- Pre-warm eval cache with training n-gram tables
- 300s train + n-gram build, 600s eval budget

Inspired by openai#944 (0.0165), openai#933 (0.0804), openai#913 (0.0887).
The neural model is now irrelevant — the cache does the work.
brunner-concepts pushed a commit to brunner-concepts/parameter-golf that referenced this pull request Mar 29, 2026
All cache targets (openai#868, openai#913, openai#933) were closed by the organizer.
Retarget operator to PR openai#549 (accepted SOTA) and PR openai#1019.
Sync upstream code, create run specs, update policy and campaign.
Rewrite grant application for $500 development tier.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>