Record: Cache Is All You Need — val_bpb 0.0887, 622KB artifact (3-seed mean)#913
RoyiRa wants to merge 1 commit into openai:main from
Conversation
I think the computation of the mixture model is incorrect here, due to collisions in the hash table. Counterexample: imagine you only had one hash bin. The numbers I see with n-grams are that they produce worse results on their own than the baseline if computed without the aforementioned collisions. That said, I think the n-grams can still be very helpful as a prior in combination with the language model if done right. TL;DR: I think there is a bug in the evaluation script that artificially boosts scores.
Replaces simple bigram mixing with battle-tested architecture from PRs openai#913/openai#907/openai#888 (0.09-0.10 BPB proven):

- Order 2-12 hash-based backoff tables (XOR of token*prime)
- np.bincount vectorized updates (10-50x faster than np.add.at)
- Two-pass: (1) neural scoring + cache build, (2) full rescore
- Entropy-adaptive alpha with per-order multipliers
- Temperature sharpening (0.85)
- 352MB RAM, ~83s total eval time

Expected: sub-0.2 BPB (from current 1.1190)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
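The hashed backoff tables and the bincount trick from this commit can be sketched roughly as follows. `ngram_cache.py` itself is not shown in this PR, so the table size, primes, and function names below are illustrative assumptions, not the submission's actual code.

```python
import numpy as np

TABLE_SIZE = 1 << 12          # small for the sketch; the real tables are larger
PRIMES = [1000003, 10000019]  # one (assumed) prime per context position

def context_hash(ctx: np.ndarray) -> np.ndarray:
    """XOR of token*prime across the context window, folded into the table."""
    h = np.zeros(ctx.shape[0], dtype=np.int64)
    for i in range(ctx.shape[1]):
        h ^= ctx[:, i].astype(np.int64) * PRIMES[i % len(PRIMES)]
    return h % TABLE_SIZE

def update_counts(counts: np.ndarray, ctx: np.ndarray, nxt: np.ndarray) -> None:
    """Vectorized count update: one np.bincount over flattened (bin, token)
    indices instead of a per-element scatter-add via np.add.at."""
    vocab = counts.shape[1]
    flat = context_hash(ctx) * vocab + nxt
    counts += np.bincount(flat, minlength=counts.size).reshape(counts.shape)

vocab = 256
counts = np.zeros((TABLE_SIZE, vocab), dtype=np.int64)
tokens = np.random.default_rng(0).integers(0, vocab, size=1000)
ctx = np.stack([tokens[:-2], tokens[1:-1]], axis=1)  # length-2 contexts
update_counts(counts, ctx, tokens[2:])               # one count per position
```

Note that distinct contexts can land in the same bin, which is exactly the collision issue raised in the review comment above.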
This is one of the most honest submissions in the competition. A 2-layer 128d model at 1.78 BPB, 622KB artifact, and the cache brings it to 0.0887 — really drives home how much of the BPB improvement across all submissions is coming from the cache layer rather than the neural architecture. The 36-line integration is clean and the phrase cache with variable probe lengths (16-64) on top of the n-gram backoff is a nice touch. Two small things:
Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
lmfao this is great
Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not correctly renormalize / reweight the LM's token distribution, and which look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion about this under the issues tab for more details, and please submit more runs in the future!
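The leak the organizer describes can be demonstrated with a toy example (this is not the submission's code): if the evaluator looks at the target token when choosing how to mix two distributions, the measured score improves even when both distributions are pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 16, 10_000
targets = rng.integers(0, V, size=N)

# Two sets of random, uninformative predictive distributions.
p_model = rng.dirichlet(np.ones(V), size=N)
p_cache = rng.dirichlet(np.ones(V), size=N)

idx = np.arange(N)
# Honest eval: score the model's probability of the target.
honest = -np.log2(p_model[idx, targets]).mean()
# "Leaky" eval: peek at the target and keep whichever distribution
# assigns it more mass, i.e. target-aware mixing.
leaky = -np.log2(np.maximum(p_model[idx, targets],
                            p_cache[idx, targets])).mean()
```

Here `leaky` comes out strictly below `honest` in bits per token, despite neither distribution containing any information about the targets.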
MAJOR REWRITE — match top competition approach:

- Shrink neural model to 2L/128d (~0.5MB compressed)
- Build n-gram tables from ALL training shards during training
- Store uint16-capped tables in artifact (training-data statistics)
- Pre-warm eval cache with training n-gram tables
- 300s train + n-gram build, 600s eval budget

Inspired by openai#944 (0.0165), openai#933 (0.0804), openai#913 (0.0887). The neural model is now irrelevant — the cache does the work.
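The "uint16-capped tables" idea in this commit amounts to clipping training-shard counts into the uint16 range before serializing them into the artifact. A minimal sketch (names and values are assumptions, not the actual code):

```python
import numpy as np

def cap_to_uint16(counts: np.ndarray) -> np.ndarray:
    """Clip large int64 counts to the uint16 maximum so the stored
    n-gram tables stay compact in the artifact."""
    return np.minimum(counts, np.iinfo(np.uint16).max).astype(np.uint16)

train_counts = np.array([0, 3, 70_000, 1_000_000], dtype=np.int64)
stored = cap_to_uint16(train_counts)  # -> [0, 3, 65535, 65535]
```

Capping trades away resolution on very frequent n-grams for a fourfold size reduction versus int64 storage.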
All cache targets (openai#868, openai#913, openai#933) were closed by the organizer. Retarget operator to PR openai#549 (accepted SOTA) and PR openai#1019. Sync upstream code, create run specs, update policy and campaign. Rewrite grant application for $500 development tier.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache Is All You Need
val_bpb: 0.0887 (3-seed mean) | 622 KB artifact | 8xH100 SXM
I started from the competition baseline `train_gpt.py` and made only a minimal integration change: 36 added lines plus one new file, `ngram_cache.py` (295 lines). The baseline trains a tiny 2-layer, 128d vanilla GPT; my addition is a compact eval-time n-gram and phrase cache layer with adaptive blending. The result is 0.0887 BPB in a 622 KB artifact.
Results (8xH100 80GB SXM)
Transformer Configuration
The baseline `train_gpt.py` with these env var overrides:

Changes to the baseline
36 lines added to `train_gpt.py`:

- `from ngram_cache import eval_val_with_cache`
- `forward_logits()` method on GPT (returns logits without computing loss)
- `main()`

One new file, `ngram_cache.py` (295 lines):

- `NgramEvalCache`: order 2-12 backoff with order-adaptive entropy gating
- `LongPhraseCache`: phrase probes at lengths [64, 56, 48, 36, 28, 20, 16]
- `eval_val_with_cache()`: sliding window eval with cache blending

How it works
For each scored token:
- `p_model`
- `p_ngram`
- `p_phrase`

Caches are updated online from already-scored tokens only. After a chunk is fully scored, it is added to the caches before scoring later chunks.
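A rough sketch of how the three distributions might be blended per token, assuming an entropy-adaptive alpha like the one named in the commit history. The gating function and the 50/50 split between the two caches are my assumptions; the actual weights in `ngram_cache.py` are not shown in this PR.

```python
import numpy as np

def blend(p_model, p_ngram, p_phrase, beta=0.5):
    """Mix neural, n-gram, and phrase distributions. The neural model's
    weight (alpha) shrinks when its predictive entropy is high, so the
    caches take over exactly where the tiny model is uncertain."""
    entropy = -np.sum(p_model * np.log2(p_model + 1e-12))
    alpha = 1.0 / (1.0 + beta * entropy)        # confident model -> alpha near 1
    p = alpha * p_model + (1 - alpha) * (0.5 * p_ngram + 0.5 * p_phrase)
    return p / p.sum()                          # renormalize the mixture

uniform = np.full(8, 1 / 8)                     # high-entropy model output
sharp = np.eye(8)[0] * 0.99 + 0.01 / 8          # near-one-hot cache output
mixed = blend(uniform, sharp, sharp)            # caches dominate the mixture
```

With a maximally uncertain model (3 bits of entropy over 8 symbols), alpha drops to 0.4 and the sharp cache distributions dominate the mixture.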
Compliance
Reproduction
Files
- `train_gpt.py`
- `ngram_cache.py`