Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)#1143
simon-marcus wants to merge 2 commits into openai:main
Conversation
Complete pipeline to beat openai#1 (1.0806 BPB):

- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used the old stack. We use PR openai#1060's modern stack (GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer. Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
Interesting submission! A few findings from a close read of the code:

The Scylla submission's TTT implementation follows the score-first pattern correctly and satisfies conditions 1-4 of #1017. The architecture and quantization appear to be clean.

My concern is with the BPB calculation under the custom tokenizer. TokenMonster uses modifier tokens (IDs 36, 37, 38, 151, 152) that alter the byte output of the following token. For example, tokens 37, 151, and 152 delete the leading space from the next token, reducing its decoded byte length by one. The per-token metadata does not capture this interaction. Concretely, if token 151 (capitalize + delete-space) is followed by token 305, the decoded output is one byte shorter than the metadata for token 305 implies.

Across representative text, the metadata-derived byte count exceeds the true decoded byte count by approximately 6%. Applied to the claimed 1.0806, correcting the byte denominator gives roughly 1.0806 × 1.06 ≈ 1.145 BPB.
With correct byte accounting, the submission does not beat the previously merged leader. So this is a "rule 15" issue: the custom tokenizer's BPB calculation is not correct.

Separately, the tokenizer cannot represent 131 of 256 possible byte values (including 101 high bytes needed for UTF-8). Non-ASCII input is handled via NFC-to-NFD Unicode decomposition, which changes the byte representation and introduces a second source of byte-count divergence from the original text. Whether this materially affects the score depends on the non-ASCII content in the FineWeb validation slice.

The tokenizer direction itself is worth pursuing. If the byte accounting is fixed and the score still lands competitively, this would be a strong contribution.
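To make the overcount mechanism concrete, here is a toy sketch of the delete-space interaction. The vocab strings and the modifier behavior are illustrative stand-ins, not the real Scylla tables; only the ID 151 is borrowed from the example above.

```python
# Toy illustration: a "delete-space" modifier token removes the leading
# space of the NEXT token, so summing static per-token byte lengths
# overcounts the true decoded byte length.
vocab = {0: " the", 1: " cat", 151: ""}   # 151 plays the modifier role here
MODIFIERS = {151}

def decode(tokens):
    out, pending = [], False
    for t in tokens:
        if t in MODIFIERS:
            pending = True        # modifier affects the next token's bytes
            continue
        s = vocab[t]
        if pending:
            s = s.lstrip(" ")     # delete the following token's leading space
            pending = False
        out.append(s)
    return "".join(out)

tokens = [0, 151, 1]
true_bytes = len(decode(tokens).encode("utf-8"))                 # " thecat" -> 7
meta_bytes = sum(len(vocab[t].encode("utf-8")) for t in tokens)  # 4 + 0 + 4 = 8
```

Every modifier use shifts the static sum one byte above the true decoded count, which is exactly the divergence the BPB denominator inherits.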
Great work on the tokenizer exploration — this is one of the most creative directions in the competition, and the autoresearch methodology is thorough. I wanted to flag a byte-accounting issue that affects the reported val_bpb. (Also flagged in Issue #897 as a known risk with custom tokenizers.)

The issue: the submitted per-token metadata does not account for capcode modifier tokens, so byte counts derived from it diverge from the true decoded byte counts.
I verified on the full FineWeb val set:

```
# Ground truth: decode all SP1024 val tokens → count UTF-8 bytes
ground_truth_bytes  = 151,080,633

# Scylla tokenization with current meta.npz (all zeros):
sum(base_bytes[t])  = 157,319,779   (+4.13% overcount)

# With corrected meta.npz (38 boundary + 27 byte-token fixes):
corrected_byte_count = 151,040,811  (-0.026% — essentially exact)
```

The fix is straightforward: detect capcode-D tokens via decode testing and mark them in the metadata.

cc @0hq @valerio-oai — this may be relevant for reviewing TokenMonster-based submissions.
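A sketch of what decode testing could look like: the `decode` callable, the probe strategy, and the npz field name below are assumptions for illustration, not the submission's actual API.

```python
import numpy as np

def measure_next_token_deltas(decode, vocab_size, probe):
    """For each token id m, decode [m] and [m, probe] and compare UTF-8
    byte lengths. A negative delta means m shrinks the following token
    (capcode-D behavior); zero means no interaction."""
    probe_len = len(decode([probe]).encode("utf-8"))
    deltas = np.zeros(vocab_size, dtype=np.int32)
    for m in range(vocab_size):
        solo = len(decode([m]).encode("utf-8"))
        pair = len(decode([m, probe]).encode("utf-8"))
        deltas[m] = pair - solo - probe_len
    return deltas

# The measured deltas could then be stored next to the base byte lengths,
# e.g. np.savez("candidate.meta.npz", next_token_byte_delta=deltas)
# (field name hypothetical).
```

Summing `base_bytes[t] + delta[prev_t]` over the stream would then reproduce the true decoded byte count.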
@dexhunter @NoesisGenesis Thank you both for the thorough and thoughtful responses here. I am digging into the details now -- will decide whether this is a "revise and resubmit" or "fight me, dammit" situation shortly.
I need to check in on the byte accounting further -- and I don't want to leave this PR open with potentially dirty data, so I'm closing for now. |
Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)
Results
Against the currently accepted leader #549 at 1.1194, this is an improvement of 0.03883447 BPB, or about 3.47%.

Summary
This submission combines three ideas:
- Score-first TTT following the PR #461 framework.
- A novel tokenizer (Scylla) selected through iterative autoresearch and proxy validation rather than manual guesswork.
- val_bpb accounting driven by explicit per-token metadata rather than SentencePiece runtime inspection.

Our strategy is a stack change that starts at the tokenizer and runs all the way through evaluation.
To the best of our knowledge, this is also among the first leaderboard-caliber submissions in the competition to change the tokenizer itself rather than inherit the published sp1024 tokenization. If reviewers spot an earlier example we missed, we would be happy to correct that framing; either way, we think tokenizer search is a genuinely promising avenue here and welcome scrutiny and follow-up work.

Tokenizer Journey
The tokenizer work went through several iterative stages. The short version is that we tried the obvious thing first, watched it flatten out, and then had the good sense to stop being sentimental about it.
1. SentencePiece autoresearch
We first built an autoresearch loop around SentencePiece. That loop optimized tokenizer candidates against a FineWeb-aligned screening metric and later against budget-aware heuristics.
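The screening metric itself is not spelled out in this PR. A plausible stand-in is plain compression efficiency, bytes per emitted token on FineWeb-like text; the function below is an illustrative assumption, not the actual autoresearch loop.

```python
def bytes_per_token(tokenize, texts):
    """Mean UTF-8 bytes per emitted token across a text sample; higher
    means the candidate tokenizer compresses the sample better."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_bytes / total_tokens

# Illustrative use with a trivial whitespace tokenizer as the candidate:
bpt = bytes_per_token(str.split, ["the cat sat", "on the mat"])
```

A budget-aware variant would penalize this score by vocabulary size or artifact bytes, which is the direction the later heuristics took.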
This turned out to be useful exploration, but not the winning path.
That negative result mattered. It told us that “better tokenizer statistics” were not enough by themselves, and that larger vocabularies were often buying slim marginal gains with too much artifact budget. It also gave us permission to leave SentencePiece alone instead of continuing to hammer on a local maximum.
2. TokenMonster sidecar and proxy calibration
We then evaluated TokenMonster as a challenger family. Early cheap-screen results suggested that small TokenMonster vocabularies, especially around the 1024 regime, were more promising than either larger TokenMonster vocabularies or the best SentencePiece variants.

Proxy validation sharpened that impression.
3. TokenMonster-only autoresearch
We then narrowed the search into a TokenMonster-only lane. After broadening the proposal policy away from tiny local resize-only edits, the best line became a lightly pruned derivative of english-1024-clean-v1.

That candidate, tracked internally as tm0054 and nicknamed Scylla, kept the good byte efficiency of the parent vocabulary while reducing waste in the active vocabulary. It was then promoted through the proxy-validation and full-run stages.
The important negative result was that larger-vocab and SentencePiece-side improvements looked better on cheap screening than they did in proxy or full runs. The winning lesson was not “make the tokenizer bigger.” It was “make the tokenizer better aligned to the artifact budget and to the tiny-model learning dynamics.”
If this submission does end up being among the first tokenizer-changing entries seriously pushed to the top of the leaderboard, we would be delighted to see other people push on the same door. This competition has been especially exciting for cultivating unusual and interesting ideas, and we think tokenizer search deserves a place in that mix.
Full-Data Bundle
For the corrected competition path, we built a full-data Scylla bundle from the published sp1024 FineWeb export by retokenizing in shard order.

The corrected bundle uses:
- 79 train shards
- 1 val shard

Runtime tokenizer assets:
- candidate.vocab
- candidate.meta.npz

The metadata artifact supplies the per-token byte accounting, so the runtime path does not need SentencePiece to inspect tokenizer internals during evaluation.
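As a sketch of how metadata-driven scoring reduces val_bpb to two sums (array and field names here are assumptions, not the bundle's actual schema): total per-token loss converted to bits, divided by total decoded bytes looked up from the metadata LUT.

```python
import numpy as np

def metadata_val_bpb(token_ids, nll_nats, byte_len_lut):
    """token_ids: (N,) validation token ids; nll_nats: (N,) per-token
    losses in nats; byte_len_lut: (vocab,) decoded byte length per id."""
    total_bits = nll_nats.sum() / np.log(2)       # nats -> bits
    total_bytes = byte_len_lut[token_ids].sum()   # metadata-driven bytes
    return total_bits / total_bytes
```

This is also where the reviewers' concern lands: if `byte_len_lut` overcounts, the denominator grows and the reported BPB shrinks.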
A compact audit note is included in TOKENIZER_VALIDATION.md.

Legality
This record path is intended to stay within the currently accepted legality standard:
Backward-looking, score-first TTT following PR #461's framework:

Score-first protocol: the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal score-before-update TTT pattern that organizers have treated as legal in the adaptive track discussion and accepted submissions.
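A minimal sketch of that protocol, with `score` and `adapt` as placeholder callables rather than the submission's actual API:

```python
def score_first_ttt(score, adapt, chunks):
    """score(chunk) -> loss under the CURRENT parameters, with no update;
    adapt(chunk) updates parameters in place.
    Every chunk is scored strictly before the model adapts on it, and no
    chunk is ever re-scored after an update."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # score first, under pre-update params
        adapt(chunk)                 # only then adapt on the same chunk
    return losses
```

Legality hinges on the ordering inside the loop body: swapping the two lines would leak post-update information into the reported score.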
Implementation Notes
The main script in this folder is the promoted legal TTT stack adapted for tokenizer bundles:
- TOKENIZER_PATH points to the promoted tokenizer vocab
- TOKENIZER_META_PATH points to the exported metadata LUTs
- TTT_ENABLED=1

The strongest path found so far combines this stack with the Scylla tokenizer.

Included Files
- train_gpt.py
- candidate.vocab
- candidate.meta.npz
- manifest.json
- train_seed42.log
- train_seed1337.log
- train_seed2026.log
- TOKENIZER_VALIDATION.md

Acknowledgements
Thanks to @0hq and @valerio-oai for organizing, maintaining, and moderating an unusually fun and technically demanding competition.
The tokenizer lane also benefited from reading and learning from other competitors’ public work, especially the broader discussion around legal evaluation methods and tokenizer tradeoffs.