
Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)#1143

Closed
simon-marcus wants to merge 2 commits into openai:main from simon-marcus:codex/tm0054-autoresearch-pr-prep

Conversation

@simon-marcus commented Mar 30, 2026

Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)

Results

| Seed | step_avg | steps | roundtrip | sliding | legal_ttt_exact | bytes_total |
| --- | --- | --- | --- | --- | --- | --- |
| 42 | 84.63ms | 7091 | 1.10466967 | 1.08295388 | 1.08008661 | 15,866,740 |
| 1337 | 84.71ms | 7084 | 1.10565088 | 1.08398224 | 1.08102737 | 15,850,756 |
| 2026 | 84.65ms | 7089 | 1.10490932 | 1.08315990 | 1.08058261 | 15,849,792 |
| Mean | 84.66ms | 7088 | 1.10507662 | 1.08336534 | 1.08056553 | 15,855,763 |

Against the currently accepted leader #549 at 1.1194, this is an improvement of 0.03883447 BPB, or about 3.47%.
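As a quick sanity check on the headline arithmetic (numbers taken directly from the table and the accepted leader):

```python
leader_bpb = 1.1194        # previously accepted leader (#549)
this_bpb = 1.08056553      # mean legal_ttt_exact across the three seeds

delta = leader_bpb - this_bpb
rel_pct = delta / leader_bpb * 100

print(f"{delta:.8f} BPB")   # 0.03883447
print(f"{rel_pct:.2f}%")    # 3.47
```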

Summary

This submission combines three ideas:

  1. A backward-looking, score-first TTT evaluation path following the accepted PR #461 framework.
  2. A custom TokenMonster-derived tokenizer (Scylla) selected through iterative autoresearch and proxy validation rather than manual guesswork.
  3. A full-data retokenized FineWeb competition bundle using that tokenizer, with runtime val_bpb accounting driven by explicit per-token metadata rather than SentencePiece runtime inspection.

Our strategy is a stack change that starts at the tokenizer and runs all the way through evaluation:

  • tokenizer family search
  • budget-aware tokenizer screening
  • proxy promotion and rejection of dead ends
  • exact runtime byte accounting
  • full-data retokenization into the promoted tokenizer
  • legal score-first adaptive evaluation

To the best of our knowledge, this is also among the first leaderboard-caliber submissions in the competition to change the tokenizer itself rather than inherit the published sp1024 tokenization. If reviewers spot an earlier example we missed, we would be happy to correct that framing; either way, we think tokenizer search is a genuinely promising avenue here and welcome scrutiny and follow-up work.

Tokenizer Journey

The tokenizer work went through several iterative stages. The short version is that we tried the obvious thing first, watched it flatten out, and then had the good sense to stop being sentimental about it.

1. SentencePiece autoresearch

We first built an autoresearch loop around SentencePiece. That loop optimized tokenizer candidates against a FineWeb-aligned screening metric and later against budget-aware heuristics.

This turned out to be useful exploration, but not the winning path:

  • locally (i.e., on my MacBook Pro), SentencePiece candidates improved the cheap tokenizer-screen metric
  • in proxy model runs on beefier hardware, those gains mostly failed to transfer
  • the search quickly saturated in a narrow neighborhood

That negative result mattered. It told us that “better tokenizer statistics” were not enough by themselves, and that larger vocabularies were often buying slim marginal gains with too much artifact budget. It also gave us permission to leave SentencePiece alone instead of continuing to hammer on a local maximum.

2. TokenMonster sidecar and proxy calibration

We then evaluated TokenMonster as a challenger family. Early cheap-screen results suggested that small TokenMonster vocabularies, especially around the 1024 regime, were more promising than either larger TokenMonster vocabularies or the best SentencePiece variants.

Proxy validation sharpened that impression:

  • large TokenMonster variants did not hold up, but small TokenMonster variants did
  • the best direction was not “bigger tokenizer”, it was “simpler tokenizer, slightly pruned, same strong byte efficiency”

3. TokenMonster-only autoresearch

We then narrowed the search into a TokenMonster-only lane. After broadening the proposal policy away from tiny local resize-only edits, the best line became a lightly pruned derivative of english-1024-clean-v1.

That candidate, tracked internally as tm0054 and nicknamed Scylla, kept the good byte efficiency of the parent vocabulary while reducing waste in the active vocabulary.

This was then promoted through:

  • tokenizer screening
  • proxy validation
  • matched local training comparison
  • legal-TTT ladder testing
  • full-data bundle export

The important negative result was that larger-vocab and SentencePiece-side improvements looked better on cheap screening than they did in proxy or full runs. The winning lesson was not “make the tokenizer bigger.” It was “make the tokenizer better aligned to the artifact budget and to the tiny-model learning dynamics.”

If this submission does end up being among the first tokenizer-changing entries seriously pushed to the top of the leaderboard, we would be delighted to see other people push on the same door. This competition has been especially exciting for cultivating unusual and interesting ideas, and we think tokenizer search deserves a place in that mix.

Full-Data Bundle

For the corrected competition path, we built a full-data Scylla bundle from the published sp1024 FineWeb export by retokenizing in shard order.

The corrected bundle uses:

  • 79 train shards
  • 1 val shard
  • preserved shard ordering
  • preserved validation ordering

Runtime tokenizer assets:

  • candidate.vocab
  • candidate.meta.npz

The metadata artifact supplies:

  • per-token byte lengths
  • leading-space flags
  • boundary-token flags

so the runtime path does not need SentencePiece to inspect tokenizer internals during evaluation.
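A minimal sketch of what metadata-driven byte accounting looks like; the array names (`base_bytes`, `has_leading_space`, `is_boundary_token`) follow those quoted later in the review discussion, and the exact layout of `candidate.meta.npz` is an assumption:

```python
import numpy as np

def total_bytes_from_meta(token_ids, base_bytes, has_leading_space, is_boundary_token):
    """Metadata-driven byte accounting: a token contributes base_bytes[t],
    plus one space byte when it carries a leading space that the previous
    token (a boundary token) does not consume. No tokenizer library is
    invoked at eval time."""
    ids = np.asarray(token_ids)
    lead = has_leading_space.astype(bool)
    bound = is_boundary_token.astype(bool)
    total = int(base_bytes[ids].sum())
    if ids.size:
        total += int(lead[ids[0]])  # first token: no previous consumer
    if ids.size > 1:
        total += int((lead[ids[1:]] & ~bound[ids[:-1]]).sum())
    return total

# Loading the exported LUTs would look roughly like:
# meta = np.load("candidate.meta.npz")
# n = total_bytes_from_meta(ids, meta["base_bytes"],
#                           meta["has_leading_space"], meta["is_boundary_token"])
```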

A compact audit note is included in TOKENIZER_VALIDATION.md.

Legality

This record path is intended to stay within the currently accepted legality standard:

  • no target-conditioned mixing
  • score-first TTT only
  • full-data retokenized bundle with explicit metadata-driven byte accounting

Backward-looking, score-first TTT following PR #461's framework:

  • score a chunk first
  • only then adapt on that already-scored chunk
  • never use future tokens to change the distribution assigned to already-scored tokens

Score-first protocol: the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal score-before-update TTT pattern that organizers have treated as legal in the adaptive track discussion and accepted submissions.
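The score-first constraint above can be sketched as a loop; the `model.score`/`adapt_step` interface here is hypothetical, not the submission's actual train_gpt.py:

```python
import math

def score_first_ttt(model, chunks, adapt_step):
    """Score-first TTT: every chunk is scored with the weights as they
    were BEFORE any adaptation on that chunk, so no already-scored
    token's distribution is ever revised using later data."""
    total_nats = 0.0
    total_bytes = 0
    for chunk in chunks:
        # 1. Score first, under the current (pre-adaptation) weights.
        nats, n_bytes = model.score(chunk)  # hypothetical interface
        total_nats += nats
        total_bytes += n_bytes
        # 2. Only then adapt on the chunk that was just scored.
        adapt_step(model, chunk)
    return total_nats / math.log(2) / total_bytes  # bits per byte
```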

Implementation Notes

The main script in this folder is the promoted legal TTT stack adapted for tokenizer bundles:

  • TOKENIZER_PATH points to the promoted tokenizer vocab
  • TOKENIZER_META_PATH points to the exported metadata LUTs
  • TTT_ENABLED=1

The strongest path found so far combines:

  • the promoted Scylla tokenizer
  • legal score-first TTT
  • the current tuned 11-layer legal stack

Included Files

  • train_gpt.py
  • candidate.vocab
  • candidate.meta.npz
  • manifest.json
  • train_seed42.log
  • train_seed1337.log
  • train_seed2026.log
  • TOKENIZER_VALIDATION.md

Acknowledgements

Thanks to @0hq and @valerio-oai for organizing, maintaining, and moderating an unusually fun and technically demanding competition.

The tokenizer lane also benefited from reading and learning from other competitors’ public work, especially the broader discussion around legal evaluation methods and tokenizer tradeoffs.

simon-marcus changed the title from "Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)" to "Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)" on Mar 30, 2026
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used old stack. We use PR openai#1060's modern stack
(GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer.
Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
@NoesisGenesis commented Mar 31, 2026

Interesting submission! A few findings from a close read of the code:

The Scylla submission's TTT implementation follows the score-first pattern correctly and satisfies conditions 1-4 of #1017. The architecture and quantization appear to be clean. My concern is with the BPB calculation under the custom tokenizer.

TokenMonster uses modifier tokens (IDs 36, 37, 38, 151, 152) that alter the byte output of the following token. For example, tokens 37, 151, and 152 delete the leading space from the next token, reducing its decoded byte length by one. The per-token metadata in candidate.meta.npz stores base_bytes as each token's standalone byte length, but the actual byte output is context-dependent: it changes based on whether a modifier precedes the token.

Concretely, if token 151 (capitalize + delete-space) is followed by token 305 (" he", base_bytes=3), the metadata sums to 0 + 3 = 3 bytes, but the decoded output is "He" = 2 bytes. Every such modifier-token pair overcounts by one byte. Since val_bpb = total_nats / total_bytes, inflating the denominator deflates the reported score.

Across representative text, the metadata-derived byte count exceeds the true decoded byte count by approximately 6%. Applied to the claimed 1.0806:

| | BPB |
| --- | --- |
| Claimed (metadata bytes) | 1.0806 |
| Corrected (true decoded bytes) | ~1.148 |
| Previously merged SOTA | 1.1194 |

With correct byte accounting, the submission does not beat the previously merged leader.

So this is a "rule 15" issue: the custom tokenizer's BPB calculation is not correct. The TOKENIZER_VALIDATION.md states that byte counting is "metadata-driven and deterministic," which is true of the mechanism but not of the result. A straightforward fix would be to compute total_bytes by summing the byte lengths of each token's actual decoded output in sequence context, rather than summing standalone base_bytes values from the metadata array.
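The proposed fix can be sketched with a hypothetical `decode(ids) -> str` callable standing in for the real TokenMonster binding (which is not shown in this PR):

```python
def context_true_bytes(token_ids, decode):
    """Count decoded bytes per token in sequence context: the marginal
    byte contribution of token i is
        len(decode(ids[:i+1])) - len(decode(ids[:i])),
    which charges modifier/token pairs their true joint output instead
    of summing standalone base_bytes values.

    `decode` is a hypothetical callable mapping a token-id list to its
    decoded string; any concrete tokenizer binding would supply it."""
    total = 0
    prev_len = 0
    for i in range(len(token_ids)):
        cur_len = len(decode(token_ids[: i + 1]).encode("utf-8"))
        total += cur_len - prev_len
        prev_len = cur_len
    return total
```

With this accounting, a delete-space modifier followed by a leading-space token contributes the bytes the pair actually emits, not the sum of their standalone lengths.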

Separately, the tokenizer cannot represent 131 of 256 possible byte values (including 101 high bytes needed for UTF-8). Non-ASCII input is handled via NFC-to-NFD Unicode decomposition, which changes the byte representation and introduces a second source of byte-count divergence from the original text. Whether this materially affects the score depends on the non-ASCII content in the FineWeb validation slice.

The tokenizer direction itself is worth pursuing. If the byte accounting is fixed and the score still lands competitively, this would be a strong contribution.

@dexhunter commented Mar 31, 2026

Great work on the tokenizer exploration — this is one of the most creative directions in the competition, and the autoresearch methodology is thorough.

I wanted to flag a byte-accounting issue that affects the reported val_bpb. Per the README:

"If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated. Submissions that edit the tokenizer will be examined much more carefully, since bugs may unjustly improve your score."

(Also flagged in Issue #897 as a known risk with custom tokenizers.)

The issue: The submitted candidate.meta.npz has has_leading_space=0 and is_boundary_token=0 for all tokens. This means the eval byte count is just sum(base_bytes[tgt]). However, for TokenMonster vocabs, this overcounts the actual text bytes by ~4.13% due to two causes:

  1. 38 capcode space-stripping tokens (modifiers ending in D/DC/DW) — these consume the leading space from the following token during decode. They need is_boundary_token=True so the formula token_bytes = base_bytes[tgt] + (has_leading_space[tgt] & ~is_boundary[prev]) suppresses the +1 space byte correctly.

  2. 27 UTF-8 byte tokens (IDs 75-101) — each individually decodes to U+FFFD (3 UTF-8 bytes), but actually represents 1 raw byte in the original text. Setting base_bytes=1 for these fixes the overcount.

I verified on the full FineWeb val set:

# Ground truth: decode all SP1024 val tokens → count UTF-8 bytes
ground_truth_bytes = 151,080,633

# Scylla tokenization with current meta.npz (all zeros):
sum(base_bytes[t]) = 157,319,779  (+4.13% overcount)

# With corrected meta.npz (38 boundary + 27 byte-token fixes):
corrected_byte_count = 151,040,811  (-0.026%, essentially exact)

Since val_bpb = (val_loss / ln2) × (tokens / total_bytes), an overcounted total_bytes deflates tokens/bytes and therefore deflates the reported BPB. With the correct byte count (151M instead of 157M), the BPB would be ~4% higher than reported:

| | Reported (overcounted bytes) | Corrected (ground-truth bytes) |
| --- | --- | --- |
| Sliding BPB | 1.0834 | ~1.128 |
| TTT BPB | 1.0806 | ~1.126 |
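The corrected figures are just the reported ones rescaled by the byte-count ratio, since BPB is inversely proportional to total_bytes at fixed loss; using the counts above:

```python
overcounted = 157_319_779  # sum(base_bytes[t]) with the all-zero meta.npz
corrected = 151_040_811    # after the boundary and byte-token fixes

ratio = overcounted / corrected          # ~1.0416x inflation
print(f"sliding: {1.0834 * ratio:.3f}")  # ~1.128
print(f"TTT:     {1.0806 * ratio:.3f}")  # ~1.126
```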

The fix is straightforward: detect capcode-D tokens via decode testing and mark as is_boundary_token=True, plus set base_bytes=1 for byte fallback tokens. Happy to share the detection/correction script if helpful.

cc @0hq @valerio-oai — this may be relevant for reviewing TokenMonster-based submissions.

@simon-marcus (Author)

@dexhunter @NoesisGenesis Thank you both for the thorough and thoughtful responses here. I am digging into the details now -- will decide whether this is a "revise and resubmit" or "fight me, dammit" situation shortly.

@simon-marcus (Author)

I need to check in on the byte accounting further -- and I don't want to leave this PR open with potentially dirty data, so I'm closing for now.

