
Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)#1143

Closed
simon-marcus wants to merge 2 commits into openai:main from simon-marcus:codex/tm0054-autoresearch-pr-prep

Conversation

@simon-marcus commented Mar 30, 2026

Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)

Results

| Seed | step_avg | steps | roundtrip | sliding | legal_ttt_exact | bytes_total |
| --- | --- | --- | --- | --- | --- | --- |
| 42 | 84.63ms | 7091 | 1.10466967 | 1.08295388 | 1.08008661 | 15,866,740 |
| 1337 | 84.71ms | 7084 | 1.10565088 | 1.08398224 | 1.08102737 | 15,850,756 |
| 2026 | 84.65ms | 7089 | 1.10490932 | 1.08315990 | 1.08058261 | 15,849,792 |
| Mean | 84.66ms | 7088 | 1.10507662 | 1.08336534 | 1.08056553 | 15,855,763 |

Against the currently accepted leader #549 at 1.1194, this is an improvement of 0.03883447 BPB, or about 3.47%.
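As a quick sanity check on the headline arithmetic (numbers taken directly from the table and the accepted leader):

```python
leader_bpb = 1.1194        # previously accepted leader (#549)
this_bpb = 1.08056553      # mean legal_ttt_exact across the three seeds

delta = leader_bpb - this_bpb
rel_pct = delta / leader_bpb * 100

print(f"{delta:.8f} BPB")   # 0.03883447
print(f"{rel_pct:.2f}%")    # 3.47
```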

Summary

This submission combines three ideas:

  1. A backward-looking, score-first TTT evaluation path following the accepted PR #461 framework.
  2. A custom TokenMonster-derived tokenizer (Scylla) selected through iterative autoresearch and proxy validation rather than manual guesswork.
  3. A full-data retokenized FineWeb competition bundle using that tokenizer, with runtime val_bpb accounting driven by explicit per-token metadata rather than SentencePiece runtime inspection.

Our strategy is a stack change that starts at the tokenizer and runs all the way through evaluation:

  • tokenizer family search
  • budget-aware tokenizer screening
  • proxy promotion and rejection of dead ends
  • exact runtime byte accounting
  • full-data retokenization into the promoted tokenizer
  • legal score-first adaptive evaluation

To the best of our knowledge, this is also among the first leaderboard-caliber submissions in the competition to change the tokenizer itself rather than inherit the published sp1024 tokenization. If reviewers spot an earlier example we missed, we would be happy to correct that framing; either way, we think tokenizer search is a genuinely promising avenue here and welcome scrutiny and follow-up work.

Tokenizer Journey

The tokenizer work went through several iterative stages. The short version is that we tried the obvious thing first, watched it flatten out, and then had the good sense to stop being sentimental about it.

1. SentencePiece autoresearch

We first built an autoresearch loop around SentencePiece. That loop optimized tokenizer candidates against a FineWeb-aligned screening metric and later against budget-aware heuristics.

This turned out to be useful exploration, but not the winning path:

  • locally (i.e., on my MacBook Pro), SentencePiece candidates improved the cheap tokenizer-screen metric
  • in proxy model runs on beefier hardware, those gains mostly failed to transfer
  • the search quickly saturated in a narrow neighborhood

That negative result mattered. It told us that “better tokenizer statistics” were not enough by themselves, and that larger vocabularies were often buying slim marginal gains with too much artifact budget. It also gave us permission to leave SentencePiece alone instead of continuing to hammer on a local maximum.

2. TokenMonster sidecar and proxy calibration

We then evaluated TokenMonster as a challenger family. Early cheap-screen results suggested that small TokenMonster vocabularies, especially around the 1024 regime, were more promising than either larger TokenMonster vocabularies or the best SentencePiece variants.

Proxy validation sharpened that impression:

  • large TokenMonster variants did not hold up, but small TokenMonster variants did
  • the best direction was not “bigger tokenizer”, it was “simpler tokenizer, slightly pruned, same strong byte efficiency”

3. TokenMonster-only autoresearch

We then narrowed the search into a TokenMonster-only lane. After broadening the proposal policy away from tiny local resize-only edits, the best line became a lightly pruned derivative of english-1024-clean-v1.

That candidate, tracked internally as tm0054 and nicknamed Scylla, kept the good byte efficiency of the parent vocabulary while reducing waste in the active vocabulary.

This was then promoted through:

  • tokenizer screening
  • proxy validation
  • matched local training comparison
  • legal-TTT ladder testing
  • full-data bundle export

The important negative result was that larger-vocab and SentencePiece-side improvements looked better on cheap screening than they did in proxy or full runs. The winning lesson was not “make the tokenizer bigger.” It was “make the tokenizer better aligned to the artifact budget and to the tiny-model learning dynamics.”

If this submission does end up being among the first tokenizer-changing entries seriously pushed to the top of the leaderboard, we would be delighted to see other people push on the same door. This competition has been especially exciting for cultivating unusual and interesting ideas, and we think tokenizer search deserves a place in that mix.

Full-Data Bundle

For the corrected competition path, we built a full-data Scylla bundle from the published sp1024 FineWeb export by retokenizing in shard order.

The corrected bundle uses:

  • 79 train shards
  • 1 val shard
  • preserved shard ordering
  • preserved validation ordering

Runtime tokenizer assets:

  • candidate.vocab
  • candidate.meta.npz

The metadata artifact supplies:

  • per-token byte lengths
  • leading-space flags
  • boundary-token flags

so the runtime path does not need SentencePiece to inspect tokenizer internals during evaluation.
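A minimal sketch of what metadata-driven byte accounting looks like; the array names (`base_bytes`, `has_leading_space`, `is_boundary_token`) follow those quoted later in the review discussion, and the exact layout of `candidate.meta.npz` is an assumption:

```python
import numpy as np

def total_bytes_from_meta(token_ids, base_bytes, has_leading_space, is_boundary_token):
    """Metadata-driven byte accounting: a token contributes base_bytes[t],
    plus one space byte when it carries a leading space that the previous
    token (a boundary token) does not consume. No tokenizer library is
    invoked at eval time."""
    ids = np.asarray(token_ids)
    lead = has_leading_space.astype(bool)
    bound = is_boundary_token.astype(bool)
    total = int(base_bytes[ids].sum())
    if ids.size:
        total += int(lead[ids[0]])  # first token: no previous consumer
    if ids.size > 1:
        total += int((lead[ids[1:]] & ~bound[ids[:-1]]).sum())
    return total

# Loading the exported LUTs would look roughly like:
# meta = np.load("candidate.meta.npz")
# n = total_bytes_from_meta(ids, meta["base_bytes"],
#                           meta["has_leading_space"], meta["is_boundary_token"])
```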

A compact audit note is included in TOKENIZER_VALIDATION.md.

Legality

This record path is intended to stay within the currently accepted legality standard:

  • no target-conditioned mixing
  • score-first TTT only
  • full-data retokenized bundle with explicit metadata-driven byte accounting

Backward-looking, score-first TTT following PR #461's framework:

  • score a chunk first
  • only then adapt on that already-scored chunk
  • never use future tokens to change the distribution assigned to already-scored tokens

Score-first protocol: the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal score-before-update TTT pattern that organizers have treated as legal in the adaptive track discussion and accepted submissions.
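The score-first constraint above can be sketched as a loop; the `model.score`/`adapt_step` interface here is hypothetical, not the submission's actual train_gpt.py:

```python
import math

def score_first_ttt(model, chunks, adapt_step):
    """Score-first TTT: every chunk is scored with the weights as they
    were BEFORE any adaptation on that chunk, so no already-scored
    token's distribution is ever revised using later data."""
    total_nats = 0.0
    total_bytes = 0
    for chunk in chunks:
        # 1. Score first, under the current (pre-adaptation) weights.
        nats, n_bytes = model.score(chunk)  # hypothetical interface
        total_nats += nats
        total_bytes += n_bytes
        # 2. Only then adapt on the chunk that was just scored.
        adapt_step(model, chunk)
    return total_nats / math.log(2) / total_bytes  # bits per byte
```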

Implementation Notes

The main script in this folder is the promoted legal TTT stack adapted for tokenizer bundles:

  • TOKENIZER_PATH points to the promoted tokenizer vocab
  • TOKENIZER_META_PATH points to the exported metadata LUTs
  • TTT_ENABLED=1

The strongest path found so far combines:

  • the promoted Scylla tokenizer
  • legal score-first TTT
  • the current tuned 11-layer legal stack

Included Files

  • train_gpt.py
  • candidate.vocab
  • candidate.meta.npz
  • manifest.json
  • train_seed42.log
  • train_seed1337.log
  • train_seed2026.log
  • TOKENIZER_VALIDATION.md

Acknowledgements

Thanks to @0hq and @valerio-oai for organizing, maintaining, and moderating an unusually fun and technically demanding competition.

The tokenizer lane also benefited from reading and learning from other competitors’ public work, especially the broader discussion around legal evaluation methods and tokenizer tradeoffs.

simon-marcus changed the title from "Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)" to "Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)" on Mar 30, 2026
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used old stack. We use PR openai#1060's modern stack
(GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer.
Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
@NoesisGenesis commented Mar 31, 2026

Interesting submission! A few findings from a close read of the code:

The Scylla submission's TTT implementation follows the score-first pattern correctly and satisfies conditions 1-4 of #1017. The architecture and quantization appear to be clean. My concern is with the BPB calculation under the custom tokenizer.

TokenMonster uses modifier tokens (IDs 36, 37, 38, 151, 152) that alter the byte output of the following token. For example, tokens 37, 151, and 152 delete the leading space from the next token, reducing its decoded byte length by one. The per-token metadata in candidate.meta.npz stores base_bytes as each token's standalone byte length, but the actual byte output is context-dependent: it changes based on whether a modifier precedes the token.

Concretely, if token 151 (capitalize + delete-space) is followed by token 305 (" he", base_bytes=3), the metadata sums to 0 + 3 = 3 bytes, but the decoded output is "He" = 2 bytes. Every such modifier-token pair overcounts by one byte. Since val_bpb = total_nats / total_bytes, inflating the denominator deflates the reported score.

Across representative text, the metadata-derived byte count exceeds the true decoded byte count by approximately 6%. Applied to the claimed 1.0806:

| | BPB |
| --- | --- |
| Claimed (metadata bytes) | 1.0806 |
| Corrected (true decoded bytes) | ~1.148 |
| Previously merged SOTA | 1.1194 |

With correct byte accounting, the submission does not beat the previously merged leader.

So this is a "rule 15" issue: the custom tokenizer's BPB calculation is not correct. The TOKENIZER_VALIDATION.md states that byte counting is "metadata-driven and deterministic," which is true of the mechanism but not of the result. A straightforward fix would be to compute total_bytes by summing the byte lengths of each token's actual decoded output in sequence context, rather than summing standalone base_bytes values from the metadata array.
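The proposed fix can be sketched with a hypothetical `decode(ids) -> str` callable standing in for the real TokenMonster binding (which is not shown in this PR):

```python
def context_true_bytes(token_ids, decode):
    """Count decoded bytes per token in sequence context: the marginal
    byte contribution of token i is
        len(decode(ids[:i+1])) - len(decode(ids[:i])),
    which charges modifier/token pairs their true joint output instead
    of summing standalone base_bytes values.

    `decode` is a hypothetical callable mapping a token-id list to its
    decoded string; any concrete tokenizer binding would supply it."""
    total = 0
    prev_len = 0
    for i in range(len(token_ids)):
        cur_len = len(decode(token_ids[: i + 1]).encode("utf-8"))
        total += cur_len - prev_len
        prev_len = cur_len
    return total
```

With this accounting, a delete-space modifier followed by a leading-space token contributes the bytes the pair actually emits, not the sum of their standalone lengths.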

Separately, the tokenizer cannot represent 131 of 256 possible byte values (including 101 high bytes needed for UTF-8). Non-ASCII input is handled via NFC-to-NFD Unicode decomposition, which changes the byte representation and introduces a second source of byte-count divergence from the original text. Whether this materially affects the score depends on the non-ASCII content in the FineWeb validation slice.

The tokenizer direction itself is worth pursuing. If the byte accounting is fixed and the score still lands competitively, this would be a strong contribution.

@dexhunter commented Mar 31, 2026

Great work on the tokenizer exploration — this is one of the most creative directions in the competition, and the autoresearch methodology is thorough.

I wanted to flag a byte-accounting issue that affects the reported val_bpb. Per the README:

"If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated. Submissions that edit the tokenizer will be examined much more carefully, since bugs may unjustly improve your score."

(Also flagged in Issue #897 as a known risk with custom tokenizers.)

The issue: The submitted candidate.meta.npz has has_leading_space=0 and is_boundary_token=0 for all tokens. This means the eval byte count is just sum(base_bytes[tgt]). However, for TokenMonster vocabs, this overcounts the actual text bytes by ~4.13% due to two causes:

  1. 38 capcode space-stripping tokens (modifiers ending in D/DC/DW) — these consume the leading space from the following token during decode. They need is_boundary_token=True so the formula token_bytes = base_bytes[tgt] + (has_leading_space[tgt] & ~is_boundary[prev]) suppresses the +1 space byte correctly.

  2. 27 UTF-8 byte tokens (IDs 75-101) — each individually decodes to U+FFFD (3 UTF-8 bytes), but actually represents 1 raw byte in the original text. Setting base_bytes=1 for these fixes the overcount.

I verified on the full FineWeb val set:

# Ground truth: decode all SP1024 val tokens → count UTF-8 bytes
ground_truth_bytes = 151,080,633

# Scylla tokenization with current meta.npz (all zeros):
sum(base_bytes[t]) = 157,319,779  (+4.13% overcount)

# With corrected meta.npz (38 boundary + 27 byte-token fixes):
corrected_byte_count = 151,040,811  (-0.026%, essentially exact)

Since val_bpb = (val_loss / ln2) × (tokens / total_bytes), an overcounted total_bytes deflates tokens/bytes and therefore deflates the reported BPB. With the correct byte count (151M instead of 157M), the BPB would be ~4% higher than reported:

| | Reported (overcounted bytes) | Corrected (ground-truth bytes) |
| --- | --- | --- |
| Sliding BPB | 1.0834 | ~1.128 |
| TTT BPB | 1.0806 | ~1.126 |
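The corrected figures are just the reported ones rescaled by the byte-count ratio, since BPB is inversely proportional to total_bytes at fixed loss; using the counts above:

```python
overcounted = 157_319_779  # sum(base_bytes[t]) with the all-zero meta.npz
corrected = 151_040_811    # after the boundary and byte-token fixes

ratio = overcounted / corrected          # ~1.0416x inflation
print(f"sliding: {1.0834 * ratio:.3f}")  # ~1.128
print(f"TTT:     {1.0806 * ratio:.3f}")  # ~1.126
```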

The fix is straightforward: detect capcode-D tokens via decode testing and mark as is_boundary_token=True, plus set base_bytes=1 for byte fallback tokens. Happy to share the detection/correction script if helpful.

cc @0hq @valerio-oai — this may be relevant for reviewing TokenMonster-based submissions.

@simon-marcus (Author)

@dexhunter @NoesisGenesis Thank you both for the thorough and thoughtful responses here. I am digging into the details now -- will decide whether this is a "revise and resubmit" or "fight me, dammit" situation shortly.

@simon-marcus (Author)

I need to check in on the byte accounting further -- and I don't want to leave this PR open with potentially dirty data, so I'm closing for now.

