
Non-record: Custom tokenizer with web-content symbols + pre-tokenized dataset#1210

Open
mikeapedia wants to merge 2 commits into openai:main from mikeapedia:custom-tokenizer-pr

Conversation

@mikeapedia

Summary

  • Custom SentencePiece BPE tokenizer with split_digits=false and 10 user_defined_symbols for common web patterns (URLs, TLDs)
  • Corpus frequency analysis of 100K FineWeb docs informed symbol selection (cutoff: 500+ hits per 100K docs)
  • Pre-tokenized binary shards (82 files, ~16GB) uploaded to HuggingFace
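As a rough illustration of the corpus-analysis step described above, here is a minimal stdlib sketch. The candidate pattern list and the `select_symbols` helper are hypothetical stand-ins, not the actual tooling from this PR; the cutoff and per-100K-docs normalization match the numbers stated in the summary.

```python
from collections import Counter

# Hypothetical candidate list; the PR selected 10 symbols from FineWeb stats.
CANDIDATE_PATTERNS = ["https://", "http://", "www.", ".com", ".org", ".net"]

def select_symbols(docs, cutoff=500, per_docs=100_000):
    """Keep candidate patterns whose frequency, normalized to hits per
    `per_docs` documents, meets the cutoff (500+ per 100K docs in the PR)."""
    counts = Counter()
    for doc in docs:
        for pat in CANDIDATE_PATTERNS:
            counts[pat] += doc.count(pat)
    scale = per_docs / max(len(docs), 1)
    return [p for p, c in counts.items() if c * scale >= cutoff]

docs = ["visit https://example.com or www.example.org"] * 600
print(select_symbols(docs))  # → ['https://', 'www.', '.com', '.org']
```

The surviving patterns would then be passed to SentencePiece training as `user_defined_symbols` so they tokenize atomically.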

Motivation

The default tokenizer treats all text equally, but FineWeb is web-crawled content with high-frequency URL/TLD patterns. By pre-defining these as atomic symbols and keeping digit sequences intact, we hypothesize that cleaner token boundaries may improve model learning.

How to Use

No code changes needed — train_gpt.py already supports custom tokenizers:

TOKENIZER_PATH=data/tokenizers/fineweb_1024_custom.model \
DATA_PATH=data/datasets/fineweb10B_sp1024_custom \
torchrun --nproc_per_node=8 train_gpt.py

Data download from HuggingFace:

huggingface-cli download Mikeapedia/parameter-golf-data --local-dir ./data

Status

Untested — no H100 access available. Sharing the tokenizer, data, and tooling for anyone who wants to evaluate. See README for full details and call for testing.

Test plan

  • Someone with H100 access downloads data from HuggingFace
  • Runs training with TOKENIZER_PATH and DATA_PATH env vars
  • Reports val_bpb compared to default tokenizer baseline
  • Results shared in PR comments or Discord
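For anyone running the comparison, val_bpb in its standard form is the summed validation cross-entropy converted to bits, divided by the raw byte count of the validation text. This is a sketch of that definition only; the repo's exact accounting may differ:

```python
import math

def val_bpb(total_loss_nats: float, total_bytes: int) -> float:
    """Standard bits-per-byte: summed validation cross-entropy (in nats)
    converted to bits, divided by the UTF-8 byte count of the val set.
    The byte denominator is tokenizer-independent, which is what makes
    bpb comparable across different tokenizers."""
    return total_loss_nats / math.log(2) / total_bytes

# Sanity check: a loss of ln(2) nats per byte is exactly 1.0 bpb.
assert abs(val_bpb(math.log(2) * 1000, 1000) - 1.0) < 1e-12
```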

🤖 Generated with Claude Code

mikeapedia and others added 2 commits March 31, 2026 21:55
… split_digits=false

Custom BPE tokenizer optimized for FineWeb's web-crawled text. Corpus frequency
analysis informed symbol selection (10 high-frequency web patterns). Pre-tokenized
dataset available on HuggingFace at Mikeapedia/parameter-golf-data. Untested on
H100 — sharing for community evaluation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dexhunter

Thanks for the contribution! We tested the custom tokenizer on our PR #1204 reproduction pipeline (parallel residuals + mini depth recurrence + mixed int5/int6 GPTQ) and wanted to share findings.

Results

We ran on 8×H100 SXM (GCP) with seed 1337:

| Tokenizer | val_bpb | Val tokens | Token overhead |
|---|---|---|---|
| SP1024 (baseline) | 1.10419 | 62,021,632 | |
| Custom (this PR) | 1.10636 | 63,770,624 | +2.82% |

The custom tokenizer comes out ~0.002 bpb worse on the same architecture. The +2.82% more validation tokens means the model needs to predict ~1.75M additional tokens with the same number of parameters, a handicap that the improved token boundaries don't fully compensate for.
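The overhead figures can be checked directly from the token counts in the table:

```python
baseline_tokens = 62_021_632  # SP1024 validation tokens
custom_tokens = 63_770_624    # custom tokenizer validation tokens

extra = custom_tokens - baseline_tokens
overhead = extra / baseline_tokens
print(f"{extra:,} extra tokens, {overhead:+.2%} overhead")
# → 1,748,992 extra tokens, +2.82% overhead
```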

Suggestions for improvement

  1. Reduce token count: The current tokenizer creates more tokens than SP1024 (63.8M vs 62.0M). A tokenizer that creates fewer tokens per byte while maintaining coverage would be more competitive — see PR #1143 ("Record: Scylla (novel tokenizer) + Legal Score-First TTT", val_bpb 1.08056553), which uses Scylla/TokenMonster with a 998-token vocab and achieves significant bpb gains through token efficiency.

  2. Model co-optimization: Simply swapping the tokenizer without retuning the model (BigramHash primes, LR schedule, warmdown) leaves potential gains on the table. The hash function constants (36313, 27191, etc.) were tuned for SP1024 — they may not be optimal for different token distributions.

  3. Byte accounting verification: Per the README: "If changes are made to the tokenizer, prove with certainty that the val_bpb is correctly calculated." We verified that build_sentencepiece_luts() produces correct byte counts for this tokenizer (the U+2581 leading space token exists as expected), but note this is something organizers will scrutinize carefully.
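To illustrate what the byte-accounting check amounts to, here is a hypothetical stdlib sketch in the spirit of `build_sentencepiece_luts()` (the helper names here are invented, not the repo's actual code). Under SentencePiece's convention, U+2581 ("▁") marks a leading space, so each piece's byte contribution is the UTF-8 length of the piece with "▁" mapped back to " ":

```python
def piece_byte_len(piece: str) -> int:
    """UTF-8 byte length of a SentencePiece piece, with the U+2581
    word-boundary marker counted as a single ASCII space."""
    return len(piece.replace("\u2581", " ").encode("utf-8"))

def check_byte_accounting(text: str, pieces: list[str]) -> bool:
    """The pieces of a segmentation should account for exactly the
    input's byte count, or val_bpb's denominator is wrong."""
    return sum(piece_byte_len(p) for p in pieces) == len(text.encode("utf-8"))

# Note the leading space in the input: SentencePiece's add_dummy_prefix
# default prepends one space, which the accounting must also cover.
assert check_byte_accounting(" hello world", ["\u2581hello", "\u2581world"])
```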

Community rules compliance (Issue #1017)

This PR appears to comply with all four conditions:

  • ✅ Condition 1 (Causality): Standard autoregressive prediction, no future token dependence
  • ✅ Condition 2 (Full normalized distribution): Standard softmax over full vocab
  • ✅ Condition 3 (Score-before-update): No TTT, no eval-time adaptation
  • ✅ Condition 4 (Single L→R pass): Standard sliding window evaluation

The tokenizer change itself is explicitly permitted by the README ("novel tokenizers" are listed as an encouraged direction), subject to extra scrutiny on bpb calculation correctness.

The main concern is not legality but competitiveness — the +2.82% token overhead is a significant handicap that would need to be overcome through model quality improvements or a tokenizer redesign that reduces rather than increases token count.

Thanks for sharing the dataset and tokenizer — this kind of open contribution helps the community explore the tokenizer direction!
