Non-record: Custom tokenizer with web-content symbols + pre-tokenized dataset #1210
mikeapedia wants to merge 2 commits into openai:main
Conversation
… `split_digits=false`

Custom BPE tokenizer optimized for FineWeb's web-crawled text. Corpus frequency analysis informed symbol selection (10 high-frequency web patterns). Pre-tokenized dataset available on HuggingFace at Mikeapedia/parameter-golf-data. Untested on H100; sharing for community evaluation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the contribution! We tested the custom tokenizer on our PR #1204 reproduction pipeline (parallel residuals + mini depth recurrence + mixed int5/int6 GPTQ) and wanted to share findings.

Results

We ran on 8×H100 SXM (GCP) with seed 1337:
The custom tokenizer produces results 0.002 BPP worse on the same architecture. The +2.82% increase in validation tokens means the model must predict ~1.75M additional tokens with the same number of parameters, a BPP handicap that the improved token boundaries don't fully compensate for.

Suggestions for improvement
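As a sanity check on that arithmetic, the two quoted figures (+2.82% and ~1.75M extra tokens) imply the size of the validation set:

```python
# Sanity-check the review's arithmetic using only the two quoted figures.
overhead = 0.0282       # +2.82% more validation tokens (quoted)
extra_tokens = 1.75e6   # ~1.75M additional tokens (quoted)

baseline_tokens = extra_tokens / overhead
custom_tokens = baseline_tokens + extra_tokens
print(f"baseline tokenizer: {baseline_tokens / 1e6:.1f}M validation tokens")
print(f"custom tokenizer:   {custom_tokens / 1e6:.1f}M validation tokens")
# → roughly 62.1M baseline vs 63.8M with the custom tokenizer
```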
Community rules compliance (Issue #1017)

This PR appears to comply with all four conditions:
The tokenizer change itself is explicitly permitted by the README ("novel tokenizers" are listed as an encouraged direction), subject to extra scrutiny on BPP calculation correctness. The main concern is not legality but competitiveness: the +2.82% token overhead is a significant handicap that would need to be overcome through model quality improvements or a tokenizer redesign that reduces rather than increases token count. Thanks for sharing the dataset and tokenizer; this kind of open contribution helps the community explore the tokenizer direction!
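On the BPP-scrutiny point: per-byte normalization draws extra attention because the raw byte count is the one denominator a tokenizer cannot change, so computing it correctly is what keeps tokenizers comparable. A toy illustration (all losses here are invented):

```python
# Toy illustration of why bits-per-byte is tokenizer-fair: the denominator
# (raw bytes) is fixed, so only total predictive bits matter, not how the
# text was segmented. Loss values are invented for illustration.
import math

def bits_per_byte(token_losses_nats, n_bytes):
    """Total cross-entropy converted to bits, normalized by raw byte count."""
    total_bits = sum(token_losses_nats) / math.log(2)
    return total_bits / n_bytes

# Same 100-byte text under two hypothetical tokenizers:
coarse = [2.0] * 20   # 20 tokens, higher per-token loss
fine = [1.6] * 25     # 25 tokens, lower per-token loss

print(bits_per_byte(coarse, 100))
print(bits_per_byte(fine, 100))
# Per-token averages would favor `fine`, but the per-byte totals are equal,
# since both spend 40 nats predicting the same 100 bytes.
```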
Summary
- `split_digits=false` and 10 `user_defined_symbols` for common web patterns (URLs, TLDs)

Motivation
The default tokenizer treats all text equally, but FineWeb is web-crawled content with high-frequency URL/TLD patterns. By pre-defining these as atomic symbols and keeping digit sequences intact, we hypothesize that cleaner token boundaries may improve model learning.
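As a toy illustration of the hypothesis (not the PR's actual tokenizer), reserving web patterns as atomic symbols changes where boundaries fall; a pre-split pass over the text sketches the effect:

```python
# Toy pre-split pass: reserved web patterns become standalone chunks instead
# of being merged character-by-character with neighboring text. The symbol
# list is an illustrative subset, not the PR's actual 10 symbols.
import re

WEB_SYMBOLS = ["https://", "http://", "www.", ".com", ".org"]

def pre_split(text):
    """Split text so reserved web patterns come out as atomic chunks."""
    pattern = "(" + "|".join(re.escape(s) for s in WEB_SYMBOLS) + ")"
    return [chunk for chunk in re.split(pattern, text) if chunk]

chunks = pre_split("read https://news.example.com daily")
print(chunks)
# → ['read ', 'https://', 'news.example', '.com', ' daily']
```

The URL scheme and TLD fall on clean boundaries, which is the "atomic symbols" effect the motivation describes.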
How to Use
No code changes needed; `train_gpt.py` already supports custom tokenizers.

Data download from HuggingFace:
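One way to fetch the dataset, assuming the standard Hugging Face CLI (the local directory is a placeholder; the repo id is the one given in the PR description):

```shell
# Install the Hugging Face Hub CLI, then download the pre-tokenized dataset.
# --local-dir is a placeholder; adjust to wherever train_gpt.py expects data.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Mikeapedia/parameter-golf-data \
    --repo-type dataset \
    --local-dir data/parameter_golf
```

(Network-dependent; requires the dataset repo to be public.)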
Status
Untested — no H100 access available. Sharing the tokenizer, data, and tooling for anyone who wants to evaluate. See README for full details and call for testing.
Test plan
🤖 Generated with Claude Code