
Non-record: Custom tokenizer with web-content symbols + pre-tokenized dataset#1210

Open
mikeapedia wants to merge 2 commits into openai:main from mikeapedia:custom-tokenizer-pr

Conversation

@mikeapedia

Summary

  • Custom SentencePiece BPE tokenizer with split_digits=false and 10 user_defined_symbols for common web patterns (URLs, TLDs)
  • Corpus frequency analysis of 100K FineWeb docs informed symbol selection (cutoff: 500+ hits per 100K docs)
  • Pre-tokenized binary shards (82 files, ~16GB) uploaded to HuggingFace
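As a rough illustration of the corpus-analysis step described above, here is a minimal stdlib sketch. The candidate pattern list and the `select_symbols` helper are hypothetical stand-ins, not the actual tooling from this PR; the cutoff and per-100K-docs normalization match the numbers stated in the summary.

```python
from collections import Counter

# Hypothetical candidate list; the PR selected 10 symbols from FineWeb stats.
CANDIDATE_PATTERNS = ["https://", "http://", "www.", ".com", ".org", ".net"]

def select_symbols(docs, cutoff=500, per_docs=100_000):
    """Keep candidate patterns whose frequency, normalized to hits per
    `per_docs` documents, meets the cutoff (500+ per 100K docs in the PR)."""
    counts = Counter()
    for doc in docs:
        for pat in CANDIDATE_PATTERNS:
            counts[pat] += doc.count(pat)
    scale = per_docs / max(len(docs), 1)
    return [p for p, c in counts.items() if c * scale >= cutoff]

docs = ["visit https://example.com or www.example.org"] * 600
print(select_symbols(docs))  # → ['https://', 'www.', '.com', '.org']
```

The surviving patterns would then be passed to SentencePiece training as `user_defined_symbols` so they tokenize atomically.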

Motivation

The default tokenizer treats all text equally, but FineWeb is web-crawled content with high-frequency URL/TLD patterns. By pre-defining these as atomic symbols and keeping digit sequences intact, we hypothesize that cleaner token boundaries may improve model learning.

How to Use

No code changes needed — train_gpt.py already supports custom tokenizers:

TOKENIZER_PATH=data/tokenizers/fineweb_1024_custom.model \
DATA_PATH=data/datasets/fineweb10B_sp1024_custom \
torchrun --nproc_per_node=8 train_gpt.py

Data download from HuggingFace:

huggingface-cli download Mikeapedia/parameter-golf-data --local-dir ./data

Status

Untested — no H100 access available. Sharing the tokenizer, data, and tooling for anyone who wants to evaluate. See README for full details and call for testing.

Test plan

  • Someone with H100 access downloads data from HuggingFace
  • Runs training with TOKENIZER_PATH and DATA_PATH env vars
  • Reports val_bpb compared to default tokenizer baseline
  • Results shared in PR comments or Discord
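For anyone running the comparison, val_bpb in its standard form is the summed validation cross-entropy converted to bits, divided by the raw byte count of the validation text. This is a sketch of that definition only; the repo's exact accounting may differ:

```python
import math

def val_bpb(total_loss_nats: float, total_bytes: int) -> float:
    """Standard bits-per-byte: summed validation cross-entropy (in nats)
    converted to bits, divided by the UTF-8 byte count of the val set.
    The byte denominator is tokenizer-independent, which is what makes
    bpb comparable across different tokenizers."""
    return total_loss_nats / math.log(2) / total_bytes

# Sanity check: a loss of ln(2) nats per byte is exactly 1.0 bpb.
assert abs(val_bpb(math.log(2) * 1000, 1000) - 1.0) < 1e-12
```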

🤖 Generated with Claude Code

mikeapedia and others added 2 commits March 31, 2026 21:55
… split_digits=false

Custom BPE tokenizer optimized for FineWeb's web-crawled text. Corpus frequency
analysis informed symbol selection (10 high-frequency web patterns). Pre-tokenized
dataset available on HuggingFace at Mikeapedia/parameter-golf-data. Untested on
H100 — sharing for community evaluation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dexhunter

Thanks for the contribution! We tested the custom tokenizer on our PR #1204 reproduction pipeline (parallel residuals + mini depth recurrence + mixed int5/int6 GPTQ) and wanted to share findings.

Results

We ran on 8×H100 SXM (GCP) with seed 1337:

| Tokenizer | val_bpb | Val tokens | Token overhead |
|---|---|---|---|
| SP1024 (baseline) | 1.10419 | 62,021,632 | |
| Custom (this PR) | 1.10636 | 63,770,624 | +2.82% |

The custom tokenizer comes out ~0.002 bpb worse on the same architecture. The +2.82% more validation tokens means the model needs to predict ~1.75M additional tokens with the same number of parameters, a handicap that the improved token boundaries don't fully compensate for.
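The overhead figures can be checked directly from the token counts in the table:

```python
baseline_tokens = 62_021_632  # SP1024 validation tokens
custom_tokens = 63_770_624    # custom tokenizer validation tokens

extra = custom_tokens - baseline_tokens
overhead = extra / baseline_tokens
print(f"{extra:,} extra tokens, {overhead:+.2%} overhead")
# → 1,748,992 extra tokens, +2.82% overhead
```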

Suggestions for improvement

  1. Reduce token count: The current tokenizer creates more tokens than SP1024 (63.8M vs 62.0M). A tokenizer that creates fewer tokens per byte while maintaining coverage would be more competitive — see PR #1143 ("Record: Scylla (novel tokenizer) + Legal Score-First TTT", val_bpb 1.08056553), which uses Scylla/TokenMonster with a 998-token vocab and achieves significant bpb gains through token efficiency.

  2. Model co-optimization: Simply swapping the tokenizer without retuning the model (BigramHash primes, LR schedule, warmdown) leaves potential gains on the table. The hash function constants (36313, 27191, etc.) were tuned for SP1024 — they may not be optimal for different token distributions.

  3. Byte accounting verification: Per the README: "If changes are made to the tokenizer, prove with certainty that the val_bpb is correctly calculated." We verified that build_sentencepiece_luts() produces correct byte counts for this tokenizer (the U+2581 leading space token exists as expected), but note this is something organizers will scrutinize carefully.
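To illustrate what the byte-accounting check amounts to, here is a hypothetical stdlib sketch in the spirit of `build_sentencepiece_luts()` (the helper names here are invented, not the repo's actual code). Under SentencePiece's convention, U+2581 ("▁") marks a leading space, so each piece's byte contribution is the UTF-8 length of the piece with "▁" mapped back to " ":

```python
def piece_byte_len(piece: str) -> int:
    """UTF-8 byte length of a SentencePiece piece, with the U+2581
    word-boundary marker counted as a single ASCII space."""
    return len(piece.replace("\u2581", " ").encode("utf-8"))

def check_byte_accounting(text: str, pieces: list[str]) -> bool:
    """The pieces of a segmentation should account for exactly the
    input's byte count, or val_bpb's denominator is wrong."""
    return sum(piece_byte_len(p) for p in pieces) == len(text.encode("utf-8"))

# Note the leading space in the input: SentencePiece's add_dummy_prefix
# default prepends one space, which the accounting must also cover.
assert check_byte_accounting(" hello world", ["\u2581hello", "\u2581world"])
```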

Community rules compliance (Issue #1017)

This PR appears to comply with all four conditions:

  • ✅ Condition 1 (Causality): Standard autoregressive prediction, no future token dependence
  • ✅ Condition 2 (Full normalized distribution): Standard softmax over full vocab
  • ✅ Condition 3 (Score-before-update): No TTT, no eval-time adaptation
  • ✅ Condition 4 (Single L→R pass): Standard sliding window evaluation

The tokenizer change itself is explicitly permitted by the README ("novel tokenizers" are listed as an encouraged direction), subject to extra scrutiny on bpb calculation correctness.

The main concern is not legality but competitiveness — the +2.82% token overhead is a significant handicap that would need to be overcome through model quality improvements or a tokenizer redesign that reduces rather than increases token count.

Thanks for sharing the dataset and tokenizer — this kind of open contribution helps the community explore the tokenizer direction!
