Fix models for which we don't have a dedicated tokenizer class, and the listed one is incorrect by itazap · Pull Request #45936 · huggingface/transformers

itazap · 2026-05-13T07:53:44Z

We don't check the tokenization of all the model paths we have in the transformers repo. Fixes #45488. Related to #44255.

We have models that don't have their own dedicated tokenizer class and use another model's tokenizer (e.g. Granite, which uses GPT2Tokenizer; related issue: #45813). The other model's tokenizer class is mapped either in the tokenization_auto.py mapping or in tokenizer_config.json. Sometimes the mapped tokenizer isn't actually the one being used, and v5 surfaced these incorrect mappings, because in v5 we actually try to load the mapped tokenizer class and force the same tokenizer type. To stay true to the pre-v5 behavior of these models, we can map them to TokenizersBackend (equivalent to PreTrainedTokenizerFast in v4), which loads the tokenizer.json as is.
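A minimal sketch of the remapping idea, assuming a plain dictionary lookup. The model type names and the `LISTED_TOKENIZER` / `KNOWN_INCORRECT` structures are hypothetical and do not reflect the actual layout of tokenization_auto.py; only the class names GPT2Tokenizer, LlamaTokenizer and TokenizersBackend come from this discussion. Falling back to the generic backend for unknown model types is also an assumption made for the illustration.

```python
# Hypothetical illustration only: not the real tokenization_auto.py mapping.
GENERIC_BACKEND = "TokenizersBackend"

# Tokenizer class that each (made-up) model type lists for itself.
LISTED_TOKENIZER = {
    "model_a": "GPT2Tokenizer",
    "model_b": "LlamaTokenizer",
}

# Model types whose listed class does not match the shipped tokenizer.json.
KNOWN_INCORRECT = {"model_b"}


def resolve_tokenizer_class(model_type: str) -> str:
    """Return the class name to load for `model_type`.

    Models with a known-incorrect (or missing) listing fall back to the
    generic backend, which would load tokenizer.json as is.
    """
    if model_type in KNOWN_INCORRECT or model_type not in LISTED_TOKENIZER:
        return GENERIC_BACKEND
    return LISTED_TOKENIZER[model_type]
```

With these made-up entries, `resolve_tokenizer_class("model_a")` keeps the listed class, while `"model_b"` is routed to the generic backend.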

Currently we only test tokenization for models that have their own tokenizer class, but we should test tokenization for every checkpoint we have in the repo!

This PR

tools/tokenizer_compare/run_equivalence_comparison_all_checkpoints.py script

(based on that in #44255)

python tools/tokenizer_compare/run_equivalence_comparison_all_checkpoints.py \
  --rerun-from-results tools/tokenizer_compare/output/tokenizer_compare_result.json \
  --output output/tokenizer_compare_results_rerun.json

Scans tests/models/test_modeling_*.py for .from_pretrained(...) calls, extracts all the checkpoint paths we list, and compares the _tokenizer objects loaded via AutoTokenizer.from_pretrained vs TokenizersBackend.from_pretrained.
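The scanning step could look roughly like the sketch below. It is not the PR's script: the regular expression and the `extract_checkpoints` helper are assumptions for illustration, and they only pick up checkpoints passed as a plain string literal in the first positional argument.

```python
import re

# Matches `.from_pretrained("org/name"` where the checkpoint is the first
# positional string literal. Variables and f-strings are skipped.
FROM_PRETRAINED = re.compile(r"""\.from_pretrained\(\s*(['"])([^'"]+)\1""")


def extract_checkpoints(source: str) -> list[str]:
    """Return the sorted, de-duplicated checkpoint ids found in `source`."""
    return sorted({m.group(2) for m in FROM_PRETRAINED.finditer(source)})


sample = '''
model = AutoModel.from_pretrained("bigcode/tiny_starcoder_py")
tok = AutoTokenizer.from_pretrained('google/umt5-small', use_fast=True)
again = AutoModel.from_pretrained( "bigcode/tiny_starcoder_py" )
skipped = AutoModel.from_pretrained(checkpoint_name)
'''
print(extract_checkpoints(sample))
# ['bigcode/tiny_starcoder_py', 'google/umt5-small']
```

A real scan would read each test file from disk and feed its text to the same helper; checkpoints built dynamically would need separate handling.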

tools/tokenizer_compare/compare_tokenizer_xlni_roundtrip.py script

Checks whether TokenizersBackend and AutoTokenizer behave the same way: that they produce the same roundtrip results on XNLI and generate the same token ids.

python tools/tokenizer_compare/compare_tokenizer_xlni_roundtrip.py bigcode/tiny_starcoder_py v1 --backend tokenizers --compare-backend auto
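The comparison itself can be sketched as follows, assuming both tokenizers expose `encode` and `decode`. `TokenizerLike`, `compare_on_texts` and the toy `CharTokenizer` are hypothetical names used only for this illustration; the actual script may be organized differently.

```python
from typing import Protocol


class TokenizerLike(Protocol):
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...


def compare_on_texts(a: TokenizerLike, b: TokenizerLike, texts: list[str]) -> list[dict]:
    """Return one record per text where token ids or decoded text differ."""
    mismatches = []
    for text in texts:
        ids_a, ids_b = a.encode(text), b.encode(text)
        out_a, out_b = a.decode(ids_a), b.decode(ids_b)
        if ids_a != ids_b or out_a != out_b:
            mismatches.append(
                {"text": text, "ids": (ids_a, ids_b), "decoded": (out_a, out_b)}
            )
    return mismatches


class CharTokenizer:
    """Toy stand-in: one id per character, optionally lowercasing first."""

    def __init__(self, lowercase: bool = False):
        self.lowercase = lowercase

    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in (text.lower() if self.lowercase else text)]

    def decode(self, ids: list[int]) -> str:
        return "".join(chr(i) for i in ids)
```

Comparing `CharTokenizer()` with `CharTokenizer(lowercase=True)` on `["abc", "Abc"]` reports a mismatch only for `"Abc"`, which is the kind of normalizer difference the roundtrip check is meant to surface.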

Report

PRE this PR / Fixes - on all the checkpoints we list: report

POST this PR / Fixes: report

Remaining patterns:

Pattern 1 & 2

Expected and safe to ignore

These behave identically according to tools/tokenizer_compare/compare_tokenizer_xlni_roundtrip.py. They are legacy Llama tokenizers (already caught in #44255, "[vllm + v5 fix] handle TokenizersBackend fallback properly for v5").

Pattern 3

Opened a PR on the Hub (https://huggingface.co/allenai/dolma2-tokenizer/discussions/1). There is no config.json, so we cannot force TokenizersBackend from tokenization_auto.py.

Pattern 4

Opened a PR on the Hub for google/umt5-small (https://huggingface.co/google/umt5-small/discussions/2).

When we need precompiled

The charsmap is unordered, and when we use a tokenizers.normalizers normalizer (like NFKC) it will reorder the mappings.

TODO

Add a test that checks that AutoTokenizer and TokenizersBackend load equivalent _tokenizer objects for each path we mention in the repo.
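One possible shape for such an equivalence check is sketched below, assuming each `_tokenizer` can be serialized to a JSON string (the `tokenizers` library's `Tokenizer.to_str()` is assumed here to provide that). `diff_backend_components` and the two sample JSON strings are hypothetical; the samples mimic the ByteLevel vs Metaspace mismatch described in #45488.

```python
import json


def diff_backend_components(json_a: str, json_b: str) -> list[str]:
    """Return the sorted top-level keys whose values differ.

    A key that is missing on one side is treated like an explicit null,
    since both mean "no such component" in a serialized tokenizer.
    """
    a, b = json.loads(json_a), json.loads(json_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Made-up, heavily trimmed serializations for illustration.
auto_json = '{"normalizer": null, "pre_tokenizer": {"type": "Metaspace"}, "model": {"type": "BPE"}}'
backend_json = '{"normalizer": null, "pre_tokenizer": {"type": "ByteLevel"}, "model": {"type": "BPE"}}'

print(diff_backend_components(auto_json, backend_json))
# ['pre_tokenizer']
```

Reporting the differing component names, rather than a bare pass/fail, makes it easier to group checkpoints into patterns like the ones listed above.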

HuggingFaceDocBuilderDev · 2026-05-13T08:05:19Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


itazap · 2026-06-02T12:50:50Z

-            normalizers.Lowercase(),
+            normalizers.Replace(Regex("[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]"), ""),
            normalizers.Replace(Regex(" {2,}"), " "),
+            normalizers.Replace(Regex("[\u200b\u200c\u200d\u200e\u200f]"), " "),


@ArthurZucker lmk if this is preferred to having spm_precompiled_charsmap extracted and passed in as an argument. (This is needed for Arabic text; results are equivalent on XNLI.)
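To make the effect of the three Replace steps concrete, here is a rough stand-in using Python's `re` module instead of the tokenizers `Regex` type. The patterns are copied from the diff above; the `normalize` helper and the assumption that the steps run back to back in this order (with nothing in between) are illustrative only.

```python
import re

# Patterns copied from the diff; Python's `re` stands in for tokenizers' Regex.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")
MULTI_SPACE = re.compile(r" {2,}")
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u200e\u200f]")


def normalize(text: str) -> str:
    """Apply the three replacements in the order they appear in the diff."""
    text = CONTROL_CHARS.sub("", text)  # drop control characters
    text = MULTI_SPACE.sub(" ", text)   # collapse runs of spaces
    text = ZERO_WIDTH.sub(" ", text)    # zero-width and directional marks -> space
    return text


print(repr(normalize("a\x00b  c\u200bd")))
# 'ab c d'
```

The last character class includes U+200E and U+200F, the left-to-right and right-to-left marks, which is consistent with the note that this matters for Arabic text.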

github-actions · 2026-06-02T12:50:51Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: albert, aria, auto, camembert, mpnet, rembert, xglm, xlnet

github-actions · 2026-06-02T13:08:12Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45936&sha=ec9c26

itazap requested a review from ArthurZucker May 15, 2026 05:56

itazap mentioned this pull request May 18, 2026

LlamaTokenizer in v5 overrides tokenizer.json's ByteLevel pre-tokenizer with Metaspace, silently breaks DeepSeek V3/R1 family #45488

Open

itazap added 14 commits May 19, 2026 11:28

we don't check the tokenization of all the model paths we have in the…

8cf2fe3

… repo

fixessss

7723158

rerun with fixes

b90c6fe

deepseek-r1 fix and other cleanup

bf2a837

sort

31b1394

fixessss and style

42c6ed6

clean up scripts

c5dfd1f

clean scripts

cce2efc

org output

41385ed

rm unrelated file

e8a3ad2

clearer script names

2703d89

clean up output

ba05ad4

update output

9ec670b

albert

f0ad678

itazap force-pushed the check_repo_model_tokenizers branch from ee51ff8 to f0ad678 Compare May 19, 2026 02:30

itazap and others added 2 commits May 19, 2026 15:56

final rerun

500444b

Merge branch 'main' into check_repo_model_tokenizers

0501a67

itazap mentioned this pull request May 20, 2026

Update tokenizer mappings to use TokenizersBackend for additional models #46091

Open

itazap added 9 commits May 20, 2026 12:03

adding original report pre fix

702336d

updated report

ea78160

update test

e4b6180

revert changes to albert

5671fc4

update test

d4b6028

albert

8347636

Merge remote-tracking branch 'origin/main' into check_repo_model_toke…

0281eaa

…nizers

update debug script

5911ce9

albert precompiled charsmap alternative

ec9c26d

itazap wants to merge 25 commits into main from check_repo_model_tokenizers

3 participants