Fix models for which we don't have a dedicated tokenizer class, and the listed one is incorrect #45936

Open
itazap wants to merge 25 commits into main from check_repo_model_tokenizers

@itazap itazap commented May 13, 2026

Collaborator

We don't check the tokenization of all the model paths we list in the transformers repo. Fixes #45488, related to #44255.

Some models don't have their own dedicated tokenizer class and use another model's tokenizer (e.g. Granite, which uses GPT2Tokenizer; related issue: #45813). The tokenizer class for such a model is mapped either in the tokenization_auto.py mapping or in tokenizer_config.json. Sometimes the mapped tokenizer isn't actually the one being used, and v5 surfaced these incorrect mappings, because in v5 we actually try to load the mapped tokenizer class and force the same tokenizer type. To stay true to the pre-v5 behavior of these models, we can map them to TokenizersBackend (equivalent to PreTrainedTokenizerFast in v4), which loads the tokenizer.json as is.
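The fallback described above can be pictured with a small, self-contained sketch. Everything in it is a hypothetical stand-in: the dictionary is not the real mapping from tokenization_auto.py, and `resolve_tokenizer_class` and the `mismatched` set are illustrative names, not functions or data from this PR. Only the Granite/GPT2Tokenizer pairing and the TokenizersBackend fallback come from the description.

```python
# Hypothetical, simplified stand-in for a model_type -> tokenizer class listing.
# The real mapping lives in tokenization_auto.py; these entries are illustrative.
LISTED_TOKENIZER = {
    "gpt2": "GPT2Tokenizer",
    "granite": "GPT2Tokenizer",  # pairing mentioned in the PR description
}

# Generic class that loads tokenizer.json as is (named in the PR description).
FALLBACK = "TokenizersBackend"


def resolve_tokenizer_class(model_type, mismatched):
    """Pick the class name to load for a model type.

    `mismatched` is an assumed set of model types whose listed class was found
    not to reproduce the checkpoint's tokenizer.json.
    """
    listed = LISTED_TOKENIZER.get(model_type)
    if listed is None or model_type in mismatched:
        return FALLBACK
    return listed


# "some_model" is a made-up model type used only for the example.
choice_for_listed = resolve_tokenizer_class("gpt2", mismatched={"some_model"})
choice_for_unlisted = resolve_tokenizer_class("some_model", mismatched={"some_model"})
```

The point of the sketch is only the decision rule: a listed class is trusted unless it is known to disagree with the serialized tokenizer, in which case the generic backend is used.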

Currently we only test tokenization for models that have their own tokenizer class, but we should test tokenization for every checkpoint we have in the repo!

This PR

tools/tokenizer_compare/run_equivalence_comparison_all_checkpoints.py script

(based on that in #44255)

python tools/tokenizer_compare/run_equivalence_comparison_all_checkpoints.py \
  --rerun-from-results tools/tokenizer_compare/output/tokenizer_compare_result.json \
  --output output/tokenizer_compare_results_rerun.json

Scans tests/models/test_modeling_*.py for .from_pretrained(...) calls, extracts all the checkpoint paths we list, and compares the _tokenizer objects loaded via AutoTokenizer.from_pretrained vs TokenizersBackend.from_pretrained.
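A minimal sketch of the scanning step, using only the standard library. This is not the script from the PR: the regular expression, the function names, and the recursive directory walk are assumptions made for illustration, and the pattern only captures checkpoints passed as plain string literals.

```python
import re
from pathlib import Path

# Matches the first string-literal argument of a .from_pretrained(...) call.
# Assumption: calls that pass a variable or constant are simply not captured.
FROM_PRETRAINED = re.compile(r"""\.from_pretrained\(\s*["']([^"']+)["']""")


def extract_checkpoints(source):
    """Return the set of checkpoint strings found in one file's source text."""
    return set(FROM_PRETRAINED.findall(source))


def scan_test_files(root):
    """Collect checkpoints from every test_modeling_*.py file under `root`."""
    checkpoints = set()
    for path in Path(root).rglob("test_modeling_*.py"):
        checkpoints |= extract_checkpoints(path.read_text(encoding="utf-8"))
    return sorted(checkpoints)
```

Calling `scan_test_files("tests/models")` from a checkout of the repo would then give a sorted list of candidate checkpoint ids to feed into the comparison.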

tools/tokenizer_compare/compare_tokenizer_xlni_roundtrip.py script

Checks whether TokenizersBackend and AutoTokenizer behave the same way, i.e. produce the same roundtrip results on XNLI and generate the same token ids.

python tools/tokenizer_compare/compare_tokenizer_xlni_roundtrip.py bigcode/tiny_starcoder_py v1 --backend tokenizers --compare-backend auto
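The kind of check this script performs can be sketched as follows. It is a rough illustration under stated assumptions, not the script's actual code: `compare_tokenizers` is a hypothetical helper that only relies on `encode` and `decode` methods, and `ToyCharTokenizer` is a toy stand-in so the example runs without downloading any checkpoint.

```python
def compare_tokenizers(reference, candidate, texts):
    """Return one record per text on which the two tokenizers disagree."""
    mismatches = []
    for text in texts:
        ref_ids = list(reference.encode(text))
        cand_ids = list(candidate.encode(text))
        if ref_ids != cand_ids:
            # Different token ids for the same input text.
            mismatches.append({"text": text, "kind": "token_ids"})
        elif reference.decode(ref_ids) != candidate.decode(cand_ids):
            # Same ids, but decoding them gives different strings.
            mismatches.append({"text": text, "kind": "roundtrip"})
    return mismatches


class ToyCharTokenizer:
    """Toy stand-in (not a real tokenizer): one id per character."""

    def __init__(self, lowercase=False):
        self.lowercase = lowercase

    def encode(self, text):
        if self.lowercase:
            text = text.lower()
        return [ord(ch) for ch in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


report = compare_tokenizers(
    ToyCharTokenizer(), ToyCharTokenizer(lowercase=True), ["abc", "ABC"]
)
```

With real tokenizers, the two arguments would be the objects returned by AutoTokenizer.from_pretrained and TokenizersBackend.from_pretrained for the same checkpoint, and `texts` would be sentences drawn from the evaluation data.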

Report

PRE this PR / Fixes - on all the checkpoints we list: report

POST this PR / Fixes: report

Remaining patterns:

Pattern 1 & 2

Expected and safe to ignore

Pattern 3

Pattern 4

When we need precompiled

The charsmap is unordered, and when we use a tokenizers normalizer (like NFKC) it will reorder the mappings.
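One way a standard Unicode normalizer can differ from a plain character-map lookup is canonical reordering of combining marks. The sketch below shows that effect with the standard library on an Arabic sequence. Whether this is the exact mismatch behind this pattern is an assumption; the PR does not spell it out.

```python
import unicodedata

# Arabic letter beh, then shadda (combining class 33), then fatha (combining class 30).
text = "\u0628\u0651\u064e"

# NFKC applies canonical ordering, which sorts adjacent combining marks by
# combining class, so the fatha moves in front of the shadda.
normalized = unicodedata.normalize("NFKC", text)

marks_reordered = normalized == "\u0628\u064e\u0651"
shadda_class = unicodedata.combining("\u0651")
fatha_class = unicodedata.combining("\u064e")
```

A normalizer that only substitutes characters one mapping at a time would leave the original mark order untouched, so the two approaches can hand different character sequences to the tokenizer model.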

TODO

Add a test that checks that AutoTokenizer and TokenizersBackend load equivalent _tokenizer objects for each checkpoint path we mention in the repo.
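A possible building block for such a test is a helper that compares two serialized backends and names the sections that differ. The sketch below is an assumption about how the test could be structured, not code from this PR. `differing_sections` is a hypothetical name, and the two JSON strings are hand-written fragments using the top-level keys of a tokenizer.json file.

```python
import json


def differing_sections(json_a, json_b):
    """Return the sorted top-level tokenizer.json keys whose values differ."""
    a, b = json.loads(json_a), json.loads(json_b)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


# Hand-written fragments for illustration only (not taken from a real checkpoint).
backend_a = json.dumps({"normalizer": None, "pre_tokenizer": {"type": "ByteLevel"}})
backend_b = json.dumps({"normalizer": None, "pre_tokenizer": {"type": "Metaspace"}})

changed = differing_sections(backend_a, backend_b)
```

In a real test the two strings would presumably come from serializing each loaded backend (for example via the `to_str()` method of a `tokenizers.Tokenizer`), and the assertion would be that the returned list is empty for every checkpoint.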

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

itazap force-pushed the check_repo_model_tokenizers branch from ee51ff8 to f0ad678, May 19, 2026 02:30
normalizers.Lowercase(),
normalizers.Replace(Regex("[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]"), ""),
normalizers.Replace(Regex(" {2,}"), " "),
normalizers.Replace(Regex("[\u200b\u200c\u200d\u200e\u200f]"), " "),

Collaborator Author

@ArthurZucker lmk if this is preferred over extracting spm_precompiled_charsmap and passing it in as an argument. (This is needed for Arabic text, and is equivalent on XNLI.)
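For readers who want to see what the four normalizer steps in the snippet under review do to a string, here is an approximation with the standard `re` module. It is a sketch only: `approximate_normalize` is a hypothetical name, Python's `str.lower()` and `re` may differ from the tokenizers library in edge cases, and the snippet shown in the diff may be part of a longer sequence.

```python
import re

CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")
MULTI_SPACE = re.compile(r" {2,}")
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u200e\u200f]")


def approximate_normalize(text):
    """Apply the four steps in the same order as the snippet."""
    text = text.lower()                 # Lowercase()
    text = CONTROL_CHARS.sub("", text)  # drop control characters
    text = MULTI_SPACE.sub(" ", text)   # collapse runs of spaces
    text = ZERO_WIDTH.sub(" ", text)    # zero-width/direction marks -> space
    return text


cleaned = approximate_normalize("A\x00B  C\u200bD")
```

One consequence of the ordering is that a zero-width character next to an existing space leaves two spaces behind, because the space-collapsing step has already run by then.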

github-actions Bot commented Jun 2, 2026

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: albert, aria, auto, camembert, mpnet, rembert, xglm, xlnet

github-actions Bot commented Jun 2, 2026

Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45936&sha=ec9c26

Development

Successfully merging this pull request may close these issues.

LlamaTokenizer in v5 overrides tokenizer.json's ByteLevel pre-tokenizer with Metaspace, silently breaks DeepSeek V3/R1 family