fix: route Granite models to TokenizersBackend to preserve tokenizer.json pre-tokenizer #45813
…json pre-tokenizer previous fix was not triggering Fixes huggingface#45812
|
Thank you @kndtran for finding this and producing a fix! I'm looking into why existing tests didn't find this so we should also add / update a test for granite models. Sorry it slipped through the cracks! |
ArthurZucker
left a comment
fix is valid tho! as @itazap said we need to probably add tests (integration) to cover that case
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
transformers/tests/models/granitemoehybrid/test_modeling_granitemoehybrid.py
Lines 373 to 384 in d4b9110
This is the only test we had for this; we could add something like `self.assertEqual(tokenizer.encode("650841823", add_special_tokens=False), EXPECTED_OUTPUT)`, etc. |
|
@kndtran please let me know if you'd like to add a test, otherwise I can! |
|
@itazap Ah, I thought you were replying to Arthur. I will add a few tests. There were a few other odd strings too. |
Verify that AutoTokenizer produces correct token IDs for digit strings, punctuation, and mixed alphanumeric inputs.
|
@itazap I added a new test class as the torch gating seemed unnecessary. Please modify as needed. Here are some tables showing the tokenization effects on the test strings.
Decoded Token Comparison
Token ID Comparison
Reproduce
```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

MODEL = "ibm-granite/granite-4.0-h-tiny"
STRINGS = ["2023", "650841823", "60-138-3818", "d.o.o", "FY2023", "ISO 9001:2015"]

broken = AutoTokenizer.from_pretrained(MODEL)
fixed = PreTrainedTokenizerFast.from_pretrained(MODEL, use_fast=True)

print(f"{'String':20s} {'v5 (broken)':45s} {'v5 (fixed)'}")
print("─" * 110)
for s in STRINGS:
    b_ids = broken.encode(s, add_special_tokens=False)
    f_ids = fixed.encode(s, add_special_tokens=False)
    b_tok = [broken.decode([i]) for i in b_ids]
    f_tok = [fixed.decode([i]) for i in f_ids]
    print(f"{s:20s} {str(b_tok):45s} {str(f_tok)}")

print()
print(f"{'String':20s} {'v5 IDs (broken)':45s} {'v5 IDs (fixed)'}")
print("─" * 110)
for s in STRINGS:
    b_ids = broken.encode(s, add_special_tokens=False)
    f_ids = fixed.encode(s, add_special_tokens=False)
    print(f"{s:20s} {str(b_ids):45s} {str(f_ids)}")
```
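For a regression test, the side-by-side comparison can be reduced to a helper that reports only the strings on which two encoders disagree. This is a minimal sketch, not code from the PR: `find_mismatches`, `per_char`, and `digits_merged` are hypothetical names, and the two toy encoders only stand in for real tokenizers so that the block runs without downloading a model.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def find_mismatches(
    reference: Encoder, candidate: Encoder, strings: Iterable[str]
) -> Dict[str, Tuple[List[int], List[int]]]:
    """Return {string: (reference_ids, candidate_ids)} for every disagreement."""
    mismatches = {}
    for s in strings:
        ref_ids, cand_ids = reference(s), candidate(s)
        if ref_ids != cand_ids:
            mismatches[s] = (ref_ids, cand_ids)
    return mismatches


# Toy encoders (hypothetical, for illustration only).
def per_char(text: str) -> List[int]:
    # One ID per character.
    return [ord(c) for c in text]


def digits_merged(text: str) -> List[int]:
    # One ID per run of digits, one ID per other character.
    ids = []
    for piece in re.findall(r"\d+|\D", text):
        ids.append(int(piece) + 1000 if piece.isdecimal() else ord(piece))
    return ids


if __name__ == "__main__":
    report = find_mismatches(per_char, digits_merged, ["2023", "d.o.o", "FY2023"])
    for s, (ref_ids, cand_ids) in report.items():
        print(f"{s!r}: {ref_ids} != {cand_ids}")
```

With real tokenizers, each encoder could be wrapped as `lambda s: tok.encode(s, add_special_tokens=False)`, matching the calls in the script above; an empty result would mean the two loading paths agree on every test string.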
|
Thank you @kndtran ! Appreciate your thorough investigation on this 🙌 |
|
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, granitemoehybrid |
|
run-slow: auto, granitemoehybrid |
|
This comment contains models: ["models/auto", "models/granitemoehybrid"] |
…json pre-tokenizer (huggingface#45813)
* fix: add Granite models to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS
  Fixes huggingface#45812
* fix: route Granite models to TokenizersBackend to preserve tokenizer.json pre-tokenizer
  previous fix was not triggering
  Fixes huggingface#45812
* test: add tokenizer encoding test for Granite 4+
  Verify that AutoTokenizer produces correct token IDs for digit strings, punctuation, and mixed alphanumeric inputs.
---------
Co-authored-by: Khoi-Nguyen Tran <kndtran@ibm.com>
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
What does this PR do?
Change `TOKENIZER_MAPPING_NAMES` for Granite model types from `"GPT2Tokenizer"` to `"TokenizersBackend"` so that `AutoTokenizer` loads `tokenizer.json` faithfully instead of routing through `GPT2Tokenizer.__init__`, which hardcodes a wrong pre-tokenizer.
Fixes #45812
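The effect of changing the mapped class name can be illustrated with a toy model of the routing. This is a sketch under stated assumptions, not the transformers implementation: `HardcodedLoader`, `FaithfulLoader`, `LOADERS`, and `load` are hypothetical stand-ins, and both pre-tokenization rules are invented for illustration (they are not the actual GPT-2 or Granite rules).

```python
import re
from typing import Callable, Dict, List

PreTokenizer = Callable[[str], List[str]]


def hardcoded_pretokenize(text: str) -> List[str]:
    # Illustrative rule baked into the loader: keeps digit runs together.
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]+", text)


def file_defined_pretokenize(text: str) -> List[str]:
    # Illustrative rule standing in for what tokenizer.json declares:
    # splits every digit into its own piece.
    return re.findall(r"[A-Za-z]+|\d|[^\sA-Za-z\d]+", text)


class HardcodedLoader:
    """Stand-in for a class whose __init__ overrides the file's pre-tokenizer."""

    def __init__(self, file_pretokenizer: PreTokenizer):
        self.pretokenize = hardcoded_pretokenize  # file setting is ignored


class FaithfulLoader:
    """Stand-in for a backend that keeps whatever the file declares."""

    def __init__(self, file_pretokenizer: PreTokenizer):
        self.pretokenize = file_pretokenizer


LOADERS = {"GPT2Tokenizer": HardcodedLoader, "TokenizersBackend": FaithfulLoader}


def load(model_type: str, mapping: Dict[str, str]):
    # Resolve the class name from the mapping, then build the loader.
    return LOADERS[mapping[model_type]](file_defined_pretokenize)


before = {"granitemoehybrid": "GPT2Tokenizer"}
after = {"granitemoehybrid": "TokenizersBackend"}

if __name__ == "__main__":
    print(load("granitemoehybrid", before).pretokenize("FY2023"))
    print(load("granitemoehybrid", after).pretokenize("FY2023"))
```

In this toy version, only the mapping entry changes between `before` and `after`; the file-defined rule is the same in both cases, which mirrors why a one-line mapping change is enough to alter the resulting token boundaries.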
Code Agent Policy
The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.
PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.
This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read
CONTRIBUTING.md.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.