fix: restore vocabulary loading in CamembertTokenizer#45714
Open
Milan-Bhimani wants to merge 3 commits into
Open
fix: restore vocabulary loading in CamembertTokenizer#45714Milan-Bhimani wants to merge 3 commits into
Milan-Bhimani wants to merge 3 commits into
Conversation
In transformers v5.7.0, the CamembertTokenizer failed to utilize the provided `vocab_file` during initialization. This resulted in the tokenizer falling back to a dummy vocabulary of ~8 tokens, causing a regression where models (such as almanach/camembertv2-base) exhibited character-level tokenization and excessive <unk> tokens. This commit adds the necessary logic to load the SentencePiece model from the `vocab_file` when no explicit `vocab` dictionary is provided, ensuring the correct subword tokenization is restored. Closes huggingface#45701
Contributor
|
[For maintainers] Suggested jobs to run (before merge) run-slow: camembert |
Contributor
There was a problem hiding this comment.
Pull request overview
Restores correct vocabulary loading for CamembertTokenizer when a vocab_file is provided, addressing a regression in v5.7.0 that caused fallback to character-level tokenization (issue #45701).
Changes:
- Add a
vocab_filebranch to populateself._vocabfrom a SentencePiece model file. - Unify tokenizer construction so
unk_idis derived from the loaded vocab in all cases.
Collaborator
|
hey sorry but we can't always force reading the vocab_file , we want to have a proper template enforced if we specify the tokenizer class |
Collaborator
There was a problem hiding this comment.
I think the agent did not really spend enough thinking tokens..... joking a bit but as @itazap said, this is reallllllllllllllllllly not what we want to go with :)
| unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0) | ||
| self._tokenizer = Tokenizer(Unigram(self._vocab, unk_id=unk_index, byte_fallback=False)) | ||
| elif vocab_file is not None: | ||
| import sentencepiece as spm |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixed a regression in v5.7.0 where CamembertTokenizer ignored the vocab_file, causing a fallback to character-level tokenization. Closes #45701