fix: restore vocabulary loading in CamembertTokenizer by Milan-Bhimani · Pull Request #45714 · huggingface/transformers

Milan-Bhimani · 2026-04-30T08:31:20Z

Fixed a regression in v5.7.0 where CamembertTokenizer ignored the vocab_file, causing a fallback to character-level tokenization. Closes #45701

In transformers v5.7.0, `CamembertTokenizer` did not use the provided `vocab_file` during initialization. The tokenizer fell back to a dummy vocabulary of ~8 tokens, so models such as almanach/camembertv2-base produced character-level tokenization and excessive `<unk>` tokens. This commit adds logic to load the SentencePiece model from the `vocab_file` when no explicit `vocab` dictionary is provided, restoring correct subword tokenization. Closes huggingface#45701
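
The symptom described above (mostly `<unk>` tokens and single-character pieces) can be checked with a small heuristic. This is a minimal sketch that assumes the tokens are already available as a list of strings; the function name `fallback_symptoms` and both thresholds are hypothetical and not part of transformers.

```python
def fallback_symptoms(tokens, unk_token="<unk>", max_unk_ratio=0.2, min_avg_len=1.5):
    """Return (unk_ratio, avg_len, suspicious) for a list of token strings.

    Thresholds are illustrative assumptions, not values taken from the PR.
    """
    if not tokens:
        return 0.0, 0.0, False
    # Share of tokens that are the unknown token.
    unk_ratio = sum(t == unk_token for t in tokens) / len(tokens)
    # SentencePiece marks word starts with U+2581; strip it before measuring length.
    lengths = [len(t.lstrip("\u2581")) for t in tokens if t != unk_token]
    avg_len = sum(lengths) / len(lengths) if lengths else 0.0
    # Flag output that is dominated by <unk> or by very short pieces.
    suspicious = unk_ratio > max_unk_ratio or avg_len < min_avg_len
    return unk_ratio, avg_len, suspicious
```

Passing the output of `tokenizer.tokenize(text)` for a short French sentence would be one way to use it; a healthy subword tokenizer should not be flagged.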

github-actions · 2026-04-30T08:32:31Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: camembert

Copilot

Pull request overview

Restores correct vocabulary loading for CamembertTokenizer when a vocab_file is provided, addressing a regression in v5.7.0 that caused fallback to character-level tokenization (issue #45701).

Changes:

Add a vocab_file branch to populate self._vocab from a SentencePiece model file.
Unify tokenizer construction so unk_id is derived from the loaded vocab in all cases.
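
The two changes listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: `vocab_from_sentencepiece` and `find_unk_id` are hypothetical helper names, and the `sp` argument is assumed to expose `get_piece_size`, `id_to_piece`, and `get_score`, as `sentencepiece.SentencePieceProcessor` does. Note that reviewers later in this thread object to always reading the `vocab_file`.

```python
def vocab_from_sentencepiece(sp):
    """Build a list of (piece, score) pairs from a SentencePiece-like processor.

    `sp` is duck-typed: any object with get_piece_size(), id_to_piece(i),
    and get_score(i) works (hypothetical helper, for illustration only).
    """
    return [(sp.id_to_piece(i), sp.get_score(i)) for i in range(sp.get_piece_size())]


def find_unk_id(vocab, unk_token="<unk>"):
    """Return the index of unk_token in the vocab, or 0 if it is absent.

    Mirrors the lookup shown in the diff fragment of this PR.
    """
    return next((i for i, (tok, _) in enumerate(vocab) if tok == str(unk_token)), 0)
```

With the real library, `sp` could be created as `sentencepiece.SentencePieceProcessor(model_file=vocab_file)`, and the resulting list and index passed to `Unigram(vocab, unk_id=..., byte_fallback=False)` as in the diff.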

itazap · 2026-05-02T10:14:51Z

hey sorry but we can't always force reading the vocab_file, we want to have a proper template enforced if we specify the tokenizer class

ArthurZucker

I think the agent did not really spend enough thinking tokens..... joking a bit but as @itazap said, this is reallllllllllllllllllly not what we want to go with :)

ArthurZucker · 2026-05-13T02:32:39Z

-            unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
-            self._tokenizer = Tokenizer(Unigram(self._vocab, unk_id=unk_index, byte_fallback=False))
+        elif vocab_file is not None:
+            import sentencepiece as spm


Milan-Bhimani added 2 commits April 30, 2026 12:29

style: fix formatting in CamembertTokenizer

50b8154

Copilot AI review requested due to automatic review settings April 30, 2026 08:31

Merge branch 'main' into Milan_Bhimani_FixTokenization

e95dcf5

Copilot AI reviewed Apr 30, 2026


3 comment threads on src/transformers/models/camembert/tokenization_camembert.py

ArthurZucker reviewed May 13, 2026

Milan-Bhimani wants to merge 3 commits into huggingface:main from Milan-Bhimani:Milan_Bhimani_FixTokenization

5 participants
