fix: restore vocabulary loading in CamembertTokenizer #45714

Open
Milan-Bhimani wants to merge 3 commits into huggingface:main from Milan-Bhimani:Milan_Bhimani_FixTokenization

@Milan-Bhimani

Fixed a regression in v5.7.0 where CamembertTokenizer ignored the vocab_file, causing a fallback to character-level tokenization. Closes #45701

In transformers v5.7.0, the CamembertTokenizer failed to utilize the
  provided `vocab_file` during initialization. This resulted in the
  tokenizer falling back to a dummy vocabulary of ~8 tokens, causing
  a regression where models (such as almanach/camembertv2-base)
  exhibited character-level tokenization and excessive <unk> tokens.

  This commit adds the necessary logic to load the SentencePiece model
  from the `vocab_file` when no explicit `vocab` dictionary is provided,
  ensuring the correct subword tokenization is restored.

  Closes huggingface#45701
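
The loading step described above can be sketched as follows. This is only an illustration of turning a SentencePiece model into the list of `(piece, score)` pairs that a Unigram tokenizer expects, not the code from this PR, and the reviewers later in the thread indicate this is not the direction they want to take. The names `PieceSource`, `pieces_with_scores`, and `FakeProcessor` are hypothetical; the sketch assumes that `sentencepiece.SentencePieceProcessor` exposes `get_piece_size`, `id_to_piece`, and `get_score`.

```python
from typing import List, Protocol, Tuple


class PieceSource(Protocol):
    """Minimal interface the sketch relies on (assumed to match SentencePieceProcessor)."""

    def get_piece_size(self) -> int: ...
    def id_to_piece(self, piece_id: int) -> str: ...
    def get_score(self, piece_id: int) -> float: ...


def pieces_with_scores(sp: PieceSource) -> List[Tuple[str, float]]:
    """Return (piece, score) pairs in id order, so list position equals token id."""
    return [(sp.id_to_piece(i), sp.get_score(i)) for i in range(sp.get_piece_size())]


class FakeProcessor:
    """Hypothetical in-memory stand-in so the sketch runs without a model file."""

    def __init__(self, entries):
        self._entries = list(entries)

    def get_piece_size(self):
        return len(self._entries)

    def id_to_piece(self, piece_id):
        return self._entries[piece_id][0]

    def get_score(self, piece_id):
        return self._entries[piece_id][1]


# Toy data, not taken from any real CamemBERT vocabulary.
demo_vocab = pieces_with_scores(
    FakeProcessor([("<unk>", 0.0), ("▁bon", -2.5), ("jour", -3.0)])
)
```

With `sentencepiece` installed, the stand-in would be replaced by something like `sentencepiece.SentencePieceProcessor(model_file=vocab_file)`, assuming `vocab_file` points at a valid `.model` file.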
Copilot AI review requested due to automatic review settings April 30, 2026 08:31
@github-actions

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: camembert

Copilot AI left a comment

Contributor

Pull request overview

Restores correct vocabulary loading for CamembertTokenizer when a vocab_file is provided, addressing a regression in v5.7.0 that caused fallback to character-level tokenization (issue #45701).

Changes:

  • Add a vocab_file branch to populate self._vocab from a SentencePiece model file.
  • Unify tokenizer construction so unk_id is derived from the loaded vocab in all cases.
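
The second change, deriving `unk_id` from the loaded vocabulary, can be illustrated in isolation. The sketch below mirrors the lookup expression quoted in the diff excerpt later in this thread, applied to a plain list of `(piece, score)` pairs; the helper name `find_unk_index` and the toy vocabulary are hypothetical.

```python
from typing import List, Tuple


def find_unk_index(vocab: List[Tuple[str, float]], unk_token: str = "<unk>") -> int:
    """Return the position of unk_token in vocab, or 0 if it is absent."""
    return next((i for i, (tok, _) in enumerate(vocab) if tok == str(unk_token)), 0)


# Toy vocabulary, not taken from any real model.
toy_vocab = [("<s>", 0.0), ("<pad>", 0.0), ("</s>", 0.0), ("<unk>", 0.0), ("▁le", -3.2)]

unk_index = find_unk_index(toy_vocab)
missing_index = find_unk_index([("a", -1.0), ("b", -2.0)])
```

One property of this pattern worth noting: when the unknown token is missing, the fallback of 0 silently designates whatever piece sits at index 0 as the unknown token, so a caller may prefer to raise an error instead.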

Comment thread src/transformers/models/camembert/tokenization_camembert.py
@itazap

itazap commented May 2, 2026

Collaborator

Hey, sorry, but we can't always force reading the vocab_file; we want to have a proper template enforced if we specify the tokenizer class.

@ArthurZucker ArthurZucker left a comment

Collaborator

I think the agent did not really spend enough thinking tokens..... joking a bit but as @itazap said, this is reallllllllllllllllllly not what we want to go with :)

    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
    self._tokenizer = Tokenizer(Unigram(self._vocab, unk_id=unk_index, byte_fallback=False))
elif vocab_file is not None:
    import sentencepiece as spm

Collaborator

🤣

Development

Successfully merging this pull request may close these issues.

transformers version changes the tokenization

5 participants