chore(trainer): add and improve trainer signature #1838
ArthurZucker merged 7 commits into huggingface:main from
Conversation
ArthurZucker
left a comment
This file is automatically generated (# Generated content DO NOT EDIT). What you can do is update the WordLevelTrainer's signature in the bindings!
Hi @ArthurZucker, thanks for the reminder, I didn't see that warning before! I have restored the changes and added the signatures for __init__ to fix Python type check errors in WordLevelTrainer.
ArthurZucker
left a comment
Make sure lists and dicts are not defaulted to [] or {}, and then run make in bindings/python to regenerate the init file.
The change under review:

- #[pyo3(signature = (**kwargs), text_signature = None)]
+ #[pyo3(
+     signature = (**kwargs),
+     text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
+ )]
Suggested change:

-     text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
+     text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=None, limit_alphabet=None, initial_alphabet=None, continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words=None)"
Defaulting to mutables is a mess in Python.
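For readers following along, this is the classic pitfall the None convention guards against. pyo3's text_signature is documentation only (the Rust side parses **kwargs on each call), but advertising [] as a default invites copying the pattern into pure Python, where it does bite. A minimal sketch with hypothetical function names:

def add_special(token, special_tokens=[]):  # the [] is created once, at def time
    special_tokens.append(token)
    return special_tokens

print(add_special("[UNK]"))  # ['[UNK]']
print(add_special("[PAD]"))  # ['[UNK]', '[PAD]'] <- state leaks across calls

# The conventional fix, matching the suggested change above:
def add_special_fixed(token, special_tokens=None):
    if special_tokens is None:
        special_tokens = []
    special_tokens.append(token)
    return special_tokens

print(add_special_fixed("[UNK]"))  # ['[UNK]']
print(add_special_fixed("[PAD]"))  # ['[PAD]']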
Well, I also noticed this. There is a question: I find that the text_signature in WordPieceTrainer already uses []:

#[new]
#[pyo3(
    signature = (**kwargs),
    text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=\"##\", end_of_word_suffix=None)"
)]
pub fn new(kwargs: Option<&Bound<'_, PyDict>>) -> PyResult<(Self, PyTrainer)> {

And I tested it in Python and found that the default values are just the same as the text_signature:

In [5]: import tokenizers; tokenizers.trainers.WordPieceTrainer()
Out[5]: WordPieceTrainer(WordPieceTrainer(bpe_trainer=BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix="##", end_of_word_suffix=None, max_token_length=None, words={})))

And the doc for WordPieceTrainer is consistent with the real default values:
Init signature:
tokenizers.trainers.WordPieceTrainer(
self,
vocab_size=30000,
min_frequency=0,
show_progress=True,
special_tokens=[],
limit_alphabet=None,
initial_alphabet=[],
continuing_subword_prefix='##',
end_of_word_suffix=None,
)
Docstring:
Trainer capable of training a WordPiece model
...(ignored)
So maybe we should keep the text_signature the same as the real default values, to avoid introducing breaking changes and confusing users (due to the inconsistency).
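For what it's worth, the practical stake here is what Python introspection reports. A minimal sketch, assuming pyo3 exposes the attribute above as __text_signature__, which inspect.signature consumes for extension types:

import inspect
import tokenizers

# inspect.signature on an extension type is reconstructed from
# __text_signature__, so whatever the Rust attribute advertises is what
# help(), IDEs, and stub generators will show users.
print(tokenizers.trainers.WordPieceTrainer.__text_signature__)
print(inspect.signature(tokenizers.trainers.WordPieceTrainer))
# Today this echoes special_tokens=[] / initial_alphabet=[]; with the
# suggested change it would read special_tokens=None / initial_alphabet=None.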
Force-pushed from 74ac182 to 0547793
To ease the review process, I copied the current default values of the different trainers here:

In [1]: import tokenizers
In [2]: tokenizers.trainers.BpeTrainer()
Out[2]: BpeTrainer(BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={}))
In [3]: tokenizers.trainers.UnigramTrainer()
Out[3]: UnigramTrainer(UnigramTrainer(show_progress=True, vocab_size=8000, n_sub_iterations=2, shrinking_factor=0.75, special_tokens=[], initial_alphabet=[], unk_token=None, max_piece_length=16, seed_size=1000000, words={}))
In [4]: tokenizers.trainers.WordLevelTrainer()
Out[4]: WordLevelTrainer(WordLevelTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], words={}))
In [5]: tokenizers.trainers.WordPieceTrainer()
Out[5]: WordPieceTrainer(WordPieceTrainer(bpe_trainer=BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix="##", end_of_word_suffix=None, max_token_length=None, words={})))
In [6]: tokenizers.__version__
Out[6]: '0.21.4-dev.0'
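As a quick sanity check on top of this dump (a minimal sketch, using only the documented keyword arguments), explicit kwargs override these defaults regardless of what text_signature advertises, since the Rust side parses **kwargs on every call:

import tokenizers

# Defaults come from the Rust implementation, not from text_signature:
default_trainer = tokenizers.trainers.WordLevelTrainer()
custom_trainer = tokenizers.trainers.WordLevelTrainer(
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["[UNK]", "[PAD]"],  # a fresh list per call, never shared
)

print(default_trainer)  # vocab_size=30000, min_frequency=0, special_tokens=[]
print(custom_trainer)   # vocab_size=5000, min_frequency=2, special_tokens=[...]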