chore(trainer): add and improve trainer signature #1838
ArthurZucker merged 7 commits into huggingface:main from
Conversation
ArthurZucker
left a comment
This file is automatically generated (# Generated content DO NOT EDIT). What you can do is update the WordLevelTrainer's signature in the bindings!
Hi @ArthurZucker, thanks for the reminder, I didn't see that warning before! I have restored the changes and added the signatures for __init__ to fix Python type check errors in WordLevelTrainer.
ArthurZucker
left a comment
Make sure lists and dicts are not defaulted to [] or {}, and then run make in bindings/python to regenerate the init file.
The change under review:

- #[pyo3(signature = (**kwargs), text_signature = None)]
+ #[pyo3(
+     signature = (**kwargs),
+     text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
+ )]
Suggested change:

-     text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
+     text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=None, limit_alphabet=None, initial_alphabet=None, continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words=None)"
Defaulting to mutables is a mess in Python.
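For readers following along, this is the classic pitfall the None convention guards against. pyo3's text_signature is documentation only (the Rust side parses **kwargs on each call), but advertising [] as a default invites copying the pattern into pure Python, where it does bite. A minimal sketch with hypothetical function names:

def add_special(token, special_tokens=[]):  # the [] is created once, at def time
    special_tokens.append(token)
    return special_tokens

print(add_special("[UNK]"))  # ['[UNK]']
print(add_special("[PAD]"))  # ['[UNK]', '[PAD]'] <- state leaks across calls

# The conventional fix, matching the suggested change above:
def add_special_fixed(token, special_tokens=None):
    if special_tokens is None:
        special_tokens = []
    special_tokens.append(token)
    return special_tokens

print(add_special_fixed("[UNK]"))  # ['[UNK]']
print(add_special_fixed("[PAD]"))  # ['[PAD]']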
Well, I also noticed this. There is a question: I find that the text_signature in WordPieceTrainer already uses []:

#[new]
#[pyo3(
    signature = (**kwargs),
    text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=\"##\", end_of_word_suffix=None)"
)]
pub fn new(kwargs: Option<&Bound<'_, PyDict>>) -> PyResult<(Self, PyTrainer)> {

And I tested it in Python and found that the default values are just the same as the text_signature:

In [5]: import tokenizers; tokenizers.trainers.WordPieceTrainer()
Out[5]: WordPieceTrainer(WordPieceTrainer(bpe_trainer=BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix="##", end_of_word_suffix=None, max_token_length=None, words={})))

And the doc for WordPieceTrainer is consistent with the real default values:
Init signature:
tokenizers.trainers.WordPieceTrainer(
self,
vocab_size=30000,
min_frequency=0,
show_progress=True,
special_tokens=[],
limit_alphabet=None,
initial_alphabet=[],
continuing_subword_prefix='##',
end_of_word_suffix=None,
)
Docstring:
Trainer capable of training a WordPiece model
...(ignored)
So maybe we should keep the text_signature the same as the real default values, to avoid introducing breaking changes and confusing users (due to the inconsistency).
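For what it's worth, the practical stake here is what Python introspection reports. A minimal sketch, assuming pyo3 exposes the attribute above as __text_signature__, which inspect.signature consumes for extension types:

import inspect
import tokenizers

# inspect.signature on an extension type is reconstructed from
# __text_signature__, so whatever the Rust attribute advertises is what
# help(), IDEs, and stub generators will show users.
print(tokenizers.trainers.WordPieceTrainer.__text_signature__)
print(inspect.signature(tokenizers.trainers.WordPieceTrainer))
# Today this echoes special_tokens=[] / initial_alphabet=[]; with the
# suggested change it would read special_tokens=None / initial_alphabet=None.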
Force-pushed from 74ac182 to 0547793
To ease the review process, I copied the current default values of the different trainers here:

In [1]: import tokenizers
In [2]: tokenizers.trainers.BpeTrainer()
Out[2]: BpeTrainer(BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={}))
In [3]: tokenizers.trainers.UnigramTrainer()
Out[3]: UnigramTrainer(UnigramTrainer(show_progress=True, vocab_size=8000, n_sub_iterations=2, shrinking_factor=0.75, special_tokens=[], initial_alphabet=[], unk_token=None, max_piece_length=16, seed_size=1000000, words={}))
In [4]: tokenizers.trainers.WordLevelTrainer()
Out[4]: WordLevelTrainer(WordLevelTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], words={}))
In [5]: tokenizers.trainers.WordPieceTrainer()
Out[5]: WordPieceTrainer(WordPieceTrainer(bpe_trainer=BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix="##", end_of_word_suffix=None, max_token_length=None, words={})))
In [6]: tokenizers.__version__
Out[6]: '0.21.4-dev.0'
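As a quick sanity check on top of this dump (a minimal sketch, using only the documented keyword arguments), explicit kwargs override these defaults regardless of what text_signature advertises, since the Rust side parses **kwargs on every call:

import tokenizers

# Defaults come from the Rust implementation, not from text_signature:
default_trainer = tokenizers.trainers.WordLevelTrainer()
custom_trainer = tokenizers.trainers.WordLevelTrainer(
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["[UNK]", "[PAD]"],  # a fresh list per call, never shared
)

print(default_trainer)  # vocab_size=30000, min_frequency=0, special_tokens=[]
print(custom_trainer)   # vocab_size=5000, min_frequency=2, special_tokens=[...]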