Refactor slow sentencepiece tokenizers #11716
Conversation
See here:
Force-pushed from acf704f to b176ef5
Rebased on upstream/master
Use sentencepiece.decode to implement convert_tokens_to_string.
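For context, a minimal sketch of the idea (not the exact PR diff): let sentencepiece do the detokenization instead of hand-rolled string surgery. The model path and input text below are placeholders.

```python
import sentencepiece as spm

SPIECE_UNDERLINE = "\u2581"

sp_model = spm.SentencePieceProcessor()
sp_model.Load("spiece.model")  # placeholder path to a trained sentencepiece model

tokens = sp_model.EncodeAsPieces("this is a test.")

# Old pattern duplicated across many slow tokenizers:
old_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()

# Proposed replacement: round-trip through sentencepiece itself,
# which also applies the model's own denormalization rules.
new_string = sp_model.DecodePieces(tokens)
```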
We need to rebase on master after PR #11737 has been merged.
Force-pushed from f121d18 to 8dc9db6
Rebased on master - CI is green again. :-)
Force-pushed from 8dc9db6 to 4000bed
Rebased on master to get integration tests - see #11737
Force-pushed from 4000bed to 9a06625
Force-pushed from 9a06625 to 53c9e49
Rebased on master
Force-pushed from 622bf36 to 5170463
Hey @LysandreJik - I would like to hear your feedback about this function.
Is it cool to refactor it into the base class? Or is it overengineered?
Thanks
Philip
I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.
LysandreJik left a comment
This is an interesting proposal! I'm not sure I understand everything, so I'm asking a few questions :)
I'm not too keen on having an additional abstraction layer PreTrainedSentencepieceTokenizer.
I thought the original idea was to replace instances of "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip() with self._tokenizer.decode(tokens).
Could you explain why the changes proposed here are necessary?
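For readers following along, here is a hedged sketch of what such an intermediate layer could look like. The class name PreTrainedSentencepieceTokenizer comes from this review thread; the body is an assumption for illustration, not code from the PR.

```python
from transformers import PreTrainedTokenizer

class PreTrainedSentencepieceTokenizer(PreTrainedTokenizer):
    """Hypothetical shared base for slow tokenizers backed by sentencepiece.

    Assumes subclasses set `self.sp_model` to a loaded
    sentencepiece.SentencePieceProcessor, as the slow tokenizers do.
    """

    def convert_tokens_to_string(self, tokens):
        # Delegate detokenization to sentencepiece instead of repeating
        # "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
        # in every subclass.
        return self.sp_model.decode_pieces(tokens)
```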
> Could you explain why the changes proposed here are necessary?

Well, while doing this refactoring I noticed a lot of duplicated code across the tokenizer implementations.
Since WET code is hard to maintain, I tried to refactor it.
But if you do not like my refactoring, I can just do the single change and that's it.
@LysandreJik
The general approach of the library is to keep the number of abstractions as low as possible, and to keep implementations as separate as possible from each other, hence the high amount of copy-pasted code. We want users to be able to experiment with single models/tokenizers without their changes impacting other models or tokenizers - and we want them to be able to understand how a model or tokenizer behaves by simply checking a single file, rather than having to hop around multiple files. We are failing at this with tokenizers as there are already two levels of abstraction, but adding a third one isn't really the direction we want to head to :) Does that make sense?
Yes. Sure. Your project, your call.
Force-pushed from 9a4d6fa to 960a76f
@LysandreJik I have redone the PR. Everything is green and the changes are as simple as planned in the issue. Everything is tested by setting …
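As a rough illustration of the kind of check involved, a sketch of a round-trip test. The test body is illustrative, not the PR's actual test code, and it downloads the public albert-base-v1 checkpoint.

```python
from transformers import AlbertTokenizer

def test_convert_tokens_to_string_roundtrip():
    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
    # Lowercase input, since ALBERT's tokenizer lowercases by default.
    text = "this is a test."
    tokens = tokenizer.tokenize(text)
    # Detokenizing via sentencepiece should reproduce the normalized input.
    assert tokenizer.convert_tokens_to_string(tokens) == text
```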
LysandreJik left a comment
Thank you @PhilipMay, LGTM!
PR for #11646
ToDo
- AlbertTokenizer
- BarthezTokenizer
- BertGenerationTokenizer
- BigBirdTokenizer
- CamembertTokenizer
- DebertaV2Tokenizer
- M2M100Tokenizer
- MarianTokenizer
- MBart50Tokenizer
- PegasusTokenizer
- ReformerTokenizer
- Speech2TextTokenizer
- T5Tokenizer
- XLMProphetNetTokenizer
- XLM RoBERTa
- XLNetTokenizer