Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker left a comment
Cool! Nice splitting, let's simplify a bit when we do update the pre_tokenizer and normalizer + add mock tests!
Again, the last piece that is missing for me here is to update the `__getstate__` that is used when we save the tokenizer, to make sure we reset the `normalizer_spec` (its `add_dummy_prefix` field) to `self.add_prefix_space`.
Looked into it: `normalizer_spec.add_dummy_prefix` is already reset when loading from a saved tokenizer. Checked with "huggyllama/llama-7b", and `normalizer_spec.add_dummy_prefix` was set to True before again being forced to False.
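The reset being discussed can be sketched like this (a minimal, hypothetical stand-in: the real `normalizer_spec` is sentencepiece's NormalizerSpec protobuf, modeled here as a plain dict):

```python
class SlowTokenizerSketch:
    """Toy model of a slow tokenizer. `normalizer_spec` stands in for
    sentencepiece's NormalizerSpec protobuf (here just a dict)."""

    def __init__(self, add_prefix_space=True):
        self.add_prefix_space = add_prefix_space
        # conversion may have forced this off regardless of add_prefix_space
        self.normalizer_spec = {"add_dummy_prefix": False}

    def __getstate__(self):
        # when saving/pickling, re-sync the serialized spec with the
        # user-facing flag so a reloaded tokenizer honours add_prefix_space
        state = self.__dict__.copy()
        state["normalizer_spec"] = dict(
            self.normalizer_spec, add_dummy_prefix=self.add_prefix_space
        )
        return state
```

The live object is left untouched; only the serialized state is re-synced.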
```python
sequence["normalizers"] = [pt for pt in curr_state["normalizers"] if pt["type"] not in ["Prepend"]]
else:
    return
if getattr(self, "legacy", True):
```
If self.legacy, we should not touch anything. What we want to fix is people that don't want the legacy behaviour, but don't need to do the slow conversion / don't have sentencepiece installed!
if self.legacy is True, we just return
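A minimal sketch of that guard (hypothetical helper, not the PR's exact code; an unset `legacy` attribute defaults to True so existing tokenizers are never touched):

```python
def update_normalizer_sketch(tokenizer):
    # legacy behaviour is left completely untouched;
    # a missing `legacy` attribute is treated as True
    if getattr(tokenizer, "legacy", True):
        return "unchanged"
    # ... non-legacy path: rewrite the normalizer sequence here ...
    return "updated"
```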
| """Updates the underlying normalizer with the current `add_prefix_space` and `legacy` settings.""" | ||
| sequence = json.loads(normalizers.Sequence([]).__getstate__()) | ||
| final_sequence = normalizers.Sequence([]) | ||
| if self._tokenizer.normalizer is not None and type(self._tokenizer.normalizer) in ( |
We can use `isinstance(self._tokenizer.normalizer, (normalizers.Prepend, normalizers.Precompiled))` rather than `type`
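The difference matters because an exact `type(...) in (...)` check rejects subclasses, while `isinstance` accepts them. A self-contained illustration with stand-in classes (not the real `normalizers` module):

```python
class Prepend:
    """Stand-in for normalizers.Prepend."""
    pass

class CustomPrepend(Prepend):
    """Hypothetical subclass a user or library might define."""
    pass

norm = CustomPrepend()
print(type(norm) in (Prepend,))   # False: exact-type check misses the subclass
print(isinstance(norm, Prepend))  # True: isinstance follows the class hierarchy
```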
```python
    normalizers.Precompiled,
):
    # If normalizer is not a Sequence, add it to a sequence
    sequence["normalizers"].append(json.loads(self._tokenizer.normalizer.__getstate__().decode("utf-8")))
```
if we have a Prepend normalizer, we want to remove it (of course, everything should be inside the `if legacy is False` branch).
Here again, we only want to update for people who do not want to convert from slow / load the fast tokenizer but legacy is False
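Operating on the JSON state (as the diff above does), dropping the Prepend entry is a list filter. A sketch over a hand-written state dict (the real one comes from `json.loads(normalizer.__getstate__())`):

```python
# hand-written stand-in for the serialized Sequence normalizer state
curr_state = {
    "type": "Sequence",
    "normalizers": [
        {"type": "Prepend", "prepend": "▁"},
        {"type": "Replace", "pattern": {"String": " "}, "content": "▁"},
    ],
}

# keep everything except the Prepend normalizer
curr_state["normalizers"] = [
    n for n in curr_state["normalizers"] if n["type"] != "Prepend"
]
print([n["type"] for n in curr_state["normalizers"]])  # ['Replace']
```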
```python
def _update_pre_tokenizer(self):
    """Updates the underlying pre-tokenizer with the current `add_prefix_space` setting."""

    # 'add_prefix_space' not passed, try to read from normalizer, otherwise do not set.
```
Here we should also first check if we have legacy set to False. If it is True, we return without doing anything. If legacy is set to False, then we can check whether add_prefix_space was passed or not.
If it is not passed, then we:
- remove Prepend if Prepend was in the sequence. I think shipping a native pythonic way of doing this is gonna be a priority for me
- add the Metaspace pre-tokenizer if we had the Prepend normalizer
- set prepend to "first" if we had Metaspace
- set add_prefix_space of the ByteLevel pre-tokenizer to true/false if there was one
- do the same for the decoder (if we update the pre_tokenizer we need to update the decoder as well)
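The steps above can be sketched over plain dicts (hypothetical helper; the real code round-trips through the pre-tokenizer's `__getstate__` JSON, and "first"/"never" are Metaspace `prepend_scheme` values):

```python
def update_pre_tokenizer_state(state, add_prefix_space, legacy):
    """Return an updated pre-tokenizer state dict following the steps above."""
    if legacy:
        return state  # legacy behaviour stays untouched
    if state is None:
        return state
    scheme = "first" if add_prefix_space else "never"
    if state["type"] == "Metaspace":
        # Metaspace carries the prefix-space behaviour in prepend_scheme
        return {**state, "prepend_scheme": scheme}
    if state["type"] == "ByteLevel":
        # ByteLevel exposes add_prefix_space directly
        return {**state, "add_prefix_space": add_prefix_space}
    return state
```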
@ArthurZucker Thank you for the detailed review!
Q.
- "set prepend to 'first' if we had Metaspace": based on convert_slow_tokenizer, prepend is "always" only if `legacy=True`. So here should it be updated to always set to "first"? (if we are not touching legacy)
- "do the same for the decoder (if we update the pre_tokenizer we need to update the decoder as well)": what if there was no pre-tokenizer and a Metaspace one was added? Do we need to add a ByteLevel decoder? (if yes, for which cases: `prepend_scheme="first"` and/or `"never"`?)
- if legacy:
  - do nothing
- else:
  - if there is a Metaspace:
    - set prepend to "first" if add_prefix_space
    - set prepend to "never" if not add_prefix_space
  - if there is a Prepend normalizer:
    - we could check the type; mostly Llama is the one using it, so let's just do a test on the class. I don't expect other classes to need this!
    - replace it with a Metaspace, with either "first" or "never"
    - either replace the decoder, or leave it as is. Most probably we can leave it, since the Llama decoder was already stripping left and replacing. This is the tricky part, since you need to guess what to do for our users. If add_prefix_space, we need to make sure the decoder does not remove it, etc.
    - IMO we can just replace the decoder if it is strip + replace with Metaspace + the good prepend scheme!
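A compact sketch of that decision tree, using stand-in state dicts (hypothetical helper; assumes the Metaspace `prepend_scheme` values "first"/"never" from the tokenizers library):

```python
def choose_prepend_scheme(add_prefix_space):
    # non-legacy mapping: prefix space on -> "first", off -> "never"
    return "first" if add_prefix_space else "never"

def resolve(normalizer, pre_tokenizer, add_prefix_space, legacy):
    """Return (normalizer, pre_tokenizer) after applying the rules above."""
    if legacy:
        return normalizer, pre_tokenizer
    scheme = choose_prepend_scheme(add_prefix_space)
    if pre_tokenizer and pre_tokenizer.get("type") == "Metaspace":
        pre_tokenizer = {**pre_tokenizer, "prepend_scheme": scheme}
    if normalizer and normalizer.get("type") == "Prepend":
        # drop the Prepend normalizer and move its job into a Metaspace
        # pre-tokenizer with the chosen scheme
        normalizer = None
        pre_tokenizer = {"type": "Metaspace", "replacement": "▁", "prepend_scheme": scheme}
    return normalizer, pre_tokenizer
```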
Thank you! Applied in next review!
Notes on the decoder:
- decoder's add_prefix_space is updated if the decoder is ByteLevel, otherwise leave it
- if there is no decoder, I do not add one
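Those notes as a sketch (hypothetical helper over a decoder state dict):

```python
def maybe_update_decoder(decoder_state, add_prefix_space):
    # only ByteLevel decoders carry add_prefix_space; leave anything else
    # alone, and never add a decoder where there was none
    if decoder_state is None or decoder_state.get("type") != "ByteLevel":
        return decoder_state
    return {**decoder_state, "add_prefix_space": add_prefix_space}
```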
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
ArthurZucker left a comment
Thanks for working on this feature!
Let's wait a bit for a simpler tokenizers API. I am planning to introduce access to the normalizer.Sequence's individual elements; this will make this a lot simpler!
#1590 supports:
Fixes portion of #30824
Adds support for add_prefix_space.
TASKS:
- convert_slow_tokenizer.py: `add_prefix_space` is set based on original_tokenizer, otherwise based on the proto's normalizer.
- llama_tokenizer.py: `add_prefix_space` is updated based on the normalizer if unset.
- llama_tokenizer_fast.py: `from_slow` conversion if `add_prefix_space` is not set (see tokenization_utils_fast.py updates on how this is handled without conversion). This allows passing this field without sentencepiece installed.
- tokenization_seamless_m4t.py
- tokenization_t5.py
- tokenization_t5_fast.py
- tokenization_utils_fast.py: `add_prefix_space` without converting from slow, thus allowing use without sentencepiece installed; `add_prefix_space` and the underlying normalizer; legacy logic copied from convert_from_slow.py.
TESTS:
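The user-visible effect of add_prefix_space can be illustrated with a toy Metaspace-style pre-tokenization (pure Python sketch; the real behaviour lives in the tokenizers library):

```python
def toy_metaspace(text, prepend_scheme="first", replacement="▁"):
    # "first": ensure one leading space before replacing spaces;
    # "never": leave the text exactly as given
    if prepend_scheme == "first" and not text.startswith(" "):
        text = " " + text
    return text.replace(" ", replacement)

print(toy_metaspace("Hey", prepend_scheme="first"))  # ▁Hey
print(toy_metaspace("Hey", prepend_scheme="never"))  # Hey
```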
Reviewer:
@ArthurZucker