Deprecate clean_up_tokenization_spaces for BLOOM #20846
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Thanks for your PR! After thinking a little more about it in terms of user experience, I'm happy to have the warning if you think the use case is frequent and the default behavior is misleading. However, I'm not too sure about deprecating/updating the value in v5. I think the current behavior isn't necessarily a bug, as the argument to toggle it is clearly displayed in the docs (and I have no problem with making it more prominent, such as with the warning). Switching the default, however, would be a breaking change. I'd be in favor of adding the warning mentioning how to toggle it in this PR, and of waiting until @sgugger is back so that we have a second opinion on the matter before mentioning that we will change the default.
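The warning being discussed could look roughly like the sketch below. This is purely illustrative, not the PR's actual diff: the standalone `decode` function, the sentinel trick, and the message wording are all assumptions; the point is that the warning only fires when the caller silently relied on the default.

```python
import warnings

_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit True/False

def decode(token_ids, clean_up_tokenization_spaces=_UNSET):
    """Toy stand-in for a tokenizer's decode method (hypothetical sketch)."""
    if clean_up_tokenization_spaces is _UNSET:
        # Only warn when the caller relied on the (potentially lossy) default.
        warnings.warn(
            "clean_up_tokenization_spaces defaults to True, which can make "
            "decoding lossy; pass clean_up_tokenization_spaces=False to keep "
            "the decoded text identical to the original input.",
            UserWarning,
        )
        clean_up_tokenization_spaces = True
    # Real decoding would happen here; return a placeholder string.
    return " ".join(str(t) for t in token_ids)
```

An explicit `clean_up_tokenization_spaces=False` (or `=True`) call stays silent, so the warning never nags users who already made a choice.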
@LysandreJik Sure! This isn't blocking anything really; the real issue is here: huggingface/text-generation-inference#12. IMO, as the tokenizer was built to be lossless, it's weird that by default it isn't. Would it make more sense to move `clean_up_tokenization_spaces` into the tokenizer configuration, so that it can be set on a per-tokenizer basis?
Interesting proposal, WDYT @Narsil?
I think it's ok to move slowly, but touching the global default affects every tokenizer. Personally, I think that's borderline too big to migrate in v5 (it's just a really big change that's unfortunately probably not worth the effort). That being said, making it modifiable on a tokenizer-per-tokenizer basis (so updating BLOOM alone) is still OK, and is definitely a good way forward. Personally I would focus on this user's need first, which would be solved by implementing the change for the BLOOM tokenizer alone.
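Updating BLOOM alone could be as small as overriding the default in the tokenizer subclass. A hypothetical sketch (the class names and toy cleanup below are illustrative stand-ins, not the actual transformers classes):

```python
class PreTrainedTokenizerSketch:
    """Minimal stand-in for the shared tokenizer base class (hypothetical)."""

    def decode(self, token_ids, clean_up_tokenization_spaces=True):
        text = " ".join(str(t) for t in token_ids)
        if clean_up_tokenization_spaces:
            text = text.replace(" ,", ",")  # toy cleanup step
        return text

class BloomTokenizerSketch(PreTrainedTokenizerSketch):
    """Per-tokenizer override: BLOOM is lossless, so default the cleanup off."""

    def decode(self, token_ids, clean_up_tokenization_spaces=False):
        return super().decode(token_ids, clean_up_tokenization_spaces)
```

Only BLOOM's default changes; every other tokenizer keeps the current behavior, and users can still pass the argument explicitly either way.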
sgugger left a comment:
Thanks for the PR. I agree with @LysandreJik on the fact that we would probably not change the default even in v5, so the deprecation warning makes little sense. If it's possible to have it built in the fast tokenizer directly and then update repos on a case-by-case basis, I'd be more in favor of that approach.
Left comments on the docstrings in case this PR moves forward :-)
| """ | ||
| Convert a list of lists of token ids into a list of strings by calling decode. | ||
|
|
||
| Args: | ||
| sequences (`Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]`): | ||
| List of tokenized input ids. Can be obtained using the `__call__` method. | ||
| skip_special_tokens (`bool`, *optional*, defaults to `False`): | ||
| Whether or not to remove special tokens in the decoding. | ||
| clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`): | ||
| Whether or not to clean up the tokenization spaces. | ||
| kwargs (additional keyword arguments, *optional*): | ||
| Will be passed to the underlying model specific decode method. | ||
|
|
||
| Returns: | ||
| `List[str]`: The list of decoded sentences. | ||
| """ |
Remove the docstring, so it uses the docstring of the superclass (and we never have to worry about it getting outdated).
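This works because since Python 3.5, `inspect.getdoc` (which documentation tooling typically relies on) walks the class hierarchy, so an override with no docstring of its own picks up the superclass docstring automatically. A minimal sketch (the class names are borrowed for illustration, not the real transformers classes):

```python
import inspect

class PreTrainedTokenizerBase:
    def decode(self, token_ids):
        """Converts a sequence of ids in a string."""

class BloomTokenizerFast(PreTrainedTokenizerBase):
    def decode(self, token_ids):  # no docstring: falls back to the base class
        return ""

# inspect.getdoc climbs the MRO to find the base class docstring,
# so there is only one copy to keep up to date.
print(inspect.getdoc(BloomTokenizerFast().decode))
```

Dropping the override's docstring therefore removes the risk of the two copies drifting apart.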
| """ | ||
| Converts a sequence of ids in a string, using the tokenizer and vocabulary with options to remove special | ||
| tokens and clean up tokenization spaces. | ||
|
|
||
| Similar to doing `self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))`. | ||
|
|
||
| Args: | ||
| token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`): | ||
| List of tokenized input ids. Can be obtained using the `__call__` method. | ||
| skip_special_tokens (`bool`, *optional*, defaults to `False`): | ||
| Whether or not to remove special tokens in the decoding. | ||
| clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`): | ||
| Whether or not to clean up the tokenization spaces. | ||
| kwargs (additional keyword arguments, *optional*): | ||
| Will be passed to the underlying model specific decode method. | ||
|
|
||
| Returns: | ||
| `str`: The decoded sentence. | ||
| """ |
Okay, so in terms of actions:
Would that make sense?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
Currently in `transformers`, `clean_up_tokenization_spaces` defaults to `True` during decoding. In order to prevent issues such as this: https://huggingface.co/bigscience/bloom/discussions/153#6397907b71eb2455d898e0a4, we suggest adding a warning advising users to use `clean_up_tokenization_spaces=False` instead. As the BLOOM tokenizer was developed to be a lossless encoding mechanism, it would make sense to always remove that option IMO; therefore I'm suggesting to deprecate that option from the BLOOM tokenizer. The other option would be to change the default to `True`.
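The behavior this PR is after can be illustrated with a toy tokenizer (everything below is an illustrative sketch, not the actual BLOOM tokenizer): with the cleanup enabled, encode/decode is not a round trip; with `clean_up_tokenization_spaces=False`, it is.

```python
class ToyLosslessTokenizer:
    """Character-level toy tokenizer: encoding is trivially lossless,
    so any lossiness on decode comes from the cleanup step alone."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, token_ids, clean_up_tokenization_spaces=True):
        text = "".join(chr(i) for i in token_ids)
        if clean_up_tokenization_spaces:
            # stand-in for the real cleanup: strip spaces before punctuation
            for before, after in ((" .", "."), (" ,", ",")):
                text = text.replace(before, after)
        return text

tok = ToyLosslessTokenizer()
src = "lossless , right ?"
print(tok.decode(tok.encode(src)))                                      # cleaned up, != src
print(tok.decode(tok.encode(src), clean_up_tokenization_spaces=False))  # == src
```

With the cleanup on, the lossless encoding guarantee is silently broken in the decode step, which is exactly the mismatch the warning is meant to surface.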