
Deprecate clean_up_tokenization_spaces for BLOOM#20846

Closed
thomasw21 wants to merge 6 commits into main from thomas/deprecate_clean_tokenization_spaces_for_bloom

Conversation

@thomasw21
Contributor

What does this PR do?

Currently in transformers:

>>> tok.decode(tok.encode("Hello , there"))
'Hello, there' # notice the missing space between "Hello" and ","
>>> tok.decode(tok.encode("Hello , there"), clean_up_tokenization_spaces=False)
'Hello , there'

In order to prevent issues such as this one: https://huggingface.co/bigscience/bloom/discussions/153#6397907b71eb2455d898e0a4 we suggest adding a warning that points users to clean_up_tokenization_spaces=False instead.

As the BLOOM tokenizer was developed to be a lossless encoding mechanism, IMO it makes sense to always remove that option, so I'm suggesting we deprecate it for the BLOOM tokenizer. The other option would be to change the default to False.
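For context, the cleanup in question is a small string heuristic applied after decoding; roughly (a paraphrased sketch of what `clean_up_tokenization_spaces=True` does, not the exact transformers source):

```python
def clean_up_tokenization(out_string: str) -> str:
    """Sketch of the cleanup heuristic: undo spaces inserted around
    punctuation and English contractions. This is why the space in
    "Hello , there" is lost on decode."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("Hello , there"))  # Hello, there
```

For a byte-level tokenizer like BLOOM's, whose round-trip is already exact, these replacements only destroy information.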

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@thomasw21 thomasw21 marked this pull request as ready for review December 20, 2022 11:52
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for this!

thomasw21 and others added 2 commits December 20, 2022 15:36
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@LysandreJik
Member

Thanks for your PR! After thinking a little more about it and in terms of user experience, I'm happy to have the warning if you think the use-case is frequent and the default behavior is misleading.

However, I'm not too sure about deprecating/updating the value in v5. I think the current behavior isn't necessarily a bug, as the argument to toggle is clearly displayed in the docs (and I have no problem with making it more prominent, such as with the warning). Switching to False means that we'll start diverging between BLOOM and other tokenizers (like GPT-2) which work very similarly as of now.

I'd be in favor of adding the warning mentioning to toggle it in this PR, and to wait until @sgugger is back so that we have a second opinion on the matter before mentioning that we will move it to False by default. Would that be ok for you @thomasw21?

@thomasw21
Contributor Author

thomasw21 commented Dec 20, 2022

@LysandreJik Sure! This isn't really blocking anything; the real issue is here: huggingface/text-generation-inference#12

IMO, as the tokenizer was built to be lossless, it's weird that by default it isn't. Would it make more sense to move clean_up_tokenization_spaces into the tokenizer itself instead? Something like a special decoder? https://huggingface.co/docs/tokenizers/components#decoders . I understand that this is breaking, but we should be able to migrate slowly to the newer setup using deprecation cycles?
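To illustrate the shape of that proposal (a purely hypothetical sketch, not the actual `tokenizers` decoders API): cleanup would become a property of a decoder-style component configured once per tokenizer, instead of a flag passed on every `decode` call:

```python
from dataclasses import dataclass


@dataclass
class CleanUpDecoder:
    """Hypothetical decoder component: joins token strings and
    optionally applies the space-cleanup heuristic."""
    clean_up: bool = True

    def decode(self, tokens: list[str]) -> str:
        text = "".join(tokens)
        if self.clean_up:
            # Minimal stand-in for the cleanup heuristic.
            text = text.replace(" ,", ",").replace(" .", ".")
        return text


# A lossless tokenizer like BLOOM's would ship with clean_up=False,
# so decode(encode(x)) == x without callers passing extra flags:
lossless = CleanUpDecoder(clean_up=False)
print(lossless.decode(["Hello", " ,", " there"]))  # Hello , there
```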

@LysandreJik
Member

Interesting proposal, WDYT @Narsil?

@Narsil
Contributor

Narsil commented Dec 20, 2022

I think it's OK to move slowly, but touching clean_up_tokenization_spaces and its default are BIG changes.

Personally, I think it's borderline too big to migrate even in v5 (it's just a really big change that's unfortunately probably not worth the effort).

That being said, making it modifiable on a tokenizer-per-tokenizer basis (so updating BLOOM alone) is still OK, and is definitely a good way forward.

Personally, I would focus on this user's need first, which would be solved by implementing return_full_text=False; it seems the lowest-hanging fruit for solving the user's need. We can move forward on the "decoder" (or any other type of config change) later.
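For reference, the idea behind return_full_text=False is that only the newly generated continuation is returned, so the prompt never goes through a decode/cleanup round-trip at all (a minimal sketch of the intended behavior; the parameter name follows the transformers text-generation pipeline):

```python
def postprocess(prompt: str, full_output: str, return_full_text: bool = True) -> str:
    """Sketch: with return_full_text=False, strip the prompt and
    return only the generated continuation, so any lossy cleanup
    applied while decoding cannot corrupt the user's prompt."""
    if return_full_text:
        return full_output
    return full_output[len(prompt):]


out = postprocess("Hello , there", "Hello , there general Kenobi", return_full_text=False)
print(out)  #  general Kenobi
```

Real implementations typically slice at the token-id level rather than the string level, but the user-visible contract is the same.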

Collaborator

@sgugger sgugger left a comment


Thanks for the PR. I agree with @LysandreJik on the fact that we would probably not change the default even in v5, so the deprecation warning makes little sense. If it's possible to have it built in the fast tokenizer directly and then update repos on a case-by-case basis, I'd be more in favor of that approach.

Left comments on the docstrings in case this PR moves forward :-)

Comment on lines +176 to +191
"""
Convert a list of lists of token ids into a list of strings by calling decode.

Args:
sequences (`Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.

Returns:
`List[str]`: The list of decoded sentences.
"""
Collaborator


Remove the docstring, so it uses the docstring of the superclass (and we never have to worry about it getting outdated).

Comment on lines +213 to +231
"""
Converts a sequence of ids in a string, using the tokenizer and vocabulary with options to remove special
tokens and clean up tokenization spaces.

Similar to doing `self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))`.

Args:
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.

Returns:
`str`: The decoded sentence.
"""
Collaborator


Same here.

@thomasw21
Contributor Author

thomasw21 commented Dec 21, 2022

Okay so in terms of actions:

Would that make sense?

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
