
Deprecate clean_up_tokenization_spaces for BLOOM#20846

Closed
thomasw21 wants to merge 6 commits into main from thomas/deprecate_clean_tokenization_spaces_for_bloom

Conversation

@thomasw21
Contributor

What does this PR do?

Currently in transformers:

>>> tok.decode(tok.encode("Hello , there"))
'Hello, there' # notice the missing space between "Hello" and ","
>>> tok.decode(tok.encode("Hello , there"), clean_up_tokenization_spaces=False)
'Hello , there'

In order to prevent issues such as this one: https://huggingface.co/bigscience/bloom/discussions/153#6397907b71eb2455d898e0a4 we suggest adding a warning that points users to clean_up_tokenization_spaces=False instead.

As the BLOOM tokenizer was developed to be a lossless encoding mechanism, IMO it makes sense to always remove that option, so I'm suggesting we deprecate it for the BLOOM tokenizer. The other option would be to change the default to False.
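For context, the cleanup in question is a small string heuristic applied after decoding; roughly (a paraphrased sketch of what `clean_up_tokenization_spaces=True` does, not the exact transformers source):

```python
def clean_up_tokenization(out_string: str) -> str:
    """Sketch of the cleanup heuristic: undo spaces inserted around
    punctuation and English contractions. This is why the space in
    "Hello , there" is lost on decode."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("Hello , there"))  # Hello, there
```

For a byte-level tokenizer like BLOOM's, whose round-trip is already exact, these replacements only destroy information.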

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@thomasw21 thomasw21 marked this pull request as ready for review December 20, 2022 11:52
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for this!

thomasw21 and others added 2 commits December 20, 2022 15:36
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@LysandreJik
Member

Thanks for your PR! After thinking a little more about it and in terms of user experience, I'm happy to have the warning if you think the use-case is frequent and the default behavior is misleading.

However, I'm not too sure about deprecating/updating the value in v5. I think the current behavior isn't necessarily a bug, as the argument to toggle is clearly displayed in the docs (and I have no problem with making it more prominent, such as with the warning). Switching to False means that we'll start diverging between BLOOM and other tokenizers (like GPT-2) which work very similarly as of now.

I'd be in favor of adding the warning mentioning to toggle it in this PR, and to wait until @sgugger is back so that we have a second opinion on the matter before mentioning that we will move it to False by default. Would that be ok for you @thomasw21?

@thomasw21
Contributor Author

thomasw21 commented Dec 20, 2022

@LysandreJik Sure! This isn't really blocking anything; the real issue is here: huggingface/text-generation-inference#12

IMO, as the tokenizer was built to be lossless, it's weird that by default it isn't. Would it make more sense to move clean_up_tokenization_spaces into the tokenizer itself instead? Something like a special decoder? https://huggingface.co/docs/tokenizers/components#decoders . I understand that this is breaking, but we should be able to migrate slowly to the newer setup using deprecation cycles?
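To illustrate the shape of that proposal (a purely hypothetical sketch, not the actual `tokenizers` decoders API): cleanup would become a property of a decoder-style component configured once per tokenizer, instead of a flag passed on every `decode` call:

```python
from dataclasses import dataclass


@dataclass
class CleanUpDecoder:
    """Hypothetical decoder component: joins token strings and
    optionally applies the space-cleanup heuristic."""
    clean_up: bool = True

    def decode(self, tokens: list[str]) -> str:
        text = "".join(tokens)
        if self.clean_up:
            # Minimal stand-in for the cleanup heuristic.
            text = text.replace(" ,", ",").replace(" .", ".")
        return text


# A lossless tokenizer like BLOOM's would ship with clean_up=False,
# so decode(encode(x)) == x without callers passing extra flags:
lossless = CleanUpDecoder(clean_up=False)
print(lossless.decode(["Hello", " ,", " there"]))  # Hello , there
```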

@LysandreJik
Member

Interesting proposal, WDYT @Narsil?

@Narsil
Contributor

Narsil commented Dec 20, 2022

I think it's OK to move slowly, but touching clean_up_tokenization_spaces and its default are BIG changes.

Personally, I think it's borderline too big to migrate even in v5 (it's just a really big change that's unfortunately probably not worth the effort).

That being said, making it modifiable on a tokenizer-per-tokenizer basis (so updating BLOOM alone) is still OK, and is definitely a good way forward.

Personally, I would focus on this user's need first, which would be solved by implementing return_full_text=False; it seems the lowest-hanging fruit for solving the user's need. We can move forward on the "decoder" (or any other type of config change) later.
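For reference, the idea behind return_full_text=False is that only the newly generated continuation is returned, so the prompt never goes through a decode/cleanup round-trip at all (a minimal sketch of the intended behavior; the parameter name follows the transformers text-generation pipeline):

```python
def postprocess(prompt: str, full_output: str, return_full_text: bool = True) -> str:
    """Sketch: with return_full_text=False, strip the prompt and
    return only the generated continuation, so any lossy cleanup
    applied while decoding cannot corrupt the user's prompt."""
    if return_full_text:
        return full_output
    return full_output[len(prompt):]


out = postprocess("Hello , there", "Hello , there general Kenobi", return_full_text=False)
print(out)  #  general Kenobi
```

Real implementations typically slice at the token-id level rather than the string level, but the user-visible contract is the same.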

Collaborator

@sgugger sgugger left a comment


Thanks for the PR. I agree with @LysandreJik on the fact that we would probably not change the default even in v5, so the deprecation warning makes little sense. If it's possible to have it built in the fast tokenizer directly and then update repos on a case-by-case basis, I'd be more in favor of that approach.

Left comments on the docstrings in case this PR moves forward :-)

Comment on lines +176 to +191
"""
Convert a list of lists of token ids into a list of strings by calling decode.

Args:
sequences (`Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.

Returns:
`List[str]`: The list of decoded sentences.
"""
Collaborator


Remove the docstring, so it uses the docstring of the superclass (and we never have to worry about it getting outdated).

Comment on lines +213 to +231
"""
Converts a sequence of ids in a string, using the tokenizer and vocabulary with options to remove special
tokens and clean up tokenization spaces.

Similar to doing `self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))`.

Args:
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.

Returns:
`str`: The decoded sentence.
"""
Collaborator


Same here.

@thomasw21
Contributor Author

thomasw21 commented Dec 21, 2022

Okay so in terms of actions:

Would that make sense?

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
