Skip to content

Add clean_up_tokenization_spaces to config#22341

Merged
ArthurZucker merged 12 commits intohuggingface:mainfrom
ArthurZucker:fix-cleanup-tokenization
Mar 29, 2023
Merged

Add clean_up_tokenization_spaces to config#22341
ArthurZucker merged 12 commits intohuggingface:mainfrom
ArthurZucker:fix-cleanup-tokenization

Conversation

@ArthurZucker
Copy link
Copy Markdown
Collaborator

What does this PR do?

DRAFT

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

HuggingFaceDocBuilderDev commented Mar 23, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker
Copy link
Copy Markdown
Collaborator Author

cc @Narsil

@ArthurZucker
Copy link
Copy Markdown
Collaborator Author

ArthurZucker commented Mar 24, 2023

Also linked to #20846. A follow up PR can now be made to add a simple warning if the default value is set to True / put the default value to True

@ArthurZucker ArthurZucker marked this pull request as ready for review March 24, 2023 10:57
@ArthurZucker ArthurZucker requested review from Narsil and sgugger March 24, 2023 10:57
@ArthurZucker
Copy link
Copy Markdown
Collaborator Author

It seems like no model defaulted to use cleanup_tokenization_spaces = False so this should be seamless. Otherwise the tokenizer's __init__ should be updated

Copy link
Copy Markdown
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! LGTM with the change in the doc suggested and the additional test suggested by Narsil.

Please make sure to fill the description before merging.

Copy link
Copy Markdown
Contributor

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice readable test !

ArthurZucker and others added 4 commits March 28, 2023 12:14
Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>
Copy link
Copy Markdown
Contributor

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ArthurZucker ArthurZucker merged commit 8d9c383 into huggingface:main Mar 29, 2023
raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023
* add draft changes

* fix failing wav2vec

* style

* make sure that the argument is saved + add tests

* style

* fixup

* update test

* default clean_up_tokenization_spaces to False for Bloom and Llama

* Update code based on review

Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>

* style

* quality

---------

Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>
xloem pushed a commit to xloem/transformers that referenced this pull request Apr 10, 2023
* add draft changes

* fix failing wav2vec

* style

* make sure that the argument is saved + add tests

* style

* fixup

* update test

* default clean_up_tokenization_spaces to False for Bloom and Llama

* Update code based on review

Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>

* style

* quality

---------

Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* add draft changes

* fix failing wav2vec

* style

* make sure that the argument is saved + add tests

* style

* fixup

* update test

* default clean_up_tokenization_spaces to False for Bloom and Llama

* Update code based on review

Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>

* style

* quality

---------

Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants