Fix SPM conversions #686

Merged
n1t0 merged 2 commits into master from fix-spm-conversion
May 20, 2021

Conversation

@LysandreJik
Member

This PR fixes an issue with the SPM converters (ALBERT and XLNet), which replaced some characters with whitespace after double-whitespace occurrences had already been removed. As a result, any double whitespace introduced by that replacement was kept until the end of encoding, leading to a mismatch between SentencePiece and tokenizers.
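To illustrate the ordering problem, here is a minimal sketch in plain Python. It is not the converter code itself: `CHAR_MAP`, `map_chars`, `collapse_spaces`, and both `normalize_*` functions are hypothetical names, and the tab-to-space mapping is only an assumed stand-in for whatever characters the real converters replace with whitespace.

```python
import re

# Hypothetical stand-in for the character replacement step. The real mapping
# comes from the SentencePiece model and is not reproduced here.
CHAR_MAP = {"\t": " "}


def map_chars(text: str) -> str:
    """Replace each mapped character with its substitute (here, a space)."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def collapse_spaces(text: str) -> str:
    """Collapse runs of two or more spaces into a single space."""
    return re.sub(r" {2,}", " ", text)


def normalize_before_fix(text: str) -> str:
    # Collapsing first: spaces created afterwards by map_chars are never collapsed.
    return map_chars(collapse_spaces(text))


def normalize_after_fix(text: str) -> str:
    # Collapsing last: spaces introduced by map_chars are collapsed as well.
    return collapse_spaces(map_chars(text))


sample = "hello \tworld"
print(repr(normalize_before_fix(sample)))  # two spaces between the words
print(repr(normalize_after_fix(sample)))   # a single space between the words
```

With the old ordering the double space survives normalization, which is the kind of difference that would show up as a mismatch against SentencePiece.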

Fixes huggingface/transformers#11358

@LysandreJik requested a review from n1t0 on April 21, 2021 22:12
Contributor

@n1t0 left a comment

Thank you very much for this @LysandreJik!

@n1t0 force-pushed the fix-spm-conversion branch from c2aafa6 to 519c052 on May 20, 2021 13:37
@n1t0 merged commit 4b0dc6b into master on May 20, 2021
@n1t0 deleted the fix-spm-conversion branch on May 20, 2021 13:55

Development

Successfully merging this pull request may close these issues.

Different results between AlbertTokenizer and AlbertTokenizerFast modules with a new spiece.model file
