[Wav2Vec2] Fix tokenizer set lang #26349
Conversation
The documentation is not available anymore as the PR was closed or merged.
My space suddenly had a runtime error because of this.

Sorry to hear that @andergisomon - note that running
ArthurZucker
left a comment
Thanks, I'd rather remove the addition of these tokens to unique_no_split_tokens as they won't be used (#26322 was on its way).
The issue is that adding them to self.unique_no_split_tokens just does nothing. A fix that keeps the previous behaviour is to call self.add_tokens(self.encoder.keys()) instead. It's a bit inefficient, but it will make sure they are not split!
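To illustrate the point above, here is a toy sketch (not the actual transformers implementation) of why appending to unique_no_split_tokens became a no-op: the splitter only consults tokens registered through add_tokens(), so the old attribute is dead weight. The ToyTokenizer class, its _no_split set, and the example vocab are all hypothetical.

```python
class ToyTokenizer:
    """Toy model of the post-#23909 situation, for illustration only."""

    def __init__(self, encoder):
        self.encoder = encoder            # token -> id, like the Wav2Vec2 vocab
        self.unique_no_split_tokens = []  # dead attribute: nothing reads it anymore
        self._no_split = set()            # what the splitter actually consults

    def add_tokens(self, tokens):
        # Registers tokens so the splitter will not break them apart.
        for token in tokens:
            self._no_split.add(token)

    def tokenize(self, text):
        # Greedy longest-match over registered no-split tokens, else characters.
        out, i = [], 0
        while i < len(text):
            match = next(
                (t for t in sorted(self._no_split, key=len, reverse=True)
                 if text.startswith(t, i)),
                None,
            )
            if match:
                out.append(match)
                i += len(match)
            else:
                out.append(text[i])
                i += 1
        return out


tok = ToyTokenizer({"<s>": 0, "a": 1, "b": 2})

# Appending to the dead attribute does nothing:
tok.unique_no_split_tokens.append("<s>")
print(tok.tokenize("<s>ab"))  # the special token is split into characters

# Registering through add_tokens() protects it, as suggested above:
tok.add_tokens(tok.encoder.keys())
print(tok.tokenize("<s>ab"))  # ['<s>', 'a', 'b']
```

This mirrors why self.add_tokens(self.encoder.keys()) restores the old behaviour while touching unique_no_split_tokens does not.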
Thanks for the review! Applied the suggestions from your comment - look good to you?
ArthurZucker
left a comment
Let's simplify a bit and good to go
add_tokens loops over the tokens, so I'd recommend adding them using:
```diff
-for token in self.encoder.keys():
-    if len(token) > 1:
-        self.unique_no_split_tokens.append(token)
-        self.add_tokens(token)
+self.add_tokens(self.encoder.keys())
```
as an internal check makes sure len(token) > 1.
There was a problem hiding this comment.
But we need lstrip=rstrip=True here, so AddedToken(lstrip=True, rstrip=True) has to be used as we had before. If that's ok with you, I'll merge.
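For readers unfamiliar with these flags, here is a rough sketch (not the actual tokenizers implementation) of what lstrip=True / rstrip=True mean for an added token: whitespace adjacent to the token is absorbed by the match instead of being left behind as stray pieces. The helper function and the <lang> token below are hypothetical, used only to demonstrate the semantics.

```python
import re

def split_on_added_token(text, token, lstrip=False, rstrip=False):
    # Build a pattern that optionally swallows whitespace around the token.
    pattern = re.escape(token)
    if lstrip:
        pattern = r"\s*" + pattern
    if rstrip:
        pattern = pattern + r"\s*"
    pieces, last = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > last:
            pieces.append(text[last:m.start()])
        pieces.append(token)  # emit the bare token; stripped space is dropped
        last = m.end()
    if last < len(text):
        pieces.append(text[last:])
    return pieces

print(split_on_added_token("hello <lang> world", "<lang>"))
# without the flags, whitespace survives around the token
print(split_on_added_token("hello <lang> world", "<lang>", lstrip=True, rstrip=True))
# with both flags, the surrounding whitespace is absorbed by the match
```

This is why dropping the AddedToken(lstrip=True, rstrip=True) wrapping would change tokenization around the special tokens.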
Force-pushed from dbbdf3f to 7d62f6e
What does this PR do?
The PR #23909 removed unique_no_split_tokens as an attribute of the Wav2Vec2 tokenizer; however, it is required to set the language in the method set_target_lang:
transformers/src/transformers/models/wav2vec2/tokenization_wav2vec2.py
Line 237 in dcbfd93
Thus, calling .set_target_lang currently throws an error:
Output:
This PR re-instates unique_no_split_tokens as an attribute of the tokenizer. Note that this should already be tested for in the following test:
transformers/tests/models/wav2vec2/test_tokenization_wav2vec2.py
Line 798 in dcbfd93