Conversation
|
The documentation is not available anymore as the PR was closed or merged. |
|
cc @Narsil for visibility! |
…to fix-llama-tokenizer
|
This will need to wait for #22341 |
|
Yes, on it! |
|
Will finish this tomorrow! |
|
Hi! Does this PR fix the decoding part of the tokenizer? It seems like it always prefixes the output with a space. For instance,
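The leading space described here is typical of SentencePiece-style tokenizers, which mark word boundaries with the "▁" metasymbol. A minimal sketch of why naive decoding produces a prefix space (hypothetical tokens, not the actual Llama vocabulary or the transformers implementation):

```python
def naive_sp_decode(tokens):
    # SentencePiece marks the start of each word with the "▁" metasymbol.
    # Naively replacing every "▁" with a space leaves a leading space,
    # because the very first word carries the marker too.
    return "".join(tokens).replace("\u2581", " ")

tokens = ["\u2581Hey", "\u2581how", "\u2581are", "\u2581you"]
print(repr(naive_sp_decode(tokens)))  # ' Hey how are you'
```

A decoder that strips the marker from the first token (or left-strips the joined result) avoids the prefix space.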
Waiting for huggingface/transformers#22402 to fix llama tokenizer
|
Yes, it does: |
tokenizer = BertTokenizer.from_pretrained(tmp_dir_2)

assert tokenizer_fast.clean_up_tokenization_spaces is False
assert tokenizer.clean_up_tokenization_spaces is False
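These assertions check that the flag survives a save/load round trip; the flag itself is persisted in the tokenizer's JSON config. A minimal sketch of that round trip using plain `json` (a stand-in for illustration, not the actual `save_pretrained`/`from_pretrained` machinery):

```python
import json
import os
import tempfile

def round_trip(config):
    # Write the config to a tokenizer_config.json-style file and read it
    # back, mimicking how a setting survives save/load.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "tokenizer_config.json")
        with open(path, "w") as f:
            json.dump(config, f)
        with open(path) as f:
            return json.load(f)

reloaded = round_trip({"clean_up_tokenization_spaces": False})
print(reloaded["clean_up_tokenization_spaces"])  # False
```

Because JSON distinguishes `false` from a missing key, the assertions above can meaningfully test `is False` after reloading.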
this is such a small nit that I included it 😅
After rebasing, this test fails for me :( just reproduced on main:
> assert decoded == "[CLS] this shouldn ' t be! he ' ll go. [SEP]"
E assert "[CLS] this s...'ll go. [SEP]" == "[CLS] this s... ll go. [SEP]"
E - [CLS] this shouldn ' t be! he ' ll go. [SEP]
E ? - - - -
E + [CLS] this shouldn't be! he'll go. [SEP]
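The diff above is exactly the transformation that tokenization-space cleanup applies: when enabled, spaces around punctuation and English contractions are collapsed on decode. A simplified re-implementation for illustration (not the library's exact code):

```python
def clean_up(out_string):
    # Collapse spaces around punctuation and common English contractions,
    # mirroring the kind of cleanup controlled by the
    # clean_up_tokenization_spaces flag.
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up("[CLS] this shouldn ' t be! he ' ll go. [SEP]"))
# [CLS] this shouldn't be! he'll go. [SEP]
```

So a test expecting the spaced form (`shouldn ' t`) will fail whenever cleanup is still being applied, which is what the assertion error shows.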
This is not pointing at the correct part of the test. If `clean_up_tokenization_spaces` is indeed False, the failure can happen for caching reasons or something else (it also failed for me at some point).
Will check again
sgugger
left a comment
Nice, thanks for all the fixes and for adding the tests!
* draft * update tokenization llama and conversion script * more updates * initial commit * style * default pad to None * draft tokenization tests * update test * update tokenization tests * nits * update * versioning test * major fix * fix more tests * finish fixing special masks * last nit * more nits * add encode decode tests * add more * fix token type ids * style
What does this PR do?
Draft but: