Tokenization differs between transformers versions.
Below are the outputs of a toy example that I ran in two separate virtual environments. The only intended difference was the transformers version (5.7.0 vs 4.57.6); note that the torch builds also differ (2.11.0+cu130 vs 2.10.0+cu128):
transformers: 5.7.0
torch : 2.11.0+cu130
sentencepiece: 0.2.1
Token IDs: [[1, 233, 59, 79, 80, 90, 233, 80, 90, 233, 72, 233, 91, 76, 95, 91, 233, 76, 95, 72, 84, 87, 83, 76, 19, 233, 64, 86, 92, 233, 74, 86, 92, 83, 75, 233, 79, 72, 93, 76, 233, 94, 89, 80, 91, 91, 76, 85, 233, 72, 85, 96, 91, 79, 80, 85, 78, 233, 76, 83, 90, 76, 233, 80, 77, 233, 85, 76, 74, 76, 90, 90, 72, 89, 96, 233, 8, 2]]
Tokens : ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a', 'Ġ', 't', 'e', 'x', 't', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', ',', 'Ġ', 'Y', 'o', 'u', 'Ġ', 'c', 'o', 'u', 'l', 'd', 'Ġ', 'h', 'a', 'v', 'e', 'Ġ', 'w', 'r', 'i', 't', 't', 'e', 'n', 'Ġ', 'a', 'n', 'y', 't', 'h', 'i', 'n', 'g', 'Ġ', 'e', 'l', 's', 'e', 'Ġ', 'i', 'f', 'Ġ', 'n', 'e', 'c', 'e', 's', 's', 'a', 'r', 'y', 'Ġ', '!', '[SEP]']
transformers: 4.57.6
torch : 2.10.0+cu128
sentencepiece: 0.2.1
Token IDs: [[1, 13711, 5806, 72, 15532, 28108, 19, 9619, 26650, 13592, 94, 5152, 4954, 22118, 5259, 4985, 10723, 4766, 16562, 25007, 8346, 8, 2]]
Tokens : ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']
Expected: the tokenization produced with 4.57.6 should also appear with transformers 5.X:
['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']
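The difference between the two outputs can be checked mechanically: with 5.7.0 every non-special token is a single character (with 'Ġ' marking spaces), while with 4.57.6 the tokens are word pieces. Below is a minimal sketch of such a check. The helper name `looks_char_level` and the `SPECIAL_TOKENS` set are hypothetical and are not part of transformers. The token lists are prefixes copied from the outputs above.

```python
# Hypothetical helper, not part of transformers: flags a token sequence that
# has degenerated into one token per character.
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def looks_char_level(tokens, space_marker="Ġ"):
    """Return True if every non-special, non-space token is a single character."""
    pieces = [
        t for t in tokens
        if t not in SPECIAL_TOKENS and t != space_marker
    ]
    # An empty sequence is not treated as character-level.
    return bool(pieces) and all(len(t) == 1 for t in pieces)


# Prefixes of the token lists reported above for each version.
tokens_v5 = ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a']
tokens_v4 = ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',']

print("5.7.0 char-level:", looks_char_level(tokens_v5))
print("4.57.6 char-level:", looks_char_level(tokens_v4))
```

A check like this could serve as a quick regression guard when comparing environments: pass it the output of `convert_ids_to_tokens` from each venv and compare the results.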
Who can help?
@ArthurZucker @Cyrilvallez