transformers version changes the tokenization

### System Info

```
Python: 3.12.3
transformers: 5.7.0
torch       : 2.11.0+cu130
sentencepiece: 0.2.1
```

________________

- **Platform:** Linux-6.17.0-22-generic-x86_64-with-glibc2.39
- **Python version:** 3.12.3
- **Huggingface_hub version:** 0.36.2
- **Safetensors version:** 0.7.0
- **Accelerate version:** 1.12.0
- **Accelerate config:** not found
- **DeepSpeed version:** not installed
- **Tensorflow version (GPU?):** not installed (NA)
- **Flax version (CPU?/GPU?/TPU?):** not installed (NA)
- **Jax version:** not installed
- **JaxLib version:** not installed
- **Using distributed or parallel set-up in script?:** `<fill in>`
- **Using GPU in script?:** `<fill in>`
- **GPU type:** NVIDIA GeForce RTX 5060 Laptop GPU

### Who can help?

@ArthurZucker @Cyrilvallez

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction

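The same tokenizer produces different tokens depending on the installed transformers version.

- **In 4.X:** The text is split into words and `##`-prefixed subword pieces, as expected.
- **In 5.X:** The text is split into single characters, with `Ġ` marking each space.

Below is the minimal script I ran in two separate virtual environments. The intended difference between them was the transformers version (`5.7.0` vs `4.57.6`); note that the torch builds reported in the outputs also differ (`2.11.0+cu130` vs `2.10.0+cu128`):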

```python
import sentencepiece
import torch
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Report library versions so runs in different environments can be compared.
print("transformers:", transformers.__version__)
print("torch       :", torch.__version__)
print("sentencepiece:", sentencepiece.__version__)

MODEL_DIR = "almanach/camembertv2-base"

# The model is loaded as in my original script; only the tokenizer is used below.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

text = "This is a text example, You could have written anything else if necessary !"
ids = tokenizer(text, return_tensors="pt")["input_ids"]
print("Token IDs:", ids.tolist())
print("Tokens   :", tokenizer.convert_ids_to_tokens(ids[0]))
```

#### Outputs with transformers 5.7.0

```
transformers: 5.7.0
torch       : 2.11.0+cu130
sentencepiece: 0.2.1
Token IDs: [[1, 233, 59, 79, 80, 90, 233, 80, 90, 233, 72, 233, 91, 76, 95, 91, 233, 76, 95, 72, 84, 87, 83, 76, 19, 233, 64, 86, 92, 233, 74, 86, 92, 83, 75, 233, 79, 72, 93, 76, 233, 94, 89, 80, 91, 91, 76, 85, 233, 72, 85, 96, 91, 79, 80, 85, 78, 233, 76, 83, 90, 76, 233, 80, 77, 233, 85, 76, 74, 76, 90, 90, 72, 89, 96, 233, 8, 2]]
Tokens   : ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a', 'Ġ', 't', 'e', 'x', 't', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', ',', 'Ġ', 'Y', 'o', 'u', 'Ġ', 'c', 'o', 'u', 'l', 'd', 'Ġ', 'h', 'a', 'v', 'e', 'Ġ', 'w', 'r', 'i', 't', 't', 'e', 'n', 'Ġ', 'a', 'n', 'y', 't', 'h', 'i', 'n', 'g', 'Ġ', 'e', 'l', 's', 'e', 'Ġ', 'i', 'f', 'Ġ', 'n', 'e', 'c', 'e', 's', 's', 'a', 'r', 'y', 'Ġ', '!', '[SEP]']
```

#### Outputs with transformers 4.57.6

```
transformers: 4.57.6
torch       : 2.10.0+cu128
sentencepiece: 0.2.1
Token IDs: [[1, 13711, 5806, 72, 15532, 28108, 19, 9619, 26650, 13592, 94, 5152, 4954, 22118, 5259, 4985, 10723, 4766, 16562, 25007, 8346, 8, 2]]
Tokens   : ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']
```
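
To put a number on the difference between the two outputs, here is a small sketch that measures how many non-special tokens are a single character long. The helper name `single_char_ratio` and the choice to skip `[CLS]` and `[SEP]` are my own assumptions for illustration, not part of transformers. The 5.7.0 list is abridged to the first two words of the output above.

```python
def single_char_ratio(tokens, special_tokens=("[CLS]", "[SEP]")):
    """Fraction of non-special tokens that are exactly one character long."""
    content = [t for t in tokens if t not in special_tokens]
    if not content:
        return 0.0
    return sum(len(t) == 1 for t in content) / len(content)


# Abridged from the 5.7.0 output above (first two words only).
tokens_v5 = ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', '[SEP]']

# Full token list from the 4.57.6 output above.
tokens_v4 = ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could',
             'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se',
             'if', 'necess', '##ary', '!', '[SEP]']

print("5.7.0 :", single_char_ratio(tokens_v5))
print("4.57.6:", single_char_ratio(tokens_v4))
```

In the abridged 5.7.0 list every content token is a single character, while in the 4.57.6 list only `a`, `,`, `w` and `!` are (4 of 21).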

### Expected behavior

With transformers 5.X, the tokenizer should produce the same tokens as with 4.57.6:

```
['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']
```
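
For anyone who wants to turn this expectation into a quick check, here is a rough sketch that reports the first position where two token lists disagree. `first_divergence` is a hypothetical helper name of mine, and both lists are abridged to the first five tokens of the outputs reported in this issue.

```python
from itertools import zip_longest


def first_divergence(expected, actual):
    """Return (index, expected_token, actual_token) at the first mismatch, or None."""
    for i, (e, a) in enumerate(zip_longest(expected, actual)):
        if e != a:
            return i, e, a
    return None


# First five tokens of the 4.57.6 output (the expected behavior).
expected = ['[CLS]', 'This', 'is', 'a', 'text']
# First five tokens of the 5.7.0 output.
actual_v5 = ['[CLS]', 'Ġ', 'T', 'h', 'i']

print(first_divergence(expected, actual_v5))  # (1, 'This', 'Ġ')
```

The two outputs already disagree at the first token after `[CLS]`.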
