transformers version changes the tokenization #45701

@PiRom1

System Info

Python: 3.12.3
transformers: 5.7.0
torch       : 2.11.0+cu130
sentencepiece: 0.2.1

  • Platform: Linux-6.17.0-22-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 0.36.2
  • Safetensors version: 0.7.0
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: <fill in>
  • Using GPU in script?: <fill in>
  • GPU type: NVIDIA GeForce RTX 5060 Laptop GPU

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

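The same tokenizer checkpoint produces different tokens depending on the transformers version:

  • In 4.X: the text is split into words and subwords, as expected
  • In 5.X: the text is split into individual characters

Below is the toy script I ran in two separate virtual environments. The transformers version (5.7.0 vs 4.57.6) was the only intended difference; as the outputs show, the torch version also differs (2.11.0+cu130 vs 2.10.0+cu128):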

import sentencepiece
import torch
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Print library versions so the two environments can be compared.
print("transformers:", transformers.__version__)
print("torch       :", torch.__version__)
print("sentencepiece:", sentencepiece.__version__)

MODEL_DIR = "almanach/camembertv2-base"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

# Tokenize one sentence and show both the IDs and the corresponding tokens.
text = "This is a text example, You could have written anything else if necessary !"
ids = tokenizer(text, return_tensors="pt")["input_ids"]
print("Token IDs:", ids.tolist())
print("Tokens   :", tokenizer.convert_ids_to_tokens(ids[0]))

Outputs with transformers 5.7.0

transformers: 5.7.0
torch       : 2.11.0+cu130
sentencepiece: 0.2.1
Token IDs: [[1, 233, 59, 79, 80, 90, 233, 80, 90, 233, 72, 233, 91, 76, 95, 91, 233, 76, 95, 72, 84, 87, 83, 76, 19, 233, 64, 86, 92, 233, 74, 86, 92, 83, 75, 233, 79, 72, 93, 76, 233, 94, 89, 80, 91, 91, 76, 85, 233, 72, 85, 96, 91, 79, 80, 85, 78, 233, 76, 83, 90, 76, 233, 80, 77, 233, 85, 76, 74, 76, 90, 90, 72, 89, 96, 233, 8, 2]]
Tokens   : ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a', 'Ġ', 't', 'e', 'x', 't', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', ',', 'Ġ', 'Y', 'o', 'u', 'Ġ', 'c', 'o', 'u', 'l', 'd', 'Ġ', 'h', 'a', 'v', 'e', 'Ġ', 'w', 'r', 'i', 't', 't', 'e', 'n', 'Ġ', 'a', 'n', 'y', 't', 'h', 'i', 'n', 'g', 'Ġ', 'e', 'l', 's', 'e', 'Ġ', 'i', 'f', 'Ġ', 'n', 'e', 'c', 'e', 's', 's', 'a', 'r', 'y', 'Ġ', '!', '[SEP]']

Outputs with transformers 4.57.6

transformers: 4.57.6
torch       : 2.10.0+cu128
sentencepiece: 0.2.1
Token IDs: [[1, 13711, 5806, 72, 15532, 28108, 19, 9619, 26650, 13592, 94, 5152, 4954, 22118, 5259, 4985, 10723, 4766, 16562, 25007, 8346, 8, 2]]
Tokens   : ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']
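
The 'Ġ' markers in the 5.7.0 output and the '##' prefixes in the 4.57.6 output suggest that the two versions may be building different tokenizer pipelines from the same checkpoint files. A minimal sketch for checking this in each venv is below. `describe_tokenizer` is a hypothetical helper, not part of transformers, and it assumes the tokenizer exposes a `backend_tokenizer` attribute (as fast tokenizers do in 4.X; whether 5.X exposes it the same way is an assumption). Missing attributes are reported as None rather than raising.

```python
def describe_tokenizer(tok):
    """Return the class names of a tokenizer and its backend components.

    Hypothetical helper for illustration. It assumes `tok` may expose
    `backend_tokenizer` (as fast tokenizers do); anything missing is None.
    """
    backend = getattr(tok, "backend_tokenizer", None)

    def type_name(obj):
        # Report only the class name, which is enough to compare two venvs.
        return None if obj is None else type(obj).__name__

    return {
        "tokenizer_class": type(tok).__name__,
        "is_fast": getattr(tok, "is_fast", None),
        "model": type_name(getattr(backend, "model", None)),
        "pre_tokenizer": type_name(getattr(backend, "pre_tokenizer", None)),
        "normalizer": type_name(getattr(backend, "normalizer", None)),
    }


# Usage in each venv (needs network access or a cached checkpoint):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("almanach/camembertv2-base")
#   print(describe_tokenizer(tok))
```

If the reported model or pre-tokenizer class differs between the two environments, that would point at how the tokenizer is loaded or converted rather than at the vocabulary itself.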

Expected behavior

transformers 5.X should produce the same tokenization as 4.57.6:

['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']
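
To make the expected behavior checkable, one option is a small regression check on the share of single-character tokens. The sketch below uses the token lists reported above (only a prefix of the 5.7.0 output, for brevity). `single_char_ratio` is a hypothetical helper and the special-token set is an assumption based on the two special tokens visible in the outputs.

```python
# Special tokens visible in the outputs above (assumed to be the only ones).
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def single_char_ratio(tokens, special=SPECIAL_TOKENS):
    """Fraction of non-special tokens that are exactly one character long."""
    content = [t for t in tokens if t not in special]
    if not content:
        return 0.0
    return sum(len(t) == 1 for t in content) / len(content)


# Tokens reported for transformers 4.57.6.
tokens_v4 = [
    '[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could',
    'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se',
    'if', 'necess', '##ary', '!', '[SEP]',
]

# First tokens reported for transformers 5.7.0 (prefix only).
tokens_v5_prefix = ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a']

print("4.57.6 single-char ratio:", single_char_ratio(tokens_v4))
print("5.7.0  single-char ratio:", single_char_ratio(tokens_v5_prefix))
```

A ratio close to 1 indicates character-level splitting, so a test could fail when the ratio for an ordinary sentence exceeds some threshold; any specific threshold would be a judgment call.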
