System Info
Upgrading from transformers 4.57.3 to 5.0+ changes the tokenization produced by the same tokenizer, and the new output appears to be incorrect.
I believe the new version is wrong because the model gives worse results with it: unexpected artifacts start appearing in the responses.
Who can help?
@ArthurZucker
Reproduction
Run the following script under both versions (4.57.3 and 5.0+) and compare the printed token IDs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/MiniMax-M2.1-4bit")
messages = [
    {
        "role": "system",
        "content": '"You are opencode, an interactive CLI tool that helps users with software engineering tasks. Use the instructions below and the tools available to you to assist the user.\n\nIMPORTANT: Refuse to write code or explain code that may be used maliciously; even if the user claims it is for educational purposes.'
    }
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=False
)
print(prompt)
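To make the comparison easier than eyeballing two long lists, here is a minimal sketch of a helper that reports where the two outputs first diverge. It assumes the token IDs from each version were saved as a JSON list of integers; the helper names and file names are hypothetical and not part of transformers.

```python
import json
from pathlib import Path


def first_divergence(ids_a, ids_b):
    """Return the first index where the two ID lists differ, or None if identical."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    # One list is a strict prefix of the other: they diverge where the shorter ends.
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))
    return None


def compare_dumps(path_a, path_b, context=5):
    """Load two JSON dumps of token IDs and print the region around the first mismatch."""
    ids_a = json.loads(Path(path_a).read_text())
    ids_b = json.loads(Path(path_b).read_text())
    idx = first_divergence(ids_a, ids_b)
    if idx is None:
        print("token IDs are identical")
    else:
        lo = max(0, idx - context)
        print(f"first difference at index {idx}")
        print("a:", ids_a[lo : idx + context])
        print("b:", ids_b[lo : idx + context])
    return idx
```

In each environment, the final print(prompt) could be replaced with something like json.dump(prompt, open("ids_4.57.3.json", "w")) (and a second, hypothetical file name for 5.0+), after which compare_dumps("ids_4.57.3.json", "ids_5.0.json") shows the first mismatching position. Decoding the IDs around that index with tokenizer.decode should then show which part of the prompt is tokenized differently.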
Expected behavior
Both versions should produce the same token IDs for the same prompt.