Different tokenization with same tokenizer from 4.57.3 to 5.0 #43122

@awni

System Info

Moving from transformers 4.57.3 to 5.0+ introduces a different and seemingly incorrect tokenization when using the same tokenizer.

I believe the new tokenization is the incorrect one: with 5.0+, output quality degrades and the model starts introducing unexpected artifacts in its responses.

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the following under each version (4.57.3 and 5.0+) and compare the token IDs printed for the prompt.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/MiniMax-M2.1-4bit")
messages = [
    {
        "role": "system",
        "content": '"You are opencode, an interactive CLI tool that helps users with software engineering tasks. Use the instructions below and the tools available to you to assist the user.\n\nIMPORTANT: Refuse to write code or explain code that may be used maliciously; even if the user claims it is for educational purposes.'
    }
]
    
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=False
)
print(prompt)
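
Since the two versions cannot be imported in the same process, one way to compare them is to dump the token IDs from each environment to a file and diff the two lists afterwards. The following is only a minimal sketch of that idea using the standard library; the helper names (`save_ids`, `load_ids`, `diff_ids`) and the file names in the usage note are hypothetical and not part of transformers.

```python
import json
from difflib import SequenceMatcher


def save_ids(ids, path):
    """Write a list of token IDs to a JSON file (run once per installed version)."""
    with open(path, "w") as f:
        json.dump(list(ids), f)


def load_ids(path):
    """Read back a list of token IDs written by save_ids."""
    with open(path) as f:
        return json.load(f)


def diff_ids(old, new):
    """Return only the spans where the two token ID lists disagree.

    Each entry is (tag, old_span, new_span), where tag is one of
    "replace", "delete", or "insert" as reported by difflib.
    """
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

For example, call `save_ids(prompt, "ids_old.json")` at the end of the script under 4.57.3 and `save_ids(prompt, "ids_new.json")` under 5.0+, then run `diff_ids(load_ids("ids_old.json"), load_ids("ids_new.json"))`. An empty list means the tokenizations match; otherwise the differing ID spans can be passed to `tokenizer.convert_ids_to_tokens` to see which pieces of text are split differently.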

Expected behavior

Both versions should produce the same token IDs for this prompt.
