
Wrong tokenizer decoder type in Transformers v5 #43066

@awni

Description


System Info

Wrong decoder type with 5.0.0rc1.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
print(tokenizer.decoder)

With Transformers v5 you get:

Sequence(decoders=[Replace(pattern=String("▁"), content=" "), ByteFallback(), Fuse(), Strip(content=" ", start=1, stop=0)])

And with Transformers 4.57.3 and earlier you get:

ByteLevel(add_prefix_space=True, trim_offsets=True, use_regex=True)

Is it expected that this changed?
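For illustration, both reported decoder pipelines can be rebuilt directly with the tokenizers library to show why the type matters for a byte-level (Llama-style) vocabulary. This is a standalone sketch; the token strings are made-up examples, not taken from the model above:

```python
from tokenizers import decoders

# Byte-level BPE vocabularies (Llama 3 style) mark a leading space with "Ġ".
tokens = ["Hello", "Ġworld"]

# Decoder reported under Transformers 4.57.3: maps the byte-level
# alphabet back to bytes, so "Ġ" becomes a space.
byte_level = decoders.ByteLevel()
print(byte_level.decode(tokens))  # "Hello world"

# Decoder reported under v5: a SentencePiece-style pipeline that only
# knows about the "▁" space marker, so it leaves "Ġ" untouched.
sp_style = decoders.Sequence([
    decoders.Replace("▁", " "),
    decoders.ByteFallback(),
    decoders.Fuse(),
    decoders.Strip(" ", 1, 0),
])
print(sp_style.decode(tokens))        # "HelloĠworld"
print(sp_style.decode(["▁Hello", "▁world"]))  # "Hello world" (its intended alphabet)
```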

Expected behavior

The same decoder type as with Transformers 4.57.3.
