AutoTokenizer ignores tokenizer.json from the repository #44462

@apaniukov

System Info

  • transformers version: 5.3.0
  • Python version: 3.10.12
  • Huggingface_hub version: 1.5.0

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

AutoTokenizer does not build the tokenizer from the tokenizer.json file in the repository. Saving the loaded tokenizer produces a different tokenizer.json file, with a different normalizer and pre-tokenizer.

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
hf_tokenizer.save_pretrained("hf_deepseek_tokenizer/")

Attached files: tokenizer_original.json, tokenizer_saved.json

Original normalizer/pre-tokenizer:

  "normalizer": {
    "type": "Sequence",
    "normalizers": []
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[\r\n]"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{L}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{P}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "[一-龥ࠀ-一가-퟿]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }

Saved normalizer/pre-tokenizer:

  "normalizer": null,
  "pre_tokenizer": {
    "type": "Metaspace",
    "replacement": "",
    "prepend_scheme": "always",
    "split": false
  },
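
The difference between the two files can also be checked programmatically. The following is a minimal sketch using only the standard library; `diff_sections` and `load_tokenizer_json` are hypothetical helper names, not part of transformers, and the inline dictionaries are abbreviated excerpts of the two snippets above rather than the full files.

```python
import json
from pathlib import Path


def load_tokenizer_json(path):
    """Read a tokenizer.json file into a dict (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def diff_sections(original, saved, keys=("normalizer", "pre_tokenizer")):
    """Return {key: (original_value, saved_value)} for each key whose values differ."""
    return {
        key: (original.get(key), saved.get(key))
        for key in keys
        if original.get(key) != saved.get(key)
    }


# Abbreviated excerpt of the original file (Split entries omitted for brevity).
original = {
    "normalizer": {"type": "Sequence", "normalizers": []},
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [
            {"type": "Digits", "individual_digits": True},
            {
                "type": "ByteLevel",
                "add_prefix_space": False,
                "trim_offsets": True,
                "use_regex": False,
            },
        ],
    },
}

# Excerpt of the saved file, copied as shown in this report.
saved = {
    "normalizer": None,
    "pre_tokenizer": {
        "type": "Metaspace",
        "replacement": "",
        "prepend_scheme": "always",
        "split": False,
    },
}

changed = diff_sections(original, saved)
for key, (before, after) in changed.items():
    before_type = before["type"] if before else None
    after_type = after["type"] if after else None
    print(f"{key}: {before_type} -> {after_type}")
```

To compare the real files instead of the excerpts, the same function could be called as `diff_sections(load_tokenizer_json("tokenizer_original.json"), load_tokenizer_json("tokenizer_saved.json"))`, assuming both attachments have been downloaded to the working directory.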

Expected behavior

The original tokenizer.json from the repository should be used to instantiate the tokenizer, so that saving it reproduces the same normalizer and pre-tokenizer.
