System Info
- transformers version: 5.3.0
- Python version: 3.10.12
- Huggingface_hub version: 1.5.0
Who can help?
@ArthurZucker and @itazap
Reproduction
AutoTokenizer doesn't load the tokenizer from the tokenizer.json in the repository. Saving the loaded tokenizer produces a different tokenizer.json file: the normalizer and pre-tokenizer are replaced.
from transformers import AutoTokenizer

# Load the tokenizer from the Hub, then save it locally to compare the resulting tokenizer.json.
hf_tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
hf_tokenizer.save_pretrained("hf_deepseek_tokenizer/")
tokenizer_original.json
tokenizer_saved.json
Original normalizer/pre-tokenizer:
"normalizer": {
"type": "Sequence",
"normalizers": []
},
"pre_tokenizer": {
"type": "Sequence",
"pretokenizers": [
{
"type": "Split",
"pattern": {
"Regex": "[\r\n]"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Split",
"pattern": {
"Regex": "\\s?\\p{L}+"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Split",
"pattern": {
"Regex": "\\s?\\p{P}+"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Split",
"pattern": {
"Regex": "[一-龥ࠀ-一가-]+"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Digits",
"individual_digits": true
},
{
"type": "ByteLevel",
"add_prefix_space": false,
"trim_offsets": true,
"use_regex": false
}
]
}
Saved normalizer/pre-tokenizer:
"normalizer": null,
"pre_tokenizer": {
"type": "Metaspace",
"replacement": "▁",
"prepend_scheme": "always",
"split": false
},
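The mismatch above can also be checked programmatically. The following is a minimal sketch that uses only the standard library; the helper names `diff_tokenizer_sections` and `compare_files` are hypothetical and are not part of transformers or tokenizers. It assumes both tokenizer.json files are already on disk.

```python
import json


def diff_tokenizer_sections(original, saved, keys=("normalizer", "pre_tokenizer")):
    """Return {key: (original_value, saved_value)} for every key whose values differ.

    `original` and `saved` are the parsed contents of two tokenizer.json files.
    """
    return {
        key: (original.get(key), saved.get(key))
        for key in keys
        if original.get(key) != saved.get(key)
    }


def compare_files(original_path, saved_path):
    """Load two tokenizer.json files (hypothetical local paths) and diff them."""
    with open(original_path, encoding="utf-8") as f:
        original = json.load(f)
    with open(saved_path, encoding="utf-8") as f:
        saved = json.load(f)
    return diff_tokenizer_sections(original, saved)
```

For example, `compare_files("tokenizer_original.json", "hf_deepseek_tokenizer/tokenizer.json")` would return an empty dict if the saved file preserved the normalizer and pre-tokenizer, and one entry per changed section otherwise.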
Expected behavior
The original tokenizer.json from the repository should be used to instantiate the tokenizer.