Closed
System Info
- transformers version: 4.30.2
- Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.14.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Reproduction
Comparing the slow and fast LlamaTokenizer instances loaded from huggyllama/llama-7b:
from transformers import AutoTokenizer
model = "huggyllama/llama-7b"
fast = AutoTokenizer.from_pretrained(model)
slow = AutoTokenizer.from_pretrained(model, use_fast=False)
# use tokenize()
print(fast.tokenize("<s>uns"), slow.tokenize("<s>uns"))
# -> (['▁<s>', 'uns'], ['<s>', '▁uns'])
# use __call__
print(fast(f"{fast.bos_token}uns", add_special_tokens=False), slow(f"{slow.bos_token}uns", add_special_tokens=False))
# -> ({'input_ids': [1, 6948], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]},
# {'input_ids': [1, 9644], 'attention_mask': [1, 1]})
# round-tripping
print(fast.convert_tokens_to_string(fast.tokenize("<s>uns")), fast.convert_tokens_to_string(slow.tokenize("<s>uns")))
# -> ('<s>uns', '<s> uns')
Expected behavior
It looks like the slow LlamaTokenizer wrongly tokenises uns when it follows the special token: I would not expect the additional whitespace, either when round-tripping or when tokenising in the first place.
Thanks a lot in advance.
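For what it's worth, here is a minimal pure-Python sketch of what I suspect is happening. The function and the splitting order are my assumptions for illustration, not the actual transformers implementation: if special tokens are split out before the SentencePiece metaspace step, every remaining segment gets a fresh word-start prefix, turning the subword uns into the word ▁uns.

```python
# Hypothetical sketch (NOT the actual transformers source) of why the
# slow tokenizer could emit ['<s>', '▁uns'] while the fast one emits
# ['▁<s>', 'uns'].

SPECIAL = "<s>"
METASPACE = "\u2581"  # '▁', SentencePiece's space marker

def slow_style_pretokenize(text: str) -> list[str]:
    """Split on the special token first, then metaspace-prefix each
    remaining segment as if it started a new word (suspected slow path)."""
    pieces = []
    for i, segment in enumerate(text.split(SPECIAL)):
        if i > 0:
            pieces.append(SPECIAL)
        if segment:
            pieces.append(METASPACE + segment.replace(" ", METASPACE))
    return pieces

print(slow_style_pretokenize("<s>uns"))  # -> ['<s>', '▁uns'], matching the report
```

Under this assumption, the extra ▁ is exactly the whitespace that shows up in the round-trip above.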