
LlamaTokenizer: Slow implementation opts for whitespace-lead token (different from fast) #24569

@lbeurerkellner

Description


System Info

  • transformers version: 4.30.2
  • Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker @youn

Information

  • The official example scripts
  • My own modified scripts

Reproduction

Compare slow and fast LlamaTokenizer instances for huggyllama/llama-7b; they disagree when a special token is immediately followed by text.

from transformers import AutoTokenizer

model = "huggyllama/llama-7b"

fast = AutoTokenizer.from_pretrained(model)
slow = AutoTokenizer.from_pretrained(model, use_fast=False)

# use tokenize()
print(fast.tokenize("<s>uns"), slow.tokenize("<s>uns"))
# -> (['▁<s>', 'uns'], ['<s>', '▁uns'])

# use __call__
print(fast(f"{fast.bos_token}uns", add_special_tokens=False), slow(f"{slow.bos_token}uns", add_special_tokens=False))
# -> ({'input_ids': [1, 6948], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]},
#     {'input_ids': [1, 9644], 'attention_mask': [1, 1]})

# round-tripping
print(fast.convert_tokens_to_string(fast.tokenize("<s>uns")), fast.convert_tokens_to_string(slow.tokenize("<s>uns")))
# -> ('<s>uns', '<s> uns')
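A plausible explanation of the discrepancy, as a minimal pure-Python sketch (not the actual transformers implementation): the slow tokenizer splits the input on added special tokens and then tokenizes each remaining segment independently, and SentencePiece-style normalization with a dummy prefix prepends "▁" to every segment, including the one right after <s>. The fast tokenizer processes the string in a single pass, so no prefix is injected mid-string. The vocabulary and the greedy matcher below are toy stand-ins chosen only to reproduce the observed tokens.

```python
import re

def sp_tokenize(segment, vocab_pieces):
    # Stand-in for SentencePiece with add_dummy_prefix=True: prepend "▁"
    # (U+2581), then split greedily by longest match over a toy vocabulary.
    # (Real SentencePiece uses a unigram LM; greedy matching suffices here.)
    text = "\u2581" + segment.replace(" ", "\u2581")
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab_pieces:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

SPECIAL = ["<s>", "</s>"]
VOCAB = {"\u2581uns", "uns", "\u2581", "u", "n", "s"}

def slow_style_tokenize(text):
    # Split on special tokens first, then tokenize each segment separately.
    # Each non-empty segment gets its own dummy prefix -> the extra "▁".
    pattern = "(" + "|".join(map(re.escape, SPECIAL)) + ")"
    out = []
    for part in re.split(pattern, text):
        if part in SPECIAL:
            out.append(part)
        elif part:
            out.extend(sp_tokenize(part, VOCAB))
    return out

print(slow_style_tokenize("<s>uns"))  # -> ['<s>', '▁uns']
```

This reproduces the slow tokenizer's ['<s>', '▁uns'] above: the whitespace marker comes from re-applying the dummy prefix to the segment after the special token, not from the input itself.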

Expected behavior

It looks like the slow LlamaTokenizer wrongly tokenises uns when it directly follows a special token. I would not expect the additional whitespace, either when tokenising in the first place or when round-tripping through convert_tokens_to_string.

Thanks a lot in advance.
