
LlamaTokenizer: Slow implementation opts for whitespace-lead token (different from fast) #24569

@lbeurerkellner

Description


System Info

  • transformers version: 4.30.2
  • Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker @youn

Information

  • The official example scripts
  • My own modified scripts

Reproduction

Compare slow and fast LlamaTokenizer instances for huggyllama/llama-7b; they disagree when a special token is immediately followed by text.

from transformers import AutoTokenizer

model = "huggyllama/llama-7b"

fast = AutoTokenizer.from_pretrained(model)
slow = AutoTokenizer.from_pretrained(model, use_fast=False)

# use tokenize()
print(fast.tokenize("<s>uns"), slow.tokenize("<s>uns"))
# -> (['▁<s>', 'uns'], ['<s>', '▁uns'])

# use __call__
print(fast(f"{fast.bos_token}uns", add_special_tokens=False), slow(f"{slow.bos_token}uns", add_special_tokens=False))
# -> ({'input_ids': [1, 6948], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]},
#     {'input_ids': [1, 9644], 'attention_mask': [1, 1]})

# round-tripping
print(fast.convert_tokens_to_string(fast.tokenize("<s>uns")), fast.convert_tokens_to_string(slow.tokenize("<s>uns")))
# -> ('<s>uns', '<s> uns')
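A plausible explanation of the discrepancy, as a minimal pure-Python sketch (not the actual transformers implementation): the slow tokenizer splits the input on added special tokens and then tokenizes each remaining segment independently, and SentencePiece-style normalization with a dummy prefix prepends "▁" to every segment, including the one right after <s>. The fast tokenizer processes the string in a single pass, so no prefix is injected mid-string. The vocabulary and the greedy matcher below are toy stand-ins chosen only to reproduce the observed tokens.

```python
import re

def sp_tokenize(segment, vocab_pieces):
    # Stand-in for SentencePiece with add_dummy_prefix=True: prepend "▁"
    # (U+2581), then split greedily by longest match over a toy vocabulary.
    # (Real SentencePiece uses a unigram LM; greedy matching suffices here.)
    text = "\u2581" + segment.replace(" ", "\u2581")
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab_pieces:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

SPECIAL = ["<s>", "</s>"]
VOCAB = {"\u2581uns", "uns", "\u2581", "u", "n", "s"}

def slow_style_tokenize(text):
    # Split on special tokens first, then tokenize each segment separately.
    # Each non-empty segment gets its own dummy prefix -> the extra "▁".
    pattern = "(" + "|".join(map(re.escape, SPECIAL)) + ")"
    out = []
    for part in re.split(pattern, text):
        if part in SPECIAL:
            out.append(part)
        elif part:
            out.extend(sp_tokenize(part, VOCAB))
    return out

print(slow_style_tokenize("<s>uns"))  # -> ['<s>', '▁uns']
```

This reproduces the slow tokenizer's ['<s>', '▁uns'] above: the whitespace marker comes from re-applying the dummy prefix to the segment after the special token, not from the input itself.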

Expected behavior

It looks like the slow LlamaTokenizer wrongly tokenises uns when it directly follows a special token. I would not expect the additional whitespace, either when tokenising in the first place or when round-tripping through convert_tokens_to_string.

Thanks a lot in advance.
