# xlm-roberta-base directory: git clone https://huggingface.co/xlm-roberta-base
from transformers import XLMRobertaTokenizer

# Build the same tokenizer two ways: (a) from the full pretrained directory,
# (b) directly from the raw SentencePiece model file, then compare how each
# tokenizes a string containing the '<s>' special token.
tok_from_dir = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base/')
tok_from_spm = XLMRobertaTokenizer('xlm-roberta-base/sentencepiece.bpe.model')

sample = 'texta<s>textb'
print(tok_from_dir.tokenize(sample))
print(tok_from_spm.tokenize(sample))
# What I expect is that both tokenizers produce the same output:
# ['▁text', 'a', '<s>', '▁text', 'b']
# ['▁text', 'a', '<s>', '▁text', 'b']
# However, in reality, their outputs are as follows:
# ['▁text', 'a', '<s>', '▁text', 'b']
# ['▁text', 'a', '<', 's', '>', 'text', 'b']
# i.e. the tokenizer built directly from the SentencePiece model does not
# treat '<s>' as a single special token and splits it into raw pieces.