
Different results between AlbertTokenizer and AlbertTokenizerFast modules with a new spiece.model file #11358

@SaulLu

Description

Hello!

I would like to ask your opinion about a tokenizer behavior. In a project, I need to train a new tokenizer in order to pre-train an Albert model from scratch. I don't know if I did something wrong (and if I did, I'd love to know!), but at the moment the same text is not tokenized the same way by AlbertTokenizer and AlbertTokenizerFast.

Thanks a lot for your time in advance 😄

To reproduce

Steps to reproduce the behavior:

  1. Train a tokenizer with the sentencepiece library. The resulting model is saved under the name spiece.model. I can share it if needed.
  2. Assuming that only the spiece.model file is in the current directory, run the following blocks of code:
import os
import sentencepiece as spm
from transformers import AlbertTokenizer, AlbertTokenizerFast

tokenizer_dir_path = "."
text = "a\n b"

Cell:

albert_tokenizer = AlbertTokenizer.from_pretrained(tokenizer_dir_path)

print("ids", albert_tokenizer.encode(text))
print("ids -> ids_token", albert_tokenizer.convert_ids_to_tokens(albert_tokenizer.encode(text)))

Output:

ids [2, 1842, 5132, 3]
ids -> ids_token ['[CLS]', '▁a', '▁b', '[SEP]']

Cell:

albert_tokenizer_fast = AlbertTokenizerFast.from_pretrained(tokenizer_dir_path)

print("ids", albert_tokenizer_fast.encode(text))
print("ids -> ids_token", albert_tokenizer_fast.convert_ids_to_tokens(albert_tokenizer_fast.encode(text)))

Output:

ids [2, 1127, 266, 3157, 3]
ids -> ids_token ['[CLS]', '▁a', '', '▁b', '[SEP]']

Cell:

sp = spm.SentencePieceProcessor(model_file=os.path.join(tokenizer_dir_path, "spiece.model"))

print("ids", sp.encode(text))
print("ids -> ids_token", sp.id_to_piece(sp.encode(text)))

Output:

ids [1127, 3157]
ids -> ids_token ['▁a', '▁b']
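
For what it's worth, my understanding is that the slow AlbertTokenizer cleans up whitespace in Python before SentencePiece ever sees the text (assuming remove_space=True, its default), which would explain why "\n" disappears from its output above. A minimal sketch of that step:

```python
# Sketch of the whitespace cleanup the slow AlbertTokenizer applies before
# encoding (assuming remove_space=True, its default): runs of whitespace,
# including "\n", are collapsed to single spaces.
def preprocess_like_slow_albert(text: str) -> str:
    return " ".join(text.strip().split())

print(preprocess_like_slow_albert("a\n b"))  # -> "a b"
```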

Other variations:
I also tried to instantiate the tokenizer directly with AlbertTokenizerFast(vocab_file=os.path.join(tokenizer_dir_path, "spiece.model")).

Expected behavior

I expected the two modules, AlbertTokenizer and AlbertTokenizerFast, to give the same result. In particular, I did not expect "\n" to be mapped to an extra token (the empty piece "" with id 266) by AlbertTokenizerFast.
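
To pin down exactly where the two sequences diverge, the token lists from the cells above can be diffed with the standard library (the lists are copied verbatim from the outputs; difflib is stdlib):

```python
import difflib

# Token lists copied from the outputs above.
slow = ['[CLS]', '▁a', '▁b', '[SEP]']      # AlbertTokenizer
fast = ['[CLS]', '▁a', '', '▁b', '[SEP]']  # AlbertTokenizerFast

sm = difflib.SequenceMatcher(a=slow, b=fast)
for op, i1, i2, j1, j2 in sm.get_opcodes():
    if op != "equal":
        # The fast tokenizer inserts an extra empty piece where "\n" was.
        print(op, slow[i1:i2], fast[j1:j2])  # -> insert [] ['']
```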

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (False)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten

Information

Model I am using (Bert, XLNet ...): Albert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)
