Description
Hello!
I would like to ask your opinion about a tokenizer behavior. In a project, I have to train a new tokenizer to re-pretrain an Albert model. I don't know if I did something wrong (and if I did, I'd love to know!) but for the moment a text is not tokenized in the same way with AlbertTokenizer and AlbertTokenizerFast.
Thanks a lot for your time in advance 😄
To reproduce
Steps to reproduce the behavior:
- Train a tokenizer with the sentencepiece library. The resulting tokenizer is saved under the name spiece.model. I can share it if needed.
- Assuming that only the spiece.model file is in the root, run the following blocks of code:

Cell:

import os
import sentencepiece as spm
from transformers import AlbertTokenizer, AlbertTokenizerFast

tokenizer_dir_path = "."
text = "a\n b"

Cell:
albert_tokenizer = AlbertTokenizer.from_pretrained(tokenizer_dir_path)
print("ids", albert_tokenizer.encode(text))
print("ids -> ids_token", albert_tokenizer.convert_ids_to_tokens(albert_tokenizer.encode(text)))

Output:

ids [2, 1842, 5132, 3]
ids -> ids_token ['[CLS]', '▁a', '▁b', '[SEP]']

Cell:
albert_tokenizer_fast = AlbertTokenizerFast.from_pretrained(tokenizer_dir_path)
print("ids", albert_tokenizer_fast.encode(text))
print("ids -> ids_token", albert_tokenizer_fast.convert_ids_to_tokens(albert_tokenizer_fast.encode(text)))

Output:

ids [2, 1127, 266, 3157, 3]
ids -> ids_token ['[CLS]', '▁a', '▁', '▁b', '[SEP]']
Cell:
sp = spm.SentencePieceProcessor(model_file=os.path.join(tokenizer_dir_path, "spiece.model"))
print("ids", sp.encode(text))
print("ids -> ids_token", sp.id_to_piece(sp.encode(text)))

Output:

ids [1127, 3157]
ids -> ids_token ['▁a', '▁b']

Other variations:
I also tried to instantiate the tokenizer directly, like this: AlbertTokenizerFast(vocab_file=os.path.join(tokenizer_dir_path, "spiece.model")).
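One explanation that would be consistent with the outputs above is a difference in whitespace normalization (this is a guess on my part, not something I have traced through the transformers source): the slow tokenizer seems to collapse every run of whitespace into a single space before SentencePiece sees the text, so "a\n b" becomes "a b", while the fast tokenizer's normalizer keeps the "\n" as a separate whitespace character, which SentencePiece then renders as a lone "▁" piece. A minimal sketch of the assumed collapsing step, using only the standard library:

```python
# Hypothesis: the slow AlbertTokenizer pre-processes the input roughly
# like this before handing it to SentencePiece, which would remove the
# "\n" and explain why it never produces the extra '▁' token.
text = "a\n b"

# str.split() with no argument splits on any whitespace run ("\n ", tabs,
# multiple spaces, ...), so joining with a single space collapses them all.
collapsed = " ".join(text.split())

print(repr(collapsed))  # 'a b'
```

If that is indeed what happens, the fast tokenizer would be encoding "a\n b" (or "a  b") while the slow one encodes "a b", which matches the extra '▁' in the fast tokenizer's output.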
Expected behavior
I expected to get the same result from both tokenizers, AlbertTokenizer and AlbertTokenizerFast. In particular, I did not expect "\n" to be tokenized as "▁" in the case of AlbertTokenizerFast.
Environment info
- transformers version: 4.5.1
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.8.1+cu101 (False)
- Tensorflow version (GPU?): 2.4.1 (False)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using (Bert, XLNet ...): Albert
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)