
Different results between AlbertTokenizer and AlbertTokenizerFast modules with a new spiece.model file #11358

@SaulLu

Description

Hello!

I would like to ask your opinion about a tokenizer behavior. In a project, I need to train a new tokenizer in order to pre-train an Albert model from scratch. I don't know if I did something wrong (and if I did, I'd love to know!), but at the moment the same text is not tokenized the same way by AlbertTokenizer and AlbertTokenizerFast.

Thanks a lot for your time in advance 😄

To reproduce

Steps to reproduce the behavior:

  1. Train a tokenizer with the sentencepiece library. The resulting model is saved under the name spiece.model. I can share it if needed.
  2. Assuming that only the spiece.model file is in the current directory, run the following blocks of code:
import os
import sentencepiece as spm
from transformers import AlbertTokenizer, AlbertTokenizerFast

tokenizer_dir_path = "."
text = "a\n b"

Cell:

albert_tokenizer = AlbertTokenizer.from_pretrained(tokenizer_dir_path)

print("ids", albert_tokenizer.encode(text))
print("ids -> ids_token", albert_tokenizer.convert_ids_to_tokens(albert_tokenizer.encode(text)))

Output:

ids [2, 1842, 5132, 3]
ids -> ids_token ['[CLS]', '▁a', '▁b', '[SEP]']

Cell:

albert_tokenizer_fast = AlbertTokenizerFast.from_pretrained(tokenizer_dir_path)

print("ids", albert_tokenizer_fast.encode(text))
print("ids -> ids_token", albert_tokenizer_fast.convert_ids_to_tokens(albert_tokenizer_fast.encode(text)))

Output:

ids [2, 1127, 266, 3157, 3]
ids -> ids_token ['[CLS]', '▁a', '', '▁b', '[SEP]']

Cell:

sp = spm.SentencePieceProcessor(model_file=os.path.join(tokenizer_dir_path, "spiece.model"))

print("ids", sp.encode(text))
print("ids -> ids_token", sp.id_to_piece(sp.encode(text)))

Output:

ids [1127, 3157]
ids -> ids_token ['▁a', '▁b']
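
For what it's worth, my understanding is that the slow AlbertTokenizer cleans up whitespace in Python before SentencePiece ever sees the text (assuming remove_space=True, its default), which would explain why "\n" disappears from its output above. A minimal sketch of that step:

```python
# Sketch of the whitespace cleanup the slow AlbertTokenizer applies before
# encoding (assuming remove_space=True, its default): runs of whitespace,
# including "\n", are collapsed to single spaces.
def preprocess_like_slow_albert(text: str) -> str:
    return " ".join(text.strip().split())

print(preprocess_like_slow_albert("a\n b"))  # -> "a b"
```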

Other variations:
I also tried to instantiate the tokenizer directly with AlbertTokenizerFast(vocab_file=os.path.join(tokenizer_dir_path, "spiece.model")).

Expected behavior

I expected the two modules, AlbertTokenizer and AlbertTokenizerFast, to give the same result. In particular, I did not expect "\n" to be mapped to an extra token (the empty piece "" with id 266) by AlbertTokenizerFast.
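
To pin down exactly where the two sequences diverge, the token lists from the cells above can be diffed with the standard library (the lists are copied verbatim from the outputs; difflib is stdlib):

```python
import difflib

# Token lists copied from the outputs above.
slow = ['[CLS]', '▁a', '▁b', '[SEP]']      # AlbertTokenizer
fast = ['[CLS]', '▁a', '', '▁b', '[SEP]']  # AlbertTokenizerFast

sm = difflib.SequenceMatcher(a=slow, b=fast)
for op, i1, i2, j1, j2 in sm.get_opcodes():
    if op != "equal":
        # The fast tokenizer inserts an extra empty piece where "\n" was.
        print(op, slow[i1:i2], fast[j1:j2])  # -> insert [] ['']
```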

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (False)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten

Information

Model I am using (Bert, XLNet ...): Albert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)
