
Deberta Tokenizer convert_ids_to_tokens() is not giving expected results #10258

@bhadreshpsavani

Description


Environment info

  • transformers version: 4.3.0
  • Platform: Colab
  • Python version: 3.9
  • PyTorch version (GPU?): No
  • Tensorflow version (GPU?): No
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Information

I am using the DeBERTa tokenizer. convert_ids_to_tokens() of the tokenizer is not working as expected.

The problem arises when using:

  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset

To reproduce

Steps to reproduce the behavior:

  1. Get the DeBERTa tokenizer
from transformers import DebertaTokenizer
deberta_tokenizer = DebertaTokenizer.from_pretrained('microsoft/deberta-base')
  2. Encode an example using the tokenizer
example = "Hi I am Bhadresh. I found an issue in Deberta Tokenizer"
encoded_example = deberta_tokenizer.encode(example)
  3. Convert the ids back to tokens:
deberta_tokenizer.convert_ids_to_tokens(encoded_example)
"""
Output: ['[CLS]', '17250', '314', '716', '16581', '324', '3447', '13', '314', '1043', '281', '2071', '287', '1024', '4835', '64', '29130', '7509', '[SEP]']
"""

Colab Link For Reproducing

Expected behavior

It should return actual tokens, something like this:

['[CLS]', 'hi', 'i', 'am', 'b', '##had', '##resh', '.', 'i', 'found', 'an', 'issue', 'in', 'de', '##bert', '##a', 'token', '##izer', '[SEP]']

not just each integer id converted to a string, as in the current behavior.
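The difference between the expected and observed behavior can be sketched with a toy vocabulary (the vocabulary and ids below are hypothetical, purely for illustration; they are not the real DeBERTa vocabulary):

```python
# Toy id-to-token vocabulary (hypothetical, for illustration only).
toy_vocab = {0: "[CLS]", 1: "hi", 2: "i", 3: "am", 4: "[SEP]"}

def expected_convert_ids_to_tokens(ids):
    # Expected behavior: look each id up in the vocabulary.
    return [toy_vocab[i] for i in ids]

def observed_convert_ids_to_tokens(ids):
    # Observed behavior: each id is merely cast to its string form.
    return [str(i) for i in ids]

ids = [0, 1, 2, 3, 4]
print(expected_convert_ids_to_tokens(ids))  # ['[CLS]', 'hi', 'i', 'am', '[SEP]']
print(observed_convert_ids_to_tokens(ids))  # ['0', '1', '2', '3', '4']
```

The observed output in the reproduction above matches the second function: every id between the special tokens comes back as a stringified integer rather than a vocabulary token.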

Tagging SMEs for help:

@n1t0, @LysandreJik
