
convert_tokens_to_string does not conform to its signature #16525

@inspiralpatterns

Description


Environment info

  • transformers version: 4.17.0
  • Platform: macOS-11.6.4-x86_64-i386-64bit
  • Python version: 3.9.10
  • PyTorch version (GPU?): 1.11.0 (False)
  • Tensorflow version (GPU?): 2.7.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False

Who can help

@SaulLu

Information

Model I am using (Bert, XLNet ...): AutoModelForQuestionAnswering

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: Question Answering
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Using the official example script (omitted here; only the output is posted):

Question: How many pretrained models are available in 🤗 Transformers?
Answer: ['over', ' 32', ' +']
Question: What does 🤗 Transformers provide?
Answer: ['general', ' -', ' purpose', ' architecture', 's']
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch']

Using the model in our context:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = "Hello my browser is not working, I need help."
questions = [
    "What is the issue?",
    "What is the request?",
]


def extract_answer_idxs(start_logits, end_logits):
    # Highest-scoring start index and (exclusive) end index of the answer span.
    answer_start = torch.argmax(start_logits)
    answer_end = torch.argmax(end_logits) + 1
    return answer_start, answer_end

# Batch both questions against the same context.
text = [text] * len(questions)
inputs = tokenizer(questions, text, add_special_tokens=True, return_tensors="pt", max_length=512, truncation=True)
input_ids = inputs["input_ids"].tolist()
outputs = model(**inputs)

# Pair each sequence's logits into (start, end) index tuples, then decode each span.
idxs = map(extract_answer_idxs, outputs.start_logits, outputs.end_logits)
answers = [
    tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(ids[start:end]))
    for ids, (start, end) in zip(input_ids, idxs)
]

print(f"Questions: {questions}")
print(f"Answers: {answers}")

Result:

Questions: ['What is the issue?', 'What is the request?']
Answers: [['my', ' browser', ' is', ' not', ' working'], ['help']]

(I also tried running the questions in a loop instead of as a batch and got the identical result.)

Expected behavior

Questions: ['What is the issue?', 'What is the request?']
Answers: ['my browser is not working', 'help']

As the docs show, I expect a string, not a list of tokens.
Please also notice how whitespace is somehow introduced into some of the tokens.
Furthermore, some words are split across pieces, e.g. ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch'].

I expect convert_tokens_to_string to return a str, as it was previously.
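Until this is resolved, a minimal pure-Python workaround is possible: the pieces in the returned list already carry their own leading spaces (as the output above shows), so simply concatenating them reconstructs the sentence. The helper below is my own sketch, not part of the library API:

```python
def pieces_to_text(pieces):
    """Join decoded pieces back into a single string.

    convert_tokens_to_string is documented to return a str; when it
    returns a list of already-detokenized pieces instead, each piece
    carries its own leading space, so a plain join reconstructs the text.
    """
    if isinstance(pieces, str):  # already the documented behavior
        return pieces
    return "".join(pieces).strip()


print(pieces_to_text(['my', ' browser', 'is', ' not', ' working'][:2] + [' is', ' not', ' working'][1:]))
```

For example, `pieces_to_text(['my', ' browser', ' is', ' not', ' working'])` gives `'my browser is not working'`, matching the expected output above.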
