Whisper: incorrect list of non speech tokens #20123

@guillaumekln

System Info

  • transformers version: 4.24.0
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.12.1+cu102 (True)

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The lists NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI contain token IDs 6 and 12, which the reference implementation does not suppress by default.

Consider the following example using the reference whisper module:

import transformers
from whisper.tokenizer import get_tokenizer

# Tokenizer from the reference whisper module.
tokenizer = get_tokenizer(multilingual=True, task="transcribe", language="fr")

# Build the reference suppression list: the non-speech tokens
# plus the special tokens sot, sot_prev, sot_lm and no_speech.
suppress_tokens = sorted(
    tokenizer.non_speech_tokens
    + (tokenizer.sot, tokenizer.sot_prev, tokenizer.sot_lm, tokenizer.no_speech)
)

config = transformers.WhisperConfig.from_pretrained("openai/whisper-tiny")
print(suppress_tokens == config.suppress_tokens)  # prints False

# Dropping tokens 6 and 12 makes the two lists identical.
config.suppress_tokens.remove(6)
config.suppress_tokens.remove(12)
print(suppress_tokens == config.suppress_tokens)  # prints True
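
The comparison above only reports whether the two lists are equal. A set difference shows which IDs actually differ. The following is a minimal sketch: `diff_token_lists` is a hypothetical helper, not part of transformers or whisper, and the sample lists are illustrative values rather than the real suppression lists.

```python
def diff_token_lists(reference, candidate):
    """Return (extra, missing) token IDs of candidate relative to reference."""
    reference_set, candidate_set = set(reference), set(candidate)
    extra = sorted(candidate_set - reference_set)    # only in candidate
    missing = sorted(reference_set - candidate_set)  # only in reference
    return extra, missing


# Illustrative values only, not the real Whisper lists.
reference_ids = [1, 2, 7, 8]
candidate_ids = [1, 2, 6, 7, 8, 12]

extra, missing = diff_token_lists(reference_ids, candidate_ids)
print(extra)    # [6, 12]
print(missing)  # []
```

In the reproduction, passing `suppress_tokens` as the reference and `config.suppress_tokens` as the candidate (before the two `remove` calls) should report 6 and 12 as the extra IDs, assuming the lists otherwise match as described.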

Expected behavior

The list of suppressed tokens should match the reference implementation.
