Closed
Description
System Info
- transformers version: 4.24.0
- Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
- Python version: 3.10.6
- Huggingface_hub version: 0.10.1
- PyTorch version (GPU?): 1.12.1+cu102 (True)
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The lists NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI contain tokens 6 and 12, which the reference implementation does not suppress by default.
Consider the following example using the reference whisper module:
import transformers
from whisper.tokenizer import get_tokenizer

# Build the default suppression list the way the reference implementation does
tokenizer = get_tokenizer(multilingual=True, task="transcribe", language="fr")
suppress_tokens = list(
    sorted(
        tokenizer.non_speech_tokens
        + (tokenizer.sot, tokenizer.sot_prev, tokenizer.sot_lm, tokenizer.no_speech)
    )
)

# Compare against the list shipped in the transformers config
config = transformers.WhisperConfig.from_pretrained("openai/whisper-tiny")
print(suppress_tokens == config.suppress_tokens)  # prints False
config.suppress_tokens.remove(6)
config.suppress_tokens.remove(12)
print(suppress_tokens == config.suppress_tokens)  # prints True

Expected behavior
The list of suppressed tokens should match the reference implementation.
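An equality check only says that the two lists differ, not where. A set difference names the offending ids directly. This is a minimal sketch: the helper name diff_token_ids and the toy id lists are hypothetical and stand in for the real suppress_tokens and config.suppress_tokens values from the reproduction.

```python
def diff_token_ids(reference, candidate):
    """Return (extra, missing): ids only in candidate, ids only in reference."""
    ref, cand = set(reference), set(candidate)
    return sorted(cand - ref), sorted(ref - cand)


# Hypothetical toy lists, NOT the real Whisper token ids.
reference_ids = [1, 2, 7, 8, 9]
candidate_ids = [1, 2, 6, 7, 8, 9, 12]

extra, missing = diff_token_ids(reference_ids, candidate_ids)
print(extra)    # [6, 12]: ids suppressed only by the candidate list
print(missing)  # []: ids suppressed only by the reference list
```

In the reproduction, passing suppress_tokens as the reference and config.suppress_tokens as the candidate should report the extra ids without having to remove them by hand.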