Fix bad_words_ids not working with sentencepiece-based tokenizers #15343

Merged
patrickvonplaten merged 2 commits into huggingface:master from ngoquanghuy99:patch-1 on Jan 28, 2022

@ngoquanghuy99
Contributor

What does this PR do?

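This fixes a problem where models using sentencepiece-based tokenizers cannot block bad words during generation.
For sentencepiece-based tokenizers such as T5Tokenizer, add_special_tokens must be set to False when creating bad_words_ids from bad_words. Otherwise the tokenizer appends its end-of-sequence token to every entry, so the banned sequences no longer match the bare words.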
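
To illustrate why a trailing special token defeats the ban, here is a minimal pure-Python sketch. It is not the transformers implementation: `banned_next_tokens` is a hypothetical helper that mimics the usual matching rule (a banned sequence blocks its last token only once all earlier tokens have just been generated), and the word id `82` is made up. The end-of-sequence id `1` matches T5's vocabulary but should be treated as an assumption for other tokenizers.

```python
from typing import List

EOS_ID = 1    # T5's </s> id; an assumption for any other tokenizer
WORD_ID = 82  # made-up id standing in for a single-token bad word


def banned_next_tokens(generated: List[int], bad_words_ids: List[List[int]]) -> List[int]:
    """Return the token ids that must not be produced at the next step.

    Hypothetical helper: a banned sequence blocks its final token only when
    every token before it matches the tail of what was generated so far.
    """
    banned = []
    for seq in bad_words_ids:
        prefix, last = seq[:-1], seq[-1]
        if len(prefix) > len(generated):
            continue  # not enough history for this sequence to match
        if not prefix or generated[-len(prefix):] == prefix:
            banned.append(last)
    return banned


# What the tokenizer yields with and without the appended special token.
with_eos = [[WORD_ID, EOS_ID]]  # add_special_tokens=True (default)
without_eos = [[WORD_ID]]       # add_special_tokens=False

start = [0]  # decoder start token

# With EOS appended, the word itself is never blocked ...
blocked_with_eos = banned_next_tokens(start, with_eos)
# ... only EOS is blocked, and only after the word has already appeared.
blocked_after_word = banned_next_tokens(start + [WORD_ID], with_eos)
# Without EOS, the word is blocked from the first step.
blocked_without_eos = banned_next_tokens(start, without_eos)

print(blocked_with_eos, blocked_after_word, blocked_without_eos)
```

Under these assumptions, the entry `[WORD_ID, EOS_ID]` only ever prevents the model from ending right after the bad word, which is why the bad word still shows up in the output.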

Code to reproduce

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small", use_fast=False)
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# words that should not be generated
bad_words = ["my", "will", "My", "you", "are", "I", "You", "it"]
# get the bad word ids without special tokens, so no end-of-sequence id is appended
bad_words_ids = tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids

input_context = "You are my friend"
# encode input context
input_ids = tokenizer(input_context, return_tensors="pt").input_ids
outputs = model.generate(
    input_ids=input_ids,
    max_length=20,
    do_sample=True,
    bad_words_ids=bad_words_ids,
    num_return_sequences=3,
)
gen_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(gen_texts)

Who can review?

@patrickvonplaten @LysandreJik

@HuggingFaceDocBuilder

HuggingFaceDocBuilder commented Jan 26, 2022

The documentation is not available anymore as the PR was closed or merged.

@patrickvonplaten
Contributor

Hey @ngoquanghuy99,

That's a great fix - thanks a lot for diving into this :-)
Could you run make style once so that the check_code_quality test goes green?

ngoquanghuy99 changed the title from "Fix bad_word_ids not working with sentencepiece-based tokenizers" to "Fix bad_words_ids not working with sentencepiece-based tokenizers" on Jan 27, 2022
patrickvonplaten merged commit 8f5d62f into huggingface:master on Jan 28, 2022
ngoquanghuy99 deleted the patch-1 branch on January 28, 2022 at 17:47