
[Generate] Add conditional generation for multimodal models#22424

Merged
younesbelkada merged 2 commits into huggingface:main from younesbelkada:generate-fix-cond-generation
Mar 29, 2023

Conversation

@younesbelkada
Contributor

@younesbelkada commented Mar 28, 2023

Motivation

Some multi-modal models (specifically, image-to-text models) can perform better when conditional text is passed. Concretely, the input_ids created by _prepare_decoder_input_ids_for_generation are concatenated with the input_ids passed along model_kwargs.
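The merge itself is small; here is a minimal pure-Python sketch of the logic (lists stand in for tensors, and merge_conditional_ids is an illustrative name, not the actual transformers internal, which uses torch.cat):

```python
# Illustrative sketch: when the model's main input is pixel_values and the user
# also passed input_ids, append the conditional text ids to the decoder start ids.
def merge_conditional_ids(decoder_start_ids, model_kwargs, model_input_name):
    # conditional generation for multi-modal models (mirrors the PR's branch)
    if "input_ids" in model_kwargs and model_input_name == "pixel_values":
        return decoder_start_ids + model_kwargs.pop("input_ids"), model_kwargs
    return decoder_start_ids, model_kwargs

start = [50256]              # decoder start / BOS id (illustrative value)
cond = [271, 2939, 286]      # ids for a prompt like "an image of" (illustrative)
merged, kwargs = merge_conditional_ids(start, {"input_ids": cond}, "pixel_values")
print(merged)  # [50256, 271, 2939, 286]
```

Note that the popped input_ids are removed from model_kwargs so they are not forwarded to the model a second time.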

This PR adds support for this feature to VisionEncoderDecoderModel; concretely, the following script should now run without any problem:

import torch
import requests
from PIL import Image
from transformers import ViTFeatureExtractor, AutoTokenizer, VisionEncoderDecoderModel


loc = "ydshieh/vit-gpt2-coco-en"

feature_extractor = ViTFeatureExtractor.from_pretrained(loc)
tokenizer = AutoTokenizer.from_pretrained(loc)
model = VisionEncoderDecoderModel.from_pretrained(loc)
model.eval()


def predict(image, text):
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    with torch.no_grad():
        output_ids = model.generate(pixel_values, input_ids=input_ids, max_length=16, num_beams=4, return_dict_in_generate=True).sequences

    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]

    return preds


# We will verify our results on an image of cute cats
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
text = "an image of"
with Image.open(requests.get(url, stream=True).raw) as image:
    preds = predict(image, text)

print(preds)
# ['an image of two cats sleeping on a bed']

cc @gante

Related: #22423

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 28, 2023

The documentation is not available anymore as the PR was closed or merged.

Contributor

@gante left a comment

LGTM, but a question for potential simplification! :D

Comment on lines +1292 to +1294
# conditional generation for multi-modal models.
if "input_ids" in model_kwargs and model_input_name == "pixel_values":
    input_ids = torch.cat([input_ids, model_kwargs.pop("input_ids")], dim=-1)
Contributor

@gante commented Mar 29, 2023

Uhmmm this seems to be the same logic as below (input_ids = inputs_tensor if model_input_name == "input_ids" else model_kwargs.pop("input_ids")), which is applied on encoder-decoder models.

Perhaps instead of adding these lines, we can remove the else: below?

EDIT: missed that the line below depends on inputs_tensor, which makes fusing hard. Don't worry about it :)

@younesbelkada requested a review from sgugger March 29, 2023 11:21
Collaborator

@sgugger left a comment

Thanks for the fix!

@younesbelkada
Contributor Author

As this is slightly experimental, I ran the BLIP slow tests, which also include the conditional generation tests, and they all pass. Will merge!

@younesbelkada merged commit 8252e24 into huggingface:main Mar 29, 2023
@younesbelkada deleted the generate-fix-cond-generation branch March 29, 2023 13:35
raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
@cramraj8

cramraj8 commented Jul 11, 2023

Hi @younesbelkada, I get the following error during the training stage when providing the decoder_input_ids argument. Does this modification only work for the inference stage, or for training too?

-> 3029 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (18) to match target batch_size (512).

I used batch_size=2 and beam_size=10.

For a batch of examples (during training or inference), do the input_ids have to be padded to the same shape even though each example's prefix can be a different length?
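On the padding question, a common convention (an assumption here, not something this PR specifies) is to left-pad variable-length prefixes to a common width, so that generation continues from the right edge of every row. A toy sketch:

```python
PAD = 0  # illustrative pad token id

def left_pad(batch, pad_id=PAD):
    # pad every sequence on the left so all rows share the longest length
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

print(left_pad([[271, 2939, 286], [257, 2495, 2415, 287]]))
# [[0, 271, 2939, 286], [257, 2495, 2415, 287]]
```

A real batched setup would also mask the pad positions via the attention mask.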

Does input_ids = tokenizer(text, return_tensors="pt").input_ids have to exclude special tokens by passing add_special_tokens=False?
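As a toy illustration of why add_special_tokens=False can matter (this uses a made-up tokenizer, not the real one): if generate() already supplies the decoder start token, a tokenizer that also prepends BOS would duplicate it in the merged sequence:

```python
BOS = 50256  # illustrative BOS / decoder-start id

def toy_tokenize(ids, add_special_tokens=True):
    # mimic a tokenizer that prepends BOS unless told otherwise
    return [BOS] + list(ids) if add_special_tokens else list(ids)

decoder_start = [BOS]
print(decoder_start + toy_tokenize([271, 2939, 286]))
# [50256, 50256, 271, 2939, 286]  <- BOS appears twice
print(decoder_start + toy_tokenize([271, 2939, 286], add_special_tokens=False))
# [50256, 271, 2939, 286]
```

Whether the duplication actually hurts depends on the model and tokenizer at hand.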

@cramraj8

@younesbelkada @gante @sgugger When I inspected the intermediate outputs during training, decoder_input_ids has shape [2, 8] and logits has shape [2, 8, 64002], where the batch size is 2 and the prefix length is 8. It looks like decoder_outputs = self.decoder() is not predicting anything beyond the prefix given in decoder_input_ids.
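For reference, a generic shape-check sketch (not specific to this model): sequence cross-entropy is typically computed after flattening, so the flattened logits and labels must agree in their first dimension, and a mismatch there produces exactly the kind of "Expected input batch_size (...) to match target batch_size (...)" error quoted above:

```python
def flattened_sizes(logits_shape, labels_shape):
    # logits: (batch, seq_len, vocab) -> (batch * seq_len, vocab)
    # labels: (batch, seq_len)        -> (batch * seq_len,)
    b, t, _vocab = logits_shape
    return b * t, labels_shape[0] * labels_shape[1]

print(flattened_sizes((2, 8, 64002), (2, 8)))  # (16, 16) -> shapes line up
```

When the two numbers differ, the logits cover a different number of positions than the labels, usually because the labels still include positions the decoder never produced logits for.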

@gante
Contributor

gante commented Jul 11, 2023

Hey @cramraj8 -- would you be able to open a new issue, containing a short self-contained script so we can reproduce it? :)
