
[pipeline] Add conditional text support for ImageToTextPipeline#22423

Closed
younesbelkada wants to merge 10 commits into huggingface:main from younesbelkada:add-pix2struct-pipeline

Conversation

@younesbelkada
Contributor

What does this PR do?

This PR aims to add conditional-generation support to image-to-text models: guiding the model with a text prompt can sometimes yield better results.
It also adds Pix2Struct to the models supported by ImageToTextPipeline. Since most Pix2Struct models use conditional generation (the VQA models), and since a single class, Pix2StructForConditionalGeneration, wraps both the VQA models and the image-captioning models, the simplest solution seemed to be adding conditional text support directly to ImageToTextPipeline.
A separate Pix2StructForVQA class is not implemented because this model renders the question directly onto the image instead of feeding the text input into the model. Hence a single Pix2StructForConditionalGeneration class suffices, and the changes happen on the processor side, which takes care of rendering the text on the image.
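The intended I/O change can be sketched with a stub. All names below are illustrative stand-ins for the proposed API, not the actual transformers implementation, which runs a real model:

```python
# Hypothetical sketch of the proposed pipeline I/O: (image) -> text, with an
# optional text prompt. A stub, not the transformers implementation.

def image_to_text(image, text=None):
    """Stub pipeline: routes to conditional or unconditional generation."""
    if text is None:
        # classic image captioning, no prompt
        return {"generated_text": f"a caption for {image}"}
    # prompt-conditioned generation; for Pix2Struct the processor would
    # render `text` onto the image rather than tokenize it separately
    return {"generated_text": f"generated from {image} given '{text}'"}

print(image_to_text("photo.png"))
print(image_to_text("photo.png", text="What is shown?"))
```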

cc @NielsRogge @Narsil

@younesbelkada younesbelkada changed the title [pipeline] Add conditional text support [pipeline] Add conditional text support for ImageToTextPipeline Mar 28, 2023
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 28, 2023

The documentation is not available anymore as the PR was closed or merged.

@Narsil
Contributor

Narsil commented Mar 28, 2023

I have doubts about this:

My biggest issue is with the expected I/O.

Pipelines are defined by their I/O. In this particular case it's (image) -> (text),
and here we're modifying it to (image, text) -> (text).

This is OK if and only if the extra text is always purely accessory, which doesn't seem to be the case.

  • Can BLIP work without a prompt? If not, then it does not respect the pipeline I/O and cannot be used.

  • All the swap logic seems highly specific and not a great way to handle this (inspecting signatures is bad in general).

    • For instance, for models that do NOT handle extra text, the pipeline is going to start generating errors.
  • padding and truncation cannot be top-level parameters; we need to use tokenizer_kwargs instead. The reason is that padding and truncation could also mean things for images, and to avoid confusion it's best to split them altogether.

In general, I think we can reduce the complexity added in this PR a lot by removing many of the introduced ifs.

For reference, the prompt reminds me of hypothesis_template in the zero-shot-classification pipeline. If a good, sane default exists which can alleviate the I/O issue, then this becomes a bit better.
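The hypothesis_template mechanism mentioned above boils down to an optional format string with a sane default, so callers who omit it still get valid input. A simplified sketch (not the actual transformers code):

```python
# Simplified sketch of a template with a sane default: the argument is 100%
# optional, so the pipeline's I/O contract is unchanged for plain callers.

DEFAULT_TEMPLATE = "This example is {}."

def build_hypotheses(labels, hypothesis_template=DEFAULT_TEMPLATE):
    # One hypothesis sentence per candidate label.
    return [hypothesis_template.format(label) for label in labels]

print(build_hypotheses(["sports", "politics"]))
```

An optional-with-default prompt of this shape is what would keep the image-to-text I/O intact.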

@younesbelkada
Contributor Author

younesbelkada commented Mar 28, 2023

Thanks for your comments! To reply to some of your doubts:

Can BLIP work without a prompt? If not, then it does not respect the pipeline I/O and cannot be used.

Definitely yes. What I meant is that text is always accessory for most models (BLIP, Pix2Struct trained on image captioning). However, some Pix2Struct models trained on VQA need text inputs, and those text inputs are handled in an unusual way (the question is rendered directly on the image).

All the swap logic seems highly specific and not really great ways to handle the logic (inspecting signature is bad in general).

Agreed on that; I have updated the checking logic for Pix2Struct. I also realized that vision-encoder-decoder models do not support conditional generation. However, I believe there is a fix I can quickly address in another PR; if that PR gets merged, this pipeline would support text-conditioned image-to-text inference for all models. (EDIT: #22424)

padding and truncation cannot be top-level parameters, we need to use tokenizer_kwargs instead. The reason is that padding and truncation could mean thing towards images too, and to avoid confusion it's best splitting them altogether.

Agreed, this has been removed.

In general I think we can reduce the complexity added in the PR a lot by removing a lot of introduced ifs.

I had to come up with this because the CI tests handle multiple input types (lists of images, generators, etc.). I'll do my best to refactor this and make things simpler.

Comment on lines +110 to +124
model_inputs = None
if isinstance(images, list) and texts is not None and isinstance(texts, list):
    model_inputs = [{"images": image, "texts": texts} for image, texts in zip(images, texts)]
elif isinstance(images, list) and texts is not None:
    # same text on all images
    model_inputs = [{"images": image, "texts": texts} for image in images]
elif isinstance(images, list) and texts is None:
    # no text
    model_inputs = [{"images": image} for image in images]
elif not isinstance(images, list) and texts is None:
    # classic input with only images
    model_inputs = images
elif not isinstance(images, list) and texts is not None:
    # classic input with images and text
    model_inputs = {"images": images, "texts": texts}
Contributor

This is intrinsically bad code.

Any code looking like this in other pipelines was added for legacy purposes, to stay backward compatible.
Adding this type of code is inherently wrong IMO.

Imagine you are not allowed to inspect anything coming your way. You need to trust the base class to unpack everything for you, and preprocess will receive a single image. That's it.
No preprocessing can happen here.

That means text cannot be an actual argument of the pipeline (as mentioned in the previous comment, the only potentially satisfactory path is more like a glorified, 100% optional parameter). Sending different texts in one given call should be purely disallowed IMO (because of all the hoops you have to go through here).

And I say that without even starting to mention all the missing size checks everywhere, and the fact that the output of this path is not even a single type. (There's at the very least a dict vs. list of dicts, and most likely an entire array of potential nullables everywhere.)
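The "trust the base class to unpack everything" point can be sketched with a toy base class (hypothetical names, much simpler than the real transformers Pipeline): the base class fans out lists, so preprocess only ever sees one item and needs no isinstance checks:

```python
# Toy sketch: the base class unpacks lists, so preprocess receives ONE item.
class BasePipeline:
    def __call__(self, inputs, **params):
        if isinstance(inputs, list):
            return [self._run_single(item, **params) for item in inputs]
        return self._run_single(inputs, **params)

    def _run_single(self, item, **params):
        return self.postprocess(self.forward(self.preprocess(item, **params)))

class ToyImageToText(BasePipeline):
    def preprocess(self, image, prompt=None):
        # No type inspection needed: `image` is always a single item here.
        return {"image": image, "prompt": prompt}

    def forward(self, model_inputs):
        return model_inputs  # a real pipeline would run the model here

    def postprocess(self, outputs):
        return {"generated_text": f"caption:{outputs['image']}"}

pipe = ToyImageToText()
print(pipe("a.png"))
print(pipe(["a.png", "b.png"]))
```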

Contributor Author

Thanks, after thinking about it, that makes sense. I had to come up with these hacks to deal with different configurations (mainly multiple images + text), which should not be supported, as you are suggesting. I proposed some changes in 4f08c05 & ed4a69d

Contributor

@Narsil Narsil left a comment

Looks much better to me now !

What do you think ?

def preprocess(self, images, texts=None):
    images = load_image(images)

    if texts is not None and isinstance(texts, list) and len(texts) > 1:
Contributor

Call it just text (without the s) and just check that it's a str?

In terms of names, text is not highly explicit about intent. Some alternatives:

  • prompt
  • preprompt
  • output_prefix

If none are better, feel free to ignore.

Contributor Author

@younesbelkada younesbelkada Mar 28, 2023

I would say conditional_prompt would be nice! EDIT: I prefer preprompt

Sounds very good for the first check, let's only check that it's a str
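The agreed-upon check (accept a single str or None, reject anything else) could look like this sketch; the helper name is hypothetical, not the PR's actual code:

```python
# Hypothetical validation helper: the prompt is a single optional string,
# and lists are rejected outright, keeping the pipeline's I/O contract simple.

def validate_prompt(prompt):
    if prompt is not None and not isinstance(prompt, str):
        raise TypeError(f"prompt must be a str, got {type(prompt).__name__}")
    return prompt

print(validate_prompt("a photography of"))
try:
    validate_prompt(["a", "b"])  # lists are disallowed
except TypeError as exc:
    print("rejected:", exc)
```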

@@ -175,11 +172,8 @@ def test_conditional_generation_pt_pix2struct(self):
outputs = pipe(image, text)
Contributor

This should be an error; parameters need to be passed as named arguments, I think.

text = "a photography of"

outputs = pipe(image, text)
outputs = pipe(image, texts=[text])
Contributor

Let's just disallow the list.

Contributor Author

Fixed in 7fc991f

)
forward_kwargs["generate_kwargs"]["max_new_tokens"] = max_new_tokens
if texts is not None:
    preprocess_params["texts"] = texts
Contributor

NIT: Let's keep preprocess before forward before postprocess (just for the reader's sanity)

Contributor Author

Should be fixed in 7fc991f!

@younesbelkada younesbelkada requested a review from Narsil March 28, 2023 13:01
def preprocess(self, image):
    image = load_image(image)
    model_inputs = self.image_processor(images=image, return_tensors=self.framework)
def preprocess(self, images, preprompt=None):
Contributor

Suggested change
def preprocess(self, images, preprompt=None):
def preprocess(self, images, prompt=None):

Any reason this is called preprompt instead of prompt?

Contributor

@NielsRogge NielsRogge left a comment

I'm wondering how we'll deal with models (like Pix2Struct) which solve captioning and VQA using the same model. We also have a separate VQA pipeline whose API is (image, question) -> answer. Shouldn't we add the question rendered on the image in the VQA pipeline instead of the image-to-text pipeline?

Comment on lines +111 to +112
if preprompt is not None and self.model.config.model_type == "pix2struct":
    model_inputs = self.image_processor(images=images, header_text=preprompt, return_tensors=self.framework)
Contributor

@NielsRogge NielsRogge Mar 31, 2023

To elaborate on comment above, we could probably add this logic to the VQA pipeline (assuming rendering text on images is only done for VQA)?

@NielsRogge
Contributor

@Narsil what's your opinion on my comment above? e.g. Pix2StructForConditionalGeneration solves both image captioning and VQA with the same model, using the same approach. You can, in addition to the input image, also feed a text prompt to either 1) guide the captioning 2) ask a question related to the image. In both cases, the model renders the prompt on top of the image to make the prediction.

Should we add Pix2StructForConditionalGeneration to both the image-to-text and VQA pipelines?

@Narsil
Contributor

Narsil commented Apr 24, 2023

Is there any difference in the code between VQA and captioning?

In general, pipelines are defined by I/O (input/output, meaning (image, text) -> (text)). The rest is semantics, naming, and goals.

For instance NER and token-classification are the same, and alias each other.
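Task aliasing of the kind described here amounts to two registry keys pointing at the same implementation. A toy sketch (illustrative names, not the actual transformers task registry):

```python
# Toy sketch of task aliasing: "ner" and "token-classification" resolve to
# the same pipeline class, so they are one pipeline under two names.

class TokenClassificationPipeline:
    task = "token-classification"

TASK_REGISTRY = {"token-classification": TokenClassificationPipeline}
TASK_REGISTRY["ner"] = TASK_REGISTRY["token-classification"]  # alias

print(TASK_REGISTRY["ner"] is TASK_REGISTRY["token-classification"])
```

Under this view, whether Pix2Struct appears under "image-to-text", "vqa", or both is a naming decision, not a code one.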

@NielsRogge
Contributor

NielsRogge commented Apr 24, 2023

Is there any difference in the code between VQA and captioning?

For models like BLIP-2 and Pix2Struct, the code is identical. For BLIP, on the other hand, two different models are defined, but other than that the code is also identical (correct me if I'm wrong @younesbelkada).

I think adding an optional text prompt to the ImageToTextPipeline makes sense; however, I wonder whether that doesn't make the VQA pipeline obsolete for those models.

@Narsil
Contributor

Narsil commented Apr 24, 2023

however I wonder if that doesn't make the VQA pipeline obsolete for those models

Why does it? You said above that both were correct? Did I misunderstand something?

@NielsRogge
Contributor

Yes, we can technically add them to both pipelines, if you are fine with that.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@younesbelkada
Contributor Author

Closing in favor of #23362

@cramraj8

cramraj8 commented Jul 12, 2023

@younesbelkada @NielsRogge I wonder whether this conditional text support for the ImageToTextPipeline models enables only the inference stage, or training as well?
