
[pipeline] Add conditional text support for ImageToTextPipeline#22423

Closed
younesbelkada wants to merge 10 commits into huggingface:main from younesbelkada:add-pix2struct-pipeline

Conversation

@younesbelkada
Contributor

What does this PR do?

This PR aims to add conditional-generation support to image-to-text models: guiding the model with a text prompt can sometimes yield better results.
It also adds Pix2Struct to the models supported by ImageToTextPipeline. Since most Pix2Struct models use conditional generation (the VQA models), and since a single class, Pix2StructForConditionalGeneration, wraps both the VQA models and the image-captioning models, the simplest solution seemed to be adding conditional text support directly to ImageToTextPipeline.
A separate Pix2StructForVQA class is not implemented because this model renders the question directly onto the image instead of feeding the text input into the model. Hence a single Pix2StructForConditionalGeneration class suffices, and the changes happen on the processor side, which takes care of rendering the text on the image.
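The intended I/O change can be sketched with a stub. All names below are illustrative stand-ins for the proposed API, not the actual transformers implementation, which runs a real model:

```python
# Hypothetical sketch of the proposed pipeline I/O: (image) -> text, with an
# optional text prompt. A stub, not the transformers implementation.

def image_to_text(image, text=None):
    """Stub pipeline: routes to conditional or unconditional generation."""
    if text is None:
        # classic image captioning, no prompt
        return {"generated_text": f"a caption for {image}"}
    # prompt-conditioned generation; for Pix2Struct the processor would
    # render `text` onto the image rather than tokenize it separately
    return {"generated_text": f"generated from {image} given '{text}'"}

print(image_to_text("photo.png"))
print(image_to_text("photo.png", text="What is shown?"))
```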

cc @NielsRogge @Narsil

@younesbelkada younesbelkada changed the title [pipeline] Add conditional text support [pipeline] Add conditional text support for ImageToTextPipeline Mar 28, 2023
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 28, 2023

The documentation is not available anymore as the PR was closed or merged.

@Narsil
Contributor

Narsil commented Mar 28, 2023

I have doubts about this:

My biggest issue is with the expected I/O.

Pipelines are defined by their I/O. In this particular case it's (image) -> (text),
and here we're modifying it to (image, text) -> (text).

This is OK if and only if the extra text is always purely accessory, which doesn't seem to be the case.

  • Can BLIP work without a prompt? If not, then it does not respect the pipeline I/O and cannot be used.

  • All the swap logic seems highly specific and not a great way to handle this (inspecting signatures is bad in general).

    • For instance, for models that do NOT handle extra text, the pipeline is going to start generating errors.
  • padding and truncation cannot be top-level parameters; we need to use tokenizer_kwargs instead. The reason is that padding and truncation could also mean things for images, and to avoid confusion it's best to split them altogether.

In general, I think we can reduce the complexity added in this PR a lot by removing many of the introduced ifs.

For reference, the prompt reminds me of hypothesis_template in the zero-shot-classification pipeline. If a good, sane default exists which can alleviate the I/O issue, then this becomes a bit better.
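The hypothesis_template mechanism mentioned above boils down to an optional format string with a sane default, so callers who omit it still get valid input. A simplified sketch (not the actual transformers code):

```python
# Simplified sketch of a template with a sane default: the argument is 100%
# optional, so the pipeline's I/O contract is unchanged for plain callers.

DEFAULT_TEMPLATE = "This example is {}."

def build_hypotheses(labels, hypothesis_template=DEFAULT_TEMPLATE):
    # One hypothesis sentence per candidate label.
    return [hypothesis_template.format(label) for label in labels]

print(build_hypotheses(["sports", "politics"]))
```

An optional-with-default prompt of this shape is what would keep the image-to-text I/O intact.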

@younesbelkada
Contributor Author

younesbelkada commented Mar 28, 2023

Thanks for your comments! To reply to some of your doubts:

Can BLIP work without a prompt? If not, then it does not respect the pipeline I/O and cannot be used.

Definitely yes. What I meant is that text is always accessory for most models (BLIP, Pix2Struct trained on image captioning). However, some Pix2Struct models trained on VQA need text inputs, and those text inputs are handled in an unusual way (the question is rendered directly on the image).

All the swap logic seems highly specific and not really great ways to handle the logic (inspecting signature is bad in general).

Agreed on that; I have updated the checking logic for Pix2Struct. I also realized that vision-encoder-decoder models do not support conditional generation. However, I believe there is a fix I can quickly address in another PR; if that PR gets merged, this pipeline would support text-conditioned image-to-text inference for all models. (EDIT: #22424)

padding and truncation cannot be top-level parameters, we need to use tokenizer_kwargs instead. The reason is that padding and truncation could mean thing towards images too, and to avoid confusion it's best splitting them altogether.

Agreed, this has been removed.

In general I think we can reduce the complexity added in the PR a lot by removing a lot of introduced ifs.

I had to come up with this because the CI tests handle multiple input types (lists of images, generators, etc.). I'll do my best to refactor this and make things simpler.

Comment on lines +110 to +124
model_inputs = None
if isinstance(images, list) and texts is not None and isinstance(texts, list):
    model_inputs = [{"images": image, "texts": texts} for image, texts in zip(images, texts)]
elif isinstance(images, list) and texts is not None:
    # same text on all images
    model_inputs = [{"images": image, "texts": texts} for image in images]
elif isinstance(images, list) and texts is None:
    # no text
    model_inputs = [{"images": image} for image in images]
elif not isinstance(images, list) and texts is None:
    # classic input with only images
    model_inputs = images
elif not isinstance(images, list) and texts is not None:
    # classic input with images and text
    model_inputs = {"images": images, "texts": texts}
Contributor

This is intrinsically bad code.

Any code looking like this in other pipelines was added for legacy purposes, to stay backward compatible.
Adding this type of code is inherently wrong IMO.

Imagine you are not allowed to inspect anything coming your way. You need to trust the base class to unpack everything for you, and preprocess will receive a single image. That's it.
No preprocessing can happen here.

That means text cannot be an actual argument of the pipeline (as mentioned in the previous comment, the only potentially satisfactory path is more like a glorified, 100% optional parameter). Sending different texts in one given call should be purely disallowed IMO (because of all the hoops you have to go through here).

And I say that without even starting to mention all the missing size checks everywhere, and the fact that the output of this path is not even a single type. (There's at the very least a dict vs. list of dicts, and most likely an entire array of potential nullables everywhere.)
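The "trust the base class to unpack everything" point can be sketched with a toy base class (hypothetical names, much simpler than the real transformers Pipeline): the base class fans out lists, so preprocess only ever sees one item and needs no isinstance checks:

```python
# Toy sketch: the base class unpacks lists, so preprocess receives ONE item.
class BasePipeline:
    def __call__(self, inputs, **params):
        if isinstance(inputs, list):
            return [self._run_single(item, **params) for item in inputs]
        return self._run_single(inputs, **params)

    def _run_single(self, item, **params):
        return self.postprocess(self.forward(self.preprocess(item, **params)))

class ToyImageToText(BasePipeline):
    def preprocess(self, image, prompt=None):
        # No type inspection needed: `image` is always a single item here.
        return {"image": image, "prompt": prompt}

    def forward(self, model_inputs):
        return model_inputs  # a real pipeline would run the model here

    def postprocess(self, outputs):
        return {"generated_text": f"caption:{outputs['image']}"}

pipe = ToyImageToText()
print(pipe("a.png"))
print(pipe(["a.png", "b.png"]))
```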

Contributor Author

Thanks, after thinking about it, that makes sense. I had to come up with these hacks to deal with different configurations (mainly multiple images + text), which should not be supported, as you are suggesting. I proposed some changes in 4f08c05 & ed4a69d

Contributor

@Narsil Narsil left a comment

Looks much better to me now !

What do you think ?

def preprocess(self, images, texts=None):
    images = load_image(images)

    if texts is not None and isinstance(texts, list) and len(texts) > 1:
Contributor

Call it just text (without the s) and just check that it's a str?

In terms of names, text is not highly explicit about intent. Some alternatives:

  • prompt
  • preprompt
  • output_prefix

If none are better, feel free to ignore.

Contributor Author

@younesbelkada younesbelkada Mar 28, 2023

I would say conditional_prompt would be nice! EDIT: I prefer preprompt

Sounds very good for the first check, let's only check that it's a str
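The agreed-upon check (accept a single str or None, reject anything else) could look like this sketch; the helper name is hypothetical, not the PR's actual code:

```python
# Hypothetical validation helper: the prompt is a single optional string,
# and lists are rejected outright, keeping the pipeline's I/O contract simple.

def validate_prompt(prompt):
    if prompt is not None and not isinstance(prompt, str):
        raise TypeError(f"prompt must be a str, got {type(prompt).__name__}")
    return prompt

print(validate_prompt("a photography of"))
try:
    validate_prompt(["a", "b"])  # lists are disallowed
except TypeError as exc:
    print("rejected:", exc)
```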

@@ -175,11 +172,8 @@ def test_conditional_generation_pt_pix2struct(self):
outputs = pipe(image, text)
Contributor

This should be an error; parameters need to be passed as named arguments, I think.

text = "a photography of"

outputs = pipe(image, text)
outputs = pipe(image, texts=[text])
Contributor

Let's just disallow the list.

Contributor Author

Fixed in 7fc991f

)
forward_kwargs["generate_kwargs"]["max_new_tokens"] = max_new_tokens
if texts is not None:
    preprocess_params["texts"] = texts
Contributor

NIT: Let's keep preprocess before forward before postprocess (just for the reader's sanity)

Contributor Author

Should be fixed in 7fc991f!

@younesbelkada younesbelkada requested a review from Narsil March 28, 2023 13:01
def preprocess(self, image):
    image = load_image(image)
    model_inputs = self.image_processor(images=image, return_tensors=self.framework)
def preprocess(self, images, preprompt=None):
Contributor

Suggested change
def preprocess(self, images, preprompt=None):
def preprocess(self, images, prompt=None):

Any reason this is called preprompt instead of prompt?

Contributor

@NielsRogge NielsRogge left a comment

I'm wondering how we'll deal with models (like Pix2Struct) which solve captioning and VQA using the same model. We also have a separate VQA pipeline whose API is (image, question) -> answer. Shouldn't we add the question rendered on the image in the VQA pipeline instead of the image-to-text pipeline?

Comment on lines +111 to +112
if preprompt is not None and self.model.config.model_type == "pix2struct":
    model_inputs = self.image_processor(images=images, header_text=preprompt, return_tensors=self.framework)
Contributor

@NielsRogge NielsRogge Mar 31, 2023

To elaborate on comment above, we could probably add this logic to the VQA pipeline (assuming rendering text on images is only done for VQA)?

@NielsRogge
Contributor

@Narsil what's your opinion on my comment above? e.g. Pix2StructForConditionalGeneration solves both image captioning and VQA with the same model, using the same approach. You can, in addition to the input image, also feed a text prompt to either 1) guide the captioning 2) ask a question related to the image. In both cases, the model renders the prompt on top of the image to make the prediction.

Should we add Pix2StructForConditionalGeneration to both the image-to-text and VQA pipelines?

@Narsil
Contributor

Narsil commented Apr 24, 2023

Is there any difference in the code between VQA and captioning?

In general, pipelines are defined by I/O (input/output, meaning (image, text) -> (text)). The rest is semantics, naming, and goals.

For instance NER and token-classification are the same, and alias each other.
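Task aliasing of the kind described here amounts to two registry keys pointing at the same implementation. A toy sketch (illustrative names, not the actual transformers task registry):

```python
# Toy sketch of task aliasing: "ner" and "token-classification" resolve to
# the same pipeline class, so they are one pipeline under two names.

class TokenClassificationPipeline:
    task = "token-classification"

TASK_REGISTRY = {"token-classification": TokenClassificationPipeline}
TASK_REGISTRY["ner"] = TASK_REGISTRY["token-classification"]  # alias

print(TASK_REGISTRY["ner"] is TASK_REGISTRY["token-classification"])
```

Under this view, whether Pix2Struct appears under "image-to-text", "vqa", or both is a naming decision, not a code one.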

@NielsRogge
Contributor

NielsRogge commented Apr 24, 2023

Is there any difference in the code between VQA and captioning?

For models like BLIP-2 and Pix2Struct, the code is identical. For BLIP, on the other hand, two different models are defined, but other than that the code is also identical (correct me if I'm wrong @younesbelkada).

I think adding an optional text prompt to the ImageToTextPipeline makes sense; however, I wonder whether that doesn't make the VQA pipeline obsolete for those models.

@Narsil
Contributor

Narsil commented Apr 24, 2023

however I wonder if that doesn't make the VQA pipeline obsolete for those models

Why does it? You said above that both were correct? Did I misunderstand something?

@NielsRogge
Contributor

Yes, we can technically add them to both pipelines, if you are fine with that.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@younesbelkada
Contributor Author

Closing in favor of #23362

@cramraj8

cramraj8 commented Jul 12, 2023

@younesbelkada @NielsRogge I wonder whether this conditional text support for the ImageToTextPipeline models enables only the inference stage, or training as well?
