[pipeline] Add conditional text support for ImageToTextPipeline #22423

younesbelkada wants to merge 10 commits into huggingface:main

Conversation
The documentation is not available anymore as the PR was closed or merged.
I have doubts about this: my biggest issue is with the expected I/O. Pipelines are defined by I/O; in this particular case it's (image) -> (text). This is OK if and only if the extra text is always purely accessory, which doesn't seem to be the case.

In general, I think we can reduce the complexity added in this PR a lot by removing many of the introduced ifs. For reference, the prompt reminds me of
Thanks for your comments! To reply to some of your doubts:

- Definitely yes. What I meant here is that text is always accessory for most of the models (BLIP, Pix2Struct trained on image captioning). However, some Pix2Struct models trained on VQA need text inputs, but the text inputs are dealt with in an unusual way (the question is directly rendered on the image).
- Agreed on that, I have updated the checking logic for pix2struct. I also realized that
- Agreed, this has been removed.
- I had to come up with this as the CI tests handle multiple input types (list of images, generators, etc.). I'll do my best to refactor this to make things simpler.
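To make the "question rendered on the image" point concrete, here is a conceptual sketch (hypothetical dict-based types, not the real Pix2Struct processor code) of the idea: the VQA question is not tokenized and fed to the model; the processor renders it as a text header on top of the image, so the model itself only ever sees pixels.

```python
# Conceptual sketch only: images are modeled as plain dicts, and the
# "rendered" header text is stored as a field instead of actual pixels.
def render_question_on_image(image, question, header_height=40):
    """Return a new 'image' with the question drawn in a header strip on top."""
    return {
        "width": image["width"],
        "height": image["height"] + header_height,  # header strip added on top
        "header_text": question,  # stands in for the rendered pixels
    }

image = {"width": 224, "height": 224}
vqa_input = render_question_on_image(image, "How many dogs are there?")
# The downstream model receives only this augmented image, no input_ids.
```

This is why a single `Pix2StructForConditionalGeneration` class can serve both captioning and VQA: the difference lives entirely in the processor.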
```python
model_inputs = None
if isinstance(images, list) and texts is not None and isinstance(texts, list):
    model_inputs = [{"images": image, "texts": texts} for image, texts in zip(images, texts)]
elif isinstance(images, list) and texts is not None:
    # same text on all images
    model_inputs = [{"images": image, "texts": texts} for image in images]
elif isinstance(images, list) and texts is None:
    # no text
    model_inputs = [{"images": image} for image in images]
elif not isinstance(images, list) and texts is None:
    # classic input with only images
    model_inputs = images
elif not isinstance(images, list) and texts is not None:
    # classic input with images and text
    model_inputs = {"images": images, "texts": texts}
```
This is intrinsically bad code.
Any code looking like this in other pipelines was added for legacy purposes and backward compatibility.
Adding this type of code is inherently wrong IMO.
Imagine you are not allowed to inspect anything coming your way. You need to trust the base class to unpack everything for you, and preprocess will receive a single image. That's it.
No preprocessing can happen here.
That means text cannot be an actual argument of the pipeline (as mentioned in the previous comment, the only potentially satisfactory path is more like a glorified parameter being 100% optional). Sending different texts on one given call should be purely disallowed IMO (because of all the hoops you have to go through here).
And I say that without even starting to mention all the missing checks everywhere for sizes, and the fact that the output of this path is not even a single type. (There's at the very least a dict vs. a list of dicts, and most likely an entire array of potential nullables everywhere.)
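The contract described here could be sketched like this. It is a toy illustration with a hypothetical `TinyPipeline` class (not the actual transformers `Pipeline` base class): the base class unpacks lists, so `preprocess` only ever receives a single image and never inspects input types.

```python
# Toy sketch: the base class owns all batching logic; preprocess sees one item.
class TinyPipeline:
    def __call__(self, inputs, **params):
        # The base class handles the list case once, here, and nowhere else.
        if isinstance(inputs, list):
            return [self._run_single(item, **params) for item in inputs]
        return self._run_single(inputs, **params)

    def _run_single(self, image, text=None):
        # forward/postprocess omitted; only the preprocess contract matters here.
        return self.preprocess(image, text=text)

    def preprocess(self, image, text=None):
        # Receives exactly one image; text is a purely optional parameter.
        return {"image": image, "text": text}
```

With this shape, `pipe(["a.png", "b.png"], text="a photo of")` applies the same optional text to every image, and `preprocess` needs no `isinstance` checks at all.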
Thanks, after thinking about it, that makes sense. I had to come up with these hacks to deal with different configurations (mainly multiple images + text), which should not be supported, as you are suggesting. I proposed some changes in 4f08c05 & ed4a69d.
Narsil
left a comment
Looks much better to me now!
What do you think?
```python
def preprocess(self, images, texts=None):
    images = load_image(images)

    if texts is not None and isinstance(texts, list) and len(texts) > 1:
```
Call it just `text` (without the s) and just check it's a `str`?
In terms of names, `text` is not highly explicit in terms of intent. Some alternatives: `prompt`, `preprompt`, `output_prefix`?
If none are better, feel free to ignore.
I would say `conditional_prompt` would be nice! EDIT: I prefer `preprompt`.
Sounds very good. For the first check, let's only check that it's a `str`.
```diff
@@ -175,11 +172,8 @@ def test_conditional_generation_pt_pix2struct(self):
     outputs = pipe(image, text)
```
This should be an error; parameters need to be passed as named arguments, I think.
```python
text = "a photography of"

outputs = pipe(image, text)
outputs = pipe(image, texts=[text])
```
Let's just disallow the list.
```python
)
forward_kwargs["generate_kwargs"]["max_new_tokens"] = max_new_tokens
if texts is not None:
    preprocess_params["texts"] = texts
```
NIT: Let's keep preprocess before forward before postprocess (just for the reader's sanity).
```python
def preprocess(self, image):
    image = load_image(image)
    model_inputs = self.image_processor(images=image, return_tensors=self.framework)
def preprocess(self, images, preprompt=None):
```

Suggested change:

```python
def preprocess(self, images, prompt=None):
```
Any reason this is called `preprompt` instead of `prompt`?
NielsRogge
left a comment
I'm wondering how we'll deal with models (like Pix2Struct) which solve captioning and VQA using the same model. We also have a separate VQA pipeline whose API is (image, question) -> answer. Shouldn't we add the question rendered on the image in the VQA pipeline instead of the image-to-text pipeline?
```python
if preprompt is not None and self.model.config.model_type == "pix2struct":
    model_inputs = self.image_processor(images=images, header_text=preprompt, return_tensors=self.framework)
```
To elaborate on comment above, we could probably add this logic to the VQA pipeline (assuming rendering text on images is only done for VQA)?
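The model-type dispatch in question could be sketched as below, with stand-in callables for the real `self.image_processor` / `self.tokenizer` (the function name and the toy callables are assumptions, not PR code). Pix2Struct has the prompt rendered onto the image via `header_text`; other models would instead tokenize the prompt as generation input.

```python
def build_model_inputs(model_type, image, prompt, image_processor, tokenizer):
    if prompt is not None and model_type == "pix2struct":
        # Prompt becomes pixels: rendered as a header on top of the image.
        return image_processor(images=image, header_text=prompt)
    model_inputs = image_processor(images=image)
    if prompt is not None:
        # Prompt stays text: fed to the model as token ids.
        model_inputs["input_ids"] = tokenizer(prompt)
    return model_inputs

# Toy stand-ins to show the two paths:
def fake_image_processor(images, header_text=None):
    out = {"pixel_values": images}
    if header_text is not None:
        out["header_text"] = header_text
    return out

def fake_tokenizer(text):
    return list(range(len(text.split())))

pix2struct_inputs = build_model_inputs("pix2struct", "img", "How many dogs?", fake_image_processor, fake_tokenizer)
blip_inputs = build_model_inputs("blip", "img", "a photography of", fake_image_processor, fake_tokenizer)
```

Because only the first branch is VQA-specific, it is the part one could imagine moving into the VQA pipeline instead.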
@Narsil what's your opinion on my comment above? e.g. Should we add
Is there any difference in the code between VQA and captioning? In general, pipelines are defined by I/O (input/output, meaning (image, text) -> (text)). The rest is semantics, naming and goals. For instance, NER and token-classification are the same, and alias each other.
For models like BLIP-2 and Pix2Struct, the code is identical. For BLIP on the other hand, 2 different models are defined, but other than that the code is also identical (correct me if I'm wrong @younesbelkada). I think adding an optional text prompt to the
Why does it? You said above that both were correct? Did I misunderstand something?
Yes, we can technically add them to both pipelines, if you are fine with that.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing in favor of #23362
@younesbelkada @NielsRogge I wonder whether this conditional text support for the ImageToTextPipeline models enables only inference, or training as well?
What does this PR do?
This PR aims to add conditional generation support for image-to-text models. Sometimes, if you guide the model with a prompt, you can achieve better results.

This PR also adds `pix2struct` to the supported models for `ImageToTextPipeline`. As most pix2struct models use conditional generation (for VQA models), and as we have a single class `Pix2StructForConditionalGeneration` that wraps both the VQA models and the image captioning models, I thought the best solution would be to simply add conditional text support to `ImageToTextPipeline`.

The reason why a `Pix2StructForVQA` class is not implemented is that this model renders the question directly on the image, instead of feeding the text input into the model. Hence a single class `Pix2StructForConditionalGeneration` is needed, and the changes will be done on the processor side, which will take care of rendering the text on the image.

cc @NielsRogge @Narsil