Add InstructBlip to VQA pipeline by stevenmanton · Pull Request #26885 · huggingface/transformers

stevenmanton · 2023-10-17T22:38:41Z

What does this PR do?

This PR attempts to support InstructBlip as an acceptable model in the VQA pipeline. Currently, only Blip2 is supported.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@NielsRogge @amyeroberts @Narsil

Description

This PR is not complete, but I'd like some guidance so that I can finalize the contribution. One issue is that InstructBlip handles the tokenization a bit. I have a bit of an ugly check; maybe this can be improved.

The bigger issue is that the pipeline loads a BlipImageProcessor instead of InstructBlipProcessor. I think this comes from this config. The wrinkle here is that the Q-former (the InstructBlipProcessor class) is a layer above the image processor (the BlipImageProcessor class). We need a handle to an instantiated InstructBlipProcessor, but the pipeline function doesn't seem to load it. I don't see a hook for subclasses to load additional processors either. Maybe add the InstructBlipProcessor as a kwarg to __init__ in the VisualQuestionAnsweringPipeline class? Would love feedback from the core devs.

Testing

The following code works with my changes:

from transformers import pipeline

checkpoint = "Salesforce/instructblip-flan-t5-xl"
pipe = transformers.pipeline("vqa", model=checkpoint)

# This shouldn't be necessary, but I'm not sure why AutoImageProcessor loads the wrong processor:
image_processor = InstructBlipProcessor.from_pretrained(checkpoint)
pipe.image_processor = image_processor

image_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/lena.png"
pipe(question="What is she wearing ?", image=image_url)
# [{'answer': 'hat'}]

ArthurZucker · 2023-10-18T09:40:10Z

Hey! Feel free to ping @rafaelpadilla as well once this is ready!

stevenmanton · 2023-10-18T20:34:37Z

Ok, I've added some simple logic to the the VisualQuestionAnsweringPipeline class that will load the InstructBlipProcessor if required. It still seems a little clunky to me, but I don't know of a better way. I also added a slow test for InstructBlip and all the tests in tests/pipelines/test_pipelines_visual_question_answering.py pass.

@rafaelpadilla Can you take a look as suggested by @ArthurZucker ? Thanks, team!

NielsRogge · 2023-10-18T20:39:51Z

+            model_inputs.update(image_features)
+        else:
+            # InstructBlip processes the image and text simultaneously:
+            model_inputs = self.processor(images=image, text=inputs["question"], return_tensors=self.framework)


we do not use the processor classes in the pipelines explicitly (@Narsil could explain why).

hence we use the image_processor and tokenizer separately inside the pipeline, so for InstructBLIP it would look very similar to BLIP-2

Ok, so you basically reimplement what's in Blip2Processor.__call__ within VisualQuestionAnsweringPipeline.preprocess using the tokenizer and image_processor already available? And since InstructBlip passes the text encoding to the image_processor as well, I still need a switch to handle the inputs differently, it's just that I don't need to load the InstructBlipProcessor? I think I'll understand. I'll update the PR.

I looked into this and see a complication. The Blip2Processor only requires a tokenizer and image_processor, both of which are available to the pipeline. However, the InstructBlipProcessor uses a tokenizer, image_processor, and qformer_tokenizer; the latter is not available within the pipeline object. So how can we do a forward pass without loading InstructBlipProcessor?

In that case, we will need to load the qformer_tokenizer inside the pipeline, only if the model_type is "instructblip"

Thanks for clarifying. I'll look into your suggestion.

Yes, I'm curious about the decision not to use the Processor classes in pipelines. It seems like the alternative is a lot of work duplicating functionality and pushing implementation details (e.g. the existence of qformers) into the pipeline code instead of using the processor abstraction.

@NielsRogge Can you please take another look? I've modified the implementation as you suggested. Thanks!

NielsRogge

Thanks for your PR, it looks good to me if we're going for the route of using separate tokenizers/image processors in the pipelines (as is the case now).

However I'm interested to hear @Narsil opinion in working on having a robust API for multimodal processors, such that pipelines will be able to leverage them, resulting in cleaner code and less if-else statements.

Narsil · 2023-11-03T10:59:22Z

@NielsRogge , it's all a question of invariant.

In order for some piece of software/code to be easy to use from the external world, it needs to uphold invariants, declare it's usage and outcome very clearly and fail clearly when used outside of that.
Typically so users really can forget about the internals, and focus on the inside.

Good case :

def lowercase(name: str)  -> str:
    # internals

I don't have to read the doc, or anything, I'll send it a str, and receive a str back, whatever the insides are. (and the name happens to be quite self explaining)

Bad case

def prepare(*args, *kwargs):
    # internals

I have no idea what I'm supposed to send, no idea what I'm going to get (am I even going to get anything ?). Do I even get always the same things depending on what I send.

All models should accept several tensors. We have a relatively standardized name for the ones which are commonly shared: input_ids is the text sequence in tensor form pixel_values are the images etc..
Not all models accept the same set of tensors, but all (to the best of my knowledge) are relatively clear on what's the minimal set of sendable tensors (Big asterisk here because we have lots of contrived pathways like input_embeds but I'm putting them aside for now, as I think it's a relatively marginal use case).

For multimodal, the number and names of tensors has expanded quite a bit, and I really feel like we should have 1 model class per standard signature, and not play around with the args like crazy (input_embeds kind of set a bad example here IMHO which is why I guess we've allowed ourselves to go that way).

Since multimodal kind of requires different kinds of models with potentially explicitly different signatures, we need to find a way to make the pipeline 's signature standard still.
With that regard, currently the upheld invariant is the entire chain (from the pipeline's point of view):

inputs = tokenizer(string)  # Omitting the return types 
logits = model(**inputs)

Not that inputs is a dict, which can contain arbitrary keys, but we know it's Dict[string, Tensor]. and string : str.

The problem with current Processor is that it does not uphold any for of signature invariant. It might accept anything string, pixel_values, and any combination thereof. It's a GREAT tool for short snippet but quite impossible to use in practice.

This is the sort of mess we should strive to avoid : https://github.com/search?q=repo%3Ahuggingface%2Ftransformers%20inspect.signature&type=code
Inspecting signature is slow, and most importantly can create super nasty silent bug. By modifying stuff on the fly without the caller's realization which might be removing/adding some important information which may not have been desired.

So for Processor to be more usage I think it should follow the pipeline's signature calls and output something like the tokenizer output Dict[string, Tensor].

-> TextProcessor ~= Tokenizer (For instance the marian tokenizer which contains 2 actual tokenizers would be a good contender, just like this particular PR which does pretty much the same but using 2 actual different tokenizer).
-> ImageProcessor exactly like the current one
-> TextImageProcessor (For something like visual qa, text ~= question + image-> pixel_values)
...etc..

Yes there would be many but each would uphold a standard (and different from each other) signature.
I think this could be done in a non breaking way but is probably a good way to go IMO.
It's also probably A LOT of work because of all the very weird tiny detail differences in multimodal models (which we might be able to make slightly more uniform).
Definitely not something I would have the time to dive into.

If something like that could exist we could have:
-> Each AutoModel knowing the XXXXProcessor that it needs so we could keep the classic way with AutoProcessor.
-> The pipeline knowing already the XXXProcessor it needs (because they have the same signature than the actual pipeline). And it could actually check that the model and the XXXProcessor do match.

Then the pipeline's code could probably simplify away a lot of those if/else.
The main issue I see which might be coming up, is in the parameters, which are the non standard way of modifying output.
For instance, things like return_timestamps within a pipeline. Should only the pipeline take care of it ? Should all the XXProcessor align with it (which would mean basically the pipeline would become def preprocess(...): self.processor( ...) (Not declared as kwargs of course with a real signature) ?

Does this make any sense to you ?

stevenmanton · 2023-11-13T03:31:16Z

Hi @Narsil those are all really great points. I agree with the need to manage interfaces more carefully. What do you suggest in the context of this PR? Are your suggestions that we should implement before merging this or is this a much broader scope that would be added or refactored later? Or is there a way that we can get there incrementally (e.g., new code adopts formal interfaces, but old code still works)? Thanks for your guidance!

ydshieh · 2023-11-15T15:38:07Z

Reading @Narsil , I have some questions:

Regarding the invariants, it seems to me you are looking for:
- the (main) inputs to a pipeline should match the (main) input(s) to the processor (if we ever want to use it). For example, for ImageToTextPipeline, we want to make sure, if its __call__ accepts images (possibly with text as an prompt), the processor used in that pipele should match this signature. We are sure this holds for text only and image only processor, but this is not the case for the general processor. Do I understand this correctly?
- for the output, I think the processors output Dict[string, Tensor] (well, if return_tensors=... is specified). So it's not clear to me what's the concern here. And just like the tokenizers, the dict's can contain arbitrary kesy, but this is not really an issue (as you mentioned in your comment), and that is not pipeline's responsiblility to assure, but the binding of a processor to a model should match (during model addition, inside our auto mapping, etc.). So could you elaborate a bit more regarindg this processor output part if this is really an concern?

Narsil · 2023-11-27T11:48:31Z

Do I understand this correctly?

Yes Processor in general don't have a fixed signature.

So could you elaborate a bit more regarindg this processor output part if this is really an concern?

Processor output being undeterministic, just underlines the fact that we are lacking clear signatures to AutoModel.
This is acceptable since we're using this property already for model(**tokenizer(string, return_tensors="xx")) but not ideal in my book.
Ideal is as the name suggests, an ideal, something we should be something we should aim for, but not an absolute set of rules, pragmatic solutions will always beat ideal code, and ideal code never actually exists.

NielsRogge · 2023-11-27T16:29:44Z

In spirit of that, I'm adding common processor tests, starting at #27720. This to make sure that multimodal processors follow the same API (or at least, to the greatest extent possible).

Similar PRs are done for testing the forward signature of models, at #27681 and #27729.

github-actions · 2024-03-08T08:06:02Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

stevenmanton added 2 commits October 17, 2023 15:28

Initial attempt to add InstructBlip to VQA pipeline

eb84e20

Correct the InstructBlip ImageProcessor

d53ff1a

stevenmanton added 2 commits October 18, 2023 13:07

Load InstructBlipProcessor in __init__

732d88e

Add a test

c93c69d

NielsRogge reviewed Oct 18, 2023

View reviewed changes

stevenmanton added 2 commits October 19, 2023 21:01

Load qformer_tokenizer in pipeline

d72d466

Fix format

c3f37d9

stevenmanton requested a review from NielsRogge October 24, 2023 18:22

NielsRogge reviewed Oct 31, 2023

View reviewed changes

molbap mentioned this pull request Nov 30, 2023

[WIP] Uniformize processors in text+image multimodal models. #27768

Draft

33 tasks

huggingface deleted a comment from github-actions Bot Dec 22, 2023

huggingface deleted a comment from github-actions Bot Jan 16, 2024

huggingface deleted a comment from github-actions Bot Feb 12, 2024

github-actions Bot closed this Mar 16, 2024

Conversation

stevenmanton commented Oct 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

Who can review?

Description

Testing

Uh oh!

ArthurZucker commented Oct 18, 2023

Uh oh!

stevenmanton commented Oct 18, 2023

Uh oh!

NielsRogge Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

stevenmanton Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

stevenmanton Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

NielsRogge Oct 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stevenmanton Oct 19, 2023

Choose a reason for hiding this comment

Uh oh!

stevenmanton Oct 20, 2023

Choose a reason for hiding this comment

Uh oh!

NielsRogge left a comment

Choose a reason for hiding this comment

Uh oh!

Narsil commented Nov 3, 2023

Uh oh!

stevenmanton commented Nov 13, 2023

Uh oh!

ydshieh commented Nov 15, 2023

Uh oh!

Narsil commented Nov 27, 2023

Uh oh!

NielsRogge commented Nov 27, 2023

Uh oh!

github-actions Bot commented Mar 8, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

stevenmanton commented Oct 17, 2023 •

edited

Loading

NielsRogge Oct 19, 2023 •

edited

Loading