Skip to content

Add InstructBlip to VQA pipeline#26885

Closed
stevenmanton wants to merge 6 commits into
huggingface:mainfrom
stevenmanton:add-instructblip-to-vqa-pipeline
Closed

Add InstructBlip to VQA pipeline#26885
stevenmanton wants to merge 6 commits into
huggingface:mainfrom
stevenmanton:add-instructblip-to-vqa-pipeline

Conversation

@stevenmanton

@stevenmanton stevenmanton commented Oct 17, 2023

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR attempts to support InstructBlip as an acceptable model in the VQA pipeline. Currently, only Blip2 is supported.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@NielsRogge @amyeroberts @Narsil

Description

This PR is not complete, but I'd like some guidance so that I can finalize the contribution. One issue is that InstructBlip handles the tokenization a bit. I have a bit of an ugly check; maybe this can be improved.

The bigger issue is that the pipeline loads a BlipImageProcessor instead of InstructBlipProcessor. I think this comes from this config. The wrinkle here is that the Q-former (the InstructBlipProcessor class) is a layer above the image processor (the BlipImageProcessor class). We need a handle to an instantiated InstructBlipProcessor, but the pipeline function doesn't seem to load it. I don't see a hook for subclasses to load additional processors either. Maybe add the InstructBlipProcessor as a kwarg to __init__ in the VisualQuestionAnsweringPipeline class? Would love feedback from the core devs.

Testing

The following code works with my changes:

from transformers import pipeline

checkpoint = "Salesforce/instructblip-flan-t5-xl"
pipe = transformers.pipeline("vqa", model=checkpoint)

# This shouldn't be necessary, but I'm not sure why AutoImageProcessor loads the wrong processor:
image_processor = InstructBlipProcessor.from_pretrained(checkpoint)
pipe.image_processor = image_processor

image_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/lena.png"
pipe(question="What is she wearing ?", image=image_url)
# [{'answer': 'hat'}]

@ArthurZucker

Copy link
Copy Markdown
Collaborator

Hey! Feel free to ping @rafaelpadilla as well once this is ready!

@stevenmanton

Copy link
Copy Markdown
Contributor Author

Ok, I've added some simple logic to the the VisualQuestionAnsweringPipeline class that will load the InstructBlipProcessor if required. It still seems a little clunky to me, but I don't know of a better way. I also added a slow test for InstructBlip and all the tests in tests/pipelines/test_pipelines_visual_question_answering.py pass.

@rafaelpadilla Can you take a look as suggested by @ArthurZucker ? Thanks, team!

model_inputs.update(image_features)
else:
# InstructBlip processes the image and text simultaneously:
model_inputs = self.processor(images=image, text=inputs["question"], return_tensors=self.framework)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not use the processor classes in the pipelines explicitly (@Narsil could explain why).

hence we use the image_processor and tokenizer separately inside the pipeline, so for InstructBLIP it would look very similar to BLIP-2

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so you basically reimplement what's in Blip2Processor.__call__ within VisualQuestionAnsweringPipeline.preprocess using the tokenizer and image_processor already available? And since InstructBlip passes the text encoding to the image_processor as well, I still need a switch to handle the inputs differently, it's just that I don't need to load the InstructBlipProcessor? I think I'll understand. I'll update the PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this and see a complication. The Blip2Processor only requires a tokenizer and image_processor, both of which are available to the pipeline. However, the InstructBlipProcessor uses a tokenizer, image_processor, and qformer_tokenizer; the latter is not available within the pipeline object. So how can we do a forward pass without loading InstructBlipProcessor?

@NielsRogge NielsRogge Oct 19, 2023

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we will need to load the qformer_tokenizer inside the pipeline, only if the model_type is "instructblip"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. I'll look into your suggestion.

Yes, I'm curious about the decision not to use the Processor classes in pipelines. It seems like the alternative is a lot of work duplicating functionality and pushing implementation details (e.g. the existence of qformers) into the pipeline code instead of using the processor abstraction.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NielsRogge Can you please take another look? I've modified the implementation as you suggested. Thanks!

@NielsRogge NielsRogge left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR, it looks good to me if we're going for the route of using separate tokenizers/image processors in the pipelines (as is the case now).

However I'm interested to hear @Narsil opinion in working on having a robust API for multimodal processors, such that pipelines will be able to leverage them, resulting in cleaner code and less if-else statements.

@Narsil

Narsil commented Nov 3, 2023

Copy link
Copy Markdown
Contributor

@NielsRogge , it's all a question of invariant.

In order for some piece of software/code to be easy to use from the external world, it needs to uphold invariants, declare it's usage and outcome very clearly and fail clearly when used outside of that.
Typically so users really can forget about the internals, and focus on the inside.

Good case :

def lowercase(name: str)  -> str:
    # internals

I don't have to read the doc, or anything, I'll send it a str, and receive a str back, whatever the insides are. (and the name happens to be quite self explaining)

Bad case

def prepare(*args, *kwargs):
    # internals

I have no idea what I'm supposed to send, no idea what I'm going to get (am I even going to get anything ?). Do I even get always the same things depending on what I send.

All models should accept several tensors. We have a relatively standardized name for the ones which are commonly shared: input_ids is the text sequence in tensor form pixel_values are the images etc..
Not all models accept the same set of tensors, but all (to the best of my knowledge) are relatively clear on what's the minimal set of sendable tensors (Big asterisk here because we have lots of contrived pathways like input_embeds but I'm putting them aside for now, as I think it's a relatively marginal use case).

For multimodal, the number and names of tensors has expanded quite a bit, and I really feel like we should have 1 model class per standard signature, and not play around with the args like crazy (input_embeds kind of set a bad example here IMHO which is why I guess we've allowed ourselves to go that way).

Since multimodal kind of requires different kinds of models with potentially explicitly different signatures, we need to find a way to make the pipeline 's signature standard still.
With that regard, currently the upheld invariant is the entire chain (from the pipeline's point of view):

inputs = tokenizer(string)  # Omitting the return types 
logits = model(**inputs)

Not that inputs is a dict, which can contain arbitrary keys, but we know it's Dict[string, Tensor]. and string : str.

The problem with current Processor is that it does not uphold any for of signature invariant. It might accept anything string, pixel_values, and any combination thereof. It's a GREAT tool for short snippet but quite impossible to use in practice.

This is the sort of mess we should strive to avoid : https://github.com/search?q=repo%3Ahuggingface%2Ftransformers%20inspect.signature&type=code
Inspecting signature is slow, and most importantly can create super nasty silent bug. By modifying stuff on the fly without the caller's realization which might be removing/adding some important information which may not have been desired.

So for Processor to be more usage I think it should follow the pipeline's signature calls and output something like the tokenizer output Dict[string, Tensor].

-> TextProcessor ~= Tokenizer (For instance the marian tokenizer which contains 2 actual tokenizers would be a good contender, just like this particular PR which does pretty much the same but using 2 actual different tokenizer).
-> ImageProcessor exactly like the current one
-> TextImageProcessor (For something like visual qa, text ~= question + image-> pixel_values)
...etc..

Yes there would be many but each would uphold a standard (and different from each other) signature.
I think this could be done in a non breaking way but is probably a good way to go IMO.
It's also probably A LOT of work because of all the very weird tiny detail differences in multimodal models (which we might be able to make slightly more uniform).
Definitely not something I would have the time to dive into.

If something like that could exist we could have:
-> Each AutoModel knowing the XXXXProcessor that it needs so we could keep the classic way with AutoProcessor.
-> The pipeline knowing already the XXXProcessor it needs (because they have the same signature than the actual pipeline). And it could actually check that the model and the XXXProcessor do match.

Then the pipeline's code could probably simplify away a lot of those if/else.
The main issue I see which might be coming up, is in the parameters, which are the non standard way of modifying output.
For instance, things like return_timestamps within a pipeline. Should only the pipeline take care of it ? Should all the XXProcessor align with it (which would mean basically the pipeline would become def preprocess(...): self.processor( ...) (Not declared as kwargs of course with a real signature) ?

Does this make any sense to you ?

@stevenmanton

Copy link
Copy Markdown
Contributor Author

Hi @Narsil those are all really great points. I agree with the need to manage interfaces more carefully. What do you suggest in the context of this PR? Are your suggestions that we should implement before merging this or is this a much broader scope that would be added or refactored later? Or is there a way that we can get there incrementally (e.g., new code adopts formal interfaces, but old code still works)? Thanks for your guidance!

@ydshieh

ydshieh commented Nov 15, 2023

Copy link
Copy Markdown
Collaborator

Reading @Narsil , I have some questions:

  • Regarding the invariants, it seems to me you are looking for:
    • the (main) inputs to a pipeline should match the (main) input(s) to the processor (if we ever want to use it). For example, for ImageToTextPipeline, we want to make sure, if its __call__ accepts images (possibly with text as an prompt), the processor used in that pipele should match this signature. We are sure this holds for text only and image only processor, but this is not the case for the general processor. Do I understand this correctly?
    • for the output, I think the processors output Dict[string, Tensor] (well, if return_tensors=... is specified). So it's not clear to me what's the concern here. And just like the tokenizers, the dict's can contain arbitrary kesy, but this is not really an issue (as you mentioned in your comment), and that is not pipeline's responsiblility to assure, but the binding of a processor to a model should match (during model addition, inside our auto mapping, etc.). So could you elaborate a bit more regarindg this processor output part if this is really an concern?

@Narsil

Narsil commented Nov 27, 2023

Copy link
Copy Markdown
Contributor

Do I understand this correctly?

Yes Processor in general don't have a fixed signature.

So could you elaborate a bit more regarindg this processor output part if this is really an concern?

Processor output being undeterministic, just underlines the fact that we are lacking clear signatures to AutoModel.
This is acceptable since we're using this property already for model(**tokenizer(string, return_tensors="xx")) but not ideal in my book.
Ideal is as the name suggests, an ideal, something we should be something we should aim for, but not an absolute set of rules, pragmatic solutions will always beat ideal code, and ideal code never actually exists.

@NielsRogge

Copy link
Copy Markdown
Collaborator

In spirit of that, I'm adding common processor tests, starting at #27720. This to make sure that multimodal processors follow the same API (or at least, to the greatest extent possible).

Similar PRs are done for testing the forward signature of models, at #27681 and #27729.

@github-actions

github-actions Bot commented Mar 8, 2024

Copy link
Copy Markdown
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions Bot closed this Mar 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants