Add InstructBlip to VQA pipeline#26885
Conversation
|
Hey! Feel free to ping @rafaelpadilla as well once this is ready! |
|
Ok, I've added some simple logic to the the @rafaelpadilla Can you take a look as suggested by @ArthurZucker ? Thanks, team! |
| model_inputs.update(image_features) | ||
| else: | ||
| # InstructBlip processes the image and text simultaneously: | ||
| model_inputs = self.processor(images=image, text=inputs["question"], return_tensors=self.framework) |
There was a problem hiding this comment.
we do not use the processor classes in the pipelines explicitly (@Narsil could explain why).
hence we use the image_processor and tokenizer separately inside the pipeline, so for InstructBLIP it would look very similar to BLIP-2
There was a problem hiding this comment.
Ok, so you basically reimplement what's in Blip2Processor.__call__ within VisualQuestionAnsweringPipeline.preprocess using the tokenizer and image_processor already available? And since InstructBlip passes the text encoding to the image_processor as well, I still need a switch to handle the inputs differently, it's just that I don't need to load the InstructBlipProcessor? I think I'll understand. I'll update the PR.
There was a problem hiding this comment.
I looked into this and see a complication. The Blip2Processor only requires a tokenizer and image_processor, both of which are available to the pipeline. However, the InstructBlipProcessor uses a tokenizer, image_processor, and qformer_tokenizer; the latter is not available within the pipeline object. So how can we do a forward pass without loading InstructBlipProcessor?
There was a problem hiding this comment.
In that case, we will need to load the qformer_tokenizer inside the pipeline, only if the model_type is "instructblip"
There was a problem hiding this comment.
Thanks for clarifying. I'll look into your suggestion.
Yes, I'm curious about the decision not to use the Processor classes in pipelines. It seems like the alternative is a lot of work duplicating functionality and pushing implementation details (e.g. the existence of qformers) into the pipeline code instead of using the processor abstraction.
There was a problem hiding this comment.
@NielsRogge Can you please take another look? I've modified the implementation as you suggested. Thanks!
NielsRogge
left a comment
There was a problem hiding this comment.
Thanks for your PR, it looks good to me if we're going for the route of using separate tokenizers/image processors in the pipelines (as is the case now).
However I'm interested to hear @Narsil opinion in working on having a robust API for multimodal processors, such that pipelines will be able to leverage them, resulting in cleaner code and less if-else statements.
|
@NielsRogge , it's all a question of invariant. In order for some piece of software/code to be easy to use from the external world, it needs to uphold invariants, declare it's usage and outcome very clearly and fail clearly when used outside of that. Good case : I don't have to read the doc, or anything, I'll send it a Bad case I have no idea what I'm supposed to send, no idea what I'm going to get (am I even going to get anything ?). Do I even get always the same things depending on what I send. All models should accept several tensors. We have a relatively standardized name for the ones which are commonly shared: For multimodal, the number and names of tensors has expanded quite a bit, and I really feel like we should have 1 model class per standard signature, and not play around with the args like crazy ( Since multimodal kind of requires different kinds of models with potentially explicitly different signatures, we need to find a way to make the Not that The problem with current This is the sort of mess we should strive to avoid : https://github.com/search?q=repo%3Ahuggingface%2Ftransformers%20inspect.signature&type=code So for -> Yes there would be many but each would uphold a standard (and different from each other) signature. If something like that could exist we could have: Then the pipeline's code could probably simplify away a lot of those if/else. Does this make any sense to you ? |
|
Hi @Narsil those are all really great points. I agree with the need to manage interfaces more carefully. What do you suggest in the context of this PR? Are your suggestions that we should implement before merging this or is this a much broader scope that would be added or refactored later? Or is there a way that we can get there incrementally (e.g., new code adopts formal interfaces, but old code still works)? Thanks for your guidance! |
|
Reading @Narsil , I have some questions:
|
Yes Processor in general don't have a fixed signature.
Processor output being undeterministic, just underlines the fact that we are lacking clear signatures to |
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
What does this PR do?
This PR attempts to support InstructBlip as an acceptable model in the VQA pipeline. Currently, only Blip2 is supported.
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@NielsRogge @amyeroberts @Narsil
Description
This PR is not complete, but I'd like some guidance so that I can finalize the contribution. One issue is that InstructBlip handles the tokenization a bit. I have a bit of an ugly check; maybe this can be improved.
The bigger issue is that the pipeline loads a
BlipImageProcessorinstead ofInstructBlipProcessor. I think this comes from this config. The wrinkle here is that the Q-former (theInstructBlipProcessorclass) is a layer above the image processor (theBlipImageProcessorclass). We need a handle to an instantiatedInstructBlipProcessor, but thepipelinefunction doesn't seem to load it. I don't see a hook for subclasses to load additional processors either. Maybe add theInstructBlipProcessoras a kwarg to__init__in theVisualQuestionAnsweringPipelineclass? Would love feedback from the core devs.Testing
The following code works with my changes: