Refactor image features selection in LlaVa #33696

Merged
ArthurZucker merged 6 commits into huggingface:main from kenza-bouzid:kenzabouzid-refactor-image-features-selection-in-llava on Oct 1, 2024

Conversation

@kenza-bouzid
Contributor

What does this PR do?

Wrap image features selection in LLaVa in a separate function to make it easier to override for custom use cases (e.g. applying a layer norm to the image features before projection).
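
For illustration, the kind of override this enables (a sketch; NormalizedLlava and image_norm are hypothetical, and it assumes the new hook returns pre-projection features, as proposed here):

```python
from torch import nn
from transformers import LlavaForConditionalGeneration

class NormalizedLlava(LlavaForConditionalGeneration):
    """Hypothetical custom model that normalizes image features before projection."""

    def __init__(self, config):
        super().__init__(config)
        self.image_norm = nn.LayerNorm(config.vision_config.hidden_size)

    def get_image_features(self, pixel_values, **kwargs):
        # Reuse the stock selection logic, then apply the extra LayerNorm;
        # the forward pass projects the result as before.
        features = super().get_image_features(pixel_values, **kwargs)
        return self.image_norm(features)
```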

Fixes #33695

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker, @younesbelkada

Member

@zucchini-nlp zucchini-nlp left a comment

Hey @kenza-bouzid!

Thanks for opening a PR! Yes, actually it would be nice to have as a rule that VLMs have a get_image_features method where all the logic happens. My previous idea was to make the code less cluttered when a VLM has many special crops/pads/concats, but the use case of overriding the method for custom models is very cool!

We already have a similar pattern in video LLMs, for the same reason of uncluttering the code. WDYT about moving the whole image feature preparation into a separate method, including the MM projection, so that the method takes in pixels and returns features ready to be merged with text embeddings?

Also, we might need to change more models to follow the same pattern to make CI happy. If you have bandwidth, it would be nice to propagate the change to all VLMs. But don't stress if you can't, I'll do an overall standardization on that soon :)

@kenza-bouzid
Contributor Author

kenza-bouzid commented Sep 26, 2024

Hi @zucchini-nlp,

Thanks for your reply.

WDYT about moving the whole image feature preparation into a separate method, including the MM projection, so that the method takes in pixels and returns features ready to be merged with text embeddings?

Good idea; however, I would suggest a separate method for the projection as well, to make it easy to customize. I can think of many use cases where you may want to change the projection.

We can have get_image_features and project_image_features, called sequentially in a get_image_embeddings method that returns features ready to be merged with text embeddings, as you suggested.
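
Roughly, as methods on LlavaForConditionalGeneration (a sketch; the method names and exact signatures are hypothetical):

```python
def get_image_features(self, pixel_values, vision_feature_layer):
    # Selection only: run the vision tower and pick one hidden-state layer,
    # dropping the CLS token as the current forward pass does.
    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
    return image_outputs.hidden_states[vision_feature_layer][:, 1:]

def project_image_features(self, image_features):
    # Easy to override on its own, e.g. to swap in a different projector.
    return self.multi_modal_projector(image_features)

def get_image_embeddings(self, pixel_values, vision_feature_layer):
    # Features ready to be merged with text embeddings.
    features = self.get_image_features(pixel_values, vision_feature_layer)
    return self.project_image_features(features)
```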

If you have bandwidth, it would be nice to propagate the change to all VLMs. But don't stress if you can't, I'll do an overall standardization on that soon :)

I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass the CI? You probably have more context for an overall standardization!

@zucchini-nlp
Member

I would suggest a separate method for the projection as well, to make it easy to customize. I can think of many use cases where you may want to change the projection.

Hmm, IMO that is a bit redundant: since "select + project" is tightly interlinked and isn't a huge piece of code, we can ask users to override that one method, independently of whether they tweak the selection stage or the projection stage. That way we don't end up with separate methods that each perform a one-liner, which IMO is bad practice. LMK if you have any other ideas or objections.
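
Roughly what I have in mind, one method covering both (a sketch; close to what the current llava forward already does, just folded behind one overridable method):

```python
def get_image_features(self, pixel_values, vision_feature_layer, vision_feature_select_strategy):
    # Select + project in one overridable place.
    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
    selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
    if vision_feature_select_strategy == "default":
        # Drop the CLS token, keep only the patch tokens.
        selected_image_feature = selected_image_feature[:, 1:]
    elif vision_feature_select_strategy != "full":
        raise ValueError(f"Unexpected select feature strategy: {vision_feature_select_strategy}")
    return self.multi_modal_projector(selected_image_feature)
```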

I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass the CI? You probably have more context for an overall standardization!

No worries, I just realized it is the very first llava being modified here, so our CI should be happy after running make fix-copies and make style. For the other, trickier VLMs, I'll do one round of standardization later. Not a blocker at all :)

@kenza-bouzid
Contributor Author

Sounds great! Let me make the changes you suggested. Thank you!

self.vocab_size = model_embeds.num_embeddings
return model_embeds

def get_image_features(
Contributor Author

@zucchini-nlp make fix-copies copied this over, but looking at the forward pass of vipllava, the image features selection is slightly different since it selects features from layers -2, -5, -8, -11 and 6.
I don't think it's breaking anything since this function is not called. Note that we'll need vision_feature_layers: list[int] and not only a single one!
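
For reference, vipllava's variant would look roughly like this (a sketch based on its forward pass; it concatenates the per-layer features along the hidden dimension before projecting):

```python
import torch

def get_image_features(self, pixel_values, vision_feature_layers: list[int]):
    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
    # Select the patch tokens (dropping CLS) from each requested layer ...
    image_features = [image_outputs.hidden_states[idx][:, 1:] for idx in vision_feature_layers]
    # ... then concatenate along the hidden dimension and project.
    image_features = torch.cat(image_features, dim=-1)
    return self.multi_modal_projector(image_features)
```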

Member

Hmm, in that case it's better to make the method accept a list in vipllava and add an "# Ignore copy" statement so that it is not copied from llava. That way we can use get_image_features directly.

I don't think it's a good idea to merge something which doesn't work, even though it technically doesn't break anything.
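
i.e. something like this in modeling_vipllava.py (an illustrative sketch of the marker placement; the exact "Copied from" target and the class skeleton are assumptions):

```python
# Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration with Llava->VipLlava
class VipLlavaForConditionalGeneration(VipLlavaPreTrainedModel):
    # Ignore copy
    def get_image_features(self, pixel_values, vision_feature_layers: list[int]):
        ...  # vipllava-specific multi-layer selection, as sketched above
```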

Contributor Author

Agreed 💯, fixed now in f75e944.

@kenza-bouzid
Contributor Author

kenza-bouzid commented Sep 30, 2024

@zucchini-nlp can you please approve/trigger the CI workflows for tests? Thank you!

Member

@zucchini-nlp zucchini-nlp left a comment

Thanks a lot! LGTM in general; I left one comment for vipllava.

I'm approving this PR; when you're ready with vipllava, feel free to tag @LysandreJik (core maintainer). After his approval, the PR can be merged :)

@kenza-bouzid
Contributor Author

@LysandreJik can you please have a look? Thanks

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested review from ArthurZucker and removed the request for LysandreJik on October 1, 2024 at 12:35
Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM! The forward pass is pretty long, and this will come in handy with modular soon!

@ArthurZucker ArthurZucker merged commit 88d9609 into huggingface:main on Oct 1, 2024.
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request on Dec 5, 2024:
* refactor image features selection

* break line

* remove whitespace

* add pr comments: include projection and rename function

* make fix-copies

* fix get_image_feature in vip llava
