Refactor image features selection in LlaVa #33696

Merged
ArthurZucker merged 6 commits into huggingface:main from kenza-bouzid:kenzabouzid-refactor-image-features-selection-in-llava on Oct 1, 2024

Conversation

@kenza-bouzid
Contributor

What does this PR do?

Wrap image features selection in LLaVa in a separate function to make it easier to override for custom use cases (e.g. applying a layer norm to the image features before projection).
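
For illustration, the kind of override this enables (a sketch; NormalizedLlava and image_norm are hypothetical, and it assumes the new hook returns pre-projection features, as proposed here):

```python
from torch import nn
from transformers import LlavaForConditionalGeneration

class NormalizedLlava(LlavaForConditionalGeneration):
    """Hypothetical custom model that normalizes image features before projection."""

    def __init__(self, config):
        super().__init__(config)
        self.image_norm = nn.LayerNorm(config.vision_config.hidden_size)

    def get_image_features(self, pixel_values, **kwargs):
        # Reuse the stock selection logic, then apply the extra LayerNorm;
        # the forward pass projects the result as before.
        features = super().get_image_features(pixel_values, **kwargs)
        return self.image_norm(features)
```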

Fixes #33695

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker, @younesbelkada

Member

@zucchini-nlp zucchini-nlp left a comment

Hey @kenza-bouzid!

Thanks for opening a PR! Yes, actually it would be nice to have as a rule that VLMs have a get_image_features method where all the logic happens. My previous idea was to make the code less cluttered when a VLM has many special crops/pads/concats, but the use case of overriding the method for custom models is very cool!

We already have a similar pattern in video LLMs, for the same reason of uncluttering the code. WDYT about moving the whole image feature preparation into a separate method, including the MM projection, so that the method takes in pixels and returns features ready to be merged with text embeddings?

Also, we might need to change more models to follow the same pattern to make CI happy. If you have bandwidth, it would be nice to propagate the change to all VLMs. But don't stress if you can't, I'll do an overall standardization on that soon :)

@kenza-bouzid
Contributor Author

kenza-bouzid commented Sep 26, 2024

Hi @zucchini-nlp,

Thanks for your reply.

WDYT about moving the whole image feature preparation into a separate method, including the MM projection, so that the method takes in pixels and returns features ready to be merged with text embeddings?

Good idea; however, I would suggest a separate method for the projection as well, to make it easy to customize. I can think of many use cases where you may want to change the projection.

We can have get_image_features and project_image_features, called sequentially in a get_image_embeddings method that returns features ready to be merged with text embeddings, as you suggested.
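
Roughly, as methods on LlavaForConditionalGeneration (a sketch; the method names and exact signatures are hypothetical):

```python
def get_image_features(self, pixel_values, vision_feature_layer):
    # Selection only: run the vision tower and pick one hidden-state layer,
    # dropping the CLS token as the current forward pass does.
    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
    return image_outputs.hidden_states[vision_feature_layer][:, 1:]

def project_image_features(self, image_features):
    # Easy to override on its own, e.g. to swap in a different projector.
    return self.multi_modal_projector(image_features)

def get_image_embeddings(self, pixel_values, vision_feature_layer):
    # Features ready to be merged with text embeddings.
    features = self.get_image_features(pixel_values, vision_feature_layer)
    return self.project_image_features(features)
```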

If you have bandwidth, it would be nice to propagate the change to all VLMs. But don't stress if you can't, I'll do an overall standardization on that soon :)

I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass the CI? You probably have more context for an overall standardization!

@zucchini-nlp
Member

I would suggest a separate method for the projection as well, to make it easy to customize. I can think of many use cases where you may want to change the projection.

Hmm, IMO that is a bit redundant: since "select + project" is tightly interlinked and isn't a huge piece of code, we can ask users to override that one method, independently of whether they tweak the selection stage or the projection stage. That way we don't end up with separate methods that each perform a one-liner, which IMO is bad practice. LMK if you have any other ideas or objections.
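
Roughly what I have in mind, one method covering both (a sketch; close to what the current llava forward already does, just folded behind one overridable method):

```python
def get_image_features(self, pixel_values, vision_feature_layer, vision_feature_select_strategy):
    # Select + project in one overridable place.
    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
    selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
    if vision_feature_select_strategy == "default":
        # Drop the CLS token, keep only the patch tokens.
        selected_image_feature = selected_image_feature[:, 1:]
    elif vision_feature_select_strategy != "full":
        raise ValueError(f"Unexpected select feature strategy: {vision_feature_select_strategy}")
    return self.multi_modal_projector(selected_image_feature)
```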

I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass the CI? You probably have more context for an overall standardization!

No worries, I just realized it is the very first llava being modified here, so our CI should be happy after running make fix-copies and make style. For the other, trickier VLMs, I'll do one round of standardization later. Not a blocker at all :)

@kenza-bouzid
Contributor Author

Sounds great! Let me make the changes you suggested. Thank you!

self.vocab_size = model_embeds.num_embeddings
return model_embeds

def get_image_features(
Contributor Author

@zucchini-nlp make fix-copies copied this over, but looking at the forward pass of vipllava, the image features selection is slightly different since it selects features from layers -2, -5, -8, -11 and 6.
I don't think it's breaking anything since this function is not called. Note that we'll need vision_feature_layers: list[int] and not only a single one!
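
For reference, vipllava's variant would look roughly like this (a sketch based on its forward pass; it concatenates the per-layer features along the hidden dimension before projecting):

```python
import torch

def get_image_features(self, pixel_values, vision_feature_layers: list[int]):
    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
    # Select the patch tokens (dropping CLS) from each requested layer ...
    image_features = [image_outputs.hidden_states[idx][:, 1:] for idx in vision_feature_layers]
    # ... then concatenate along the hidden dimension and project.
    image_features = torch.cat(image_features, dim=-1)
    return self.multi_modal_projector(image_features)
```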

Member

Hmm, in that case it's better to make the method accept a list in vipllava and add an "# Ignore copy" statement so that it is not copied from llava. That way we can use get_image_features directly.

I don't think it's a good idea to merge something which doesn't work, even though it technically doesn't break anything.
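
i.e. something like this in modeling_vipllava.py (an illustrative sketch of the marker placement; the exact "Copied from" target and the class skeleton are assumptions):

```python
# Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration with Llava->VipLlava
class VipLlavaForConditionalGeneration(VipLlavaPreTrainedModel):
    # Ignore copy
    def get_image_features(self, pixel_values, vision_feature_layers: list[int]):
        ...  # vipllava-specific multi-layer selection, as sketched above
```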

Contributor Author

Agreed 💯, fixed now in f75e944.

@kenza-bouzid
Contributor Author

kenza-bouzid commented Sep 30, 2024

@zucchini-nlp can you please approve/trigger the CI workflows for tests? Thank you!

Member

@zucchini-nlp zucchini-nlp left a comment

Thanks a lot! LGTM in general; I left one comment for vipllava.

I'm approving this PR; when you're ready with vipllava, feel free to tag @LysandreJik (core maintainer). After his approval, the PR can be merged :)

@kenza-bouzid
Contributor Author

@LysandreJik can you please have a look? Thanks

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested review from ArthurZucker and removed the request for LysandreJik on October 1, 2024 at 12:35
Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM! The forward pass is pretty long, and this will come in handy with modular soon!

@ArthurZucker ArthurZucker merged commit 88d9609 into huggingface:main on Oct 1, 2024.
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request on Dec 5, 2024:
* refactor image features selection

* break line

* remove whitespace

* add pr comments: include projection and rename function

* make fix-copies

* fix get_image_feature in vip llava
