Fix attention propagation for vision towers of llava-like models #32221
qubvel wants to merge 3 commits into huggingface:main
Conversation
zucchini-nlp
left a comment
Thanks for fixing this! Partially related to #30565
The only concern is that the current way we check `supports_sdpa` as a `PreTrainedModel` property is not correct. There are two concerns here:
- Currently it checks only `language_model.supports_sdpa`, which means that if the LM supports SDPA and the vision tower doesn't, assigning SDPA to the vision tower will fail.
- The above holds true only if we trust that the `supports_sdpa` property works correctly. Recently I found that it can't go and check the LM's `supports_sdpa` flag, because the LM is not initialized at that step. So we don't know which class the LM will be, nor its flags. The same is true for the vision tower. I am currently working on that; it will probably have to be a VLM-specific check through the text config and vision config, since we can deduce the class type from the config type. WDYT about it?
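A minimal sketch of the config-based check described above: deduce SDPA support from the sub-configs before the sub-models are instantiated. The registry and class names below are illustrative stand-ins, not the actual transformers internals.

```python
# Illustrative registry mapping a config class name to the SDPA-support flag
# of the model class that config would resolve to (hypothetical values).
SDPA_SUPPORT_BY_CONFIG = {
    "LlamaConfig": True,       # language model backbone supports SDPA
    "CLIPVisionConfig": True,  # vision tower supports SDPA
    "SomeEagerOnlyConfig": False,
}


def composite_supports_sdpa(text_config_cls: str, vision_config_cls: str) -> bool:
    """A composite VLM can use SDPA only if both sub-models support it."""
    return (
        SDPA_SUPPORT_BY_CONFIG.get(text_config_cls, False)
        and SDPA_SUPPORT_BY_CONFIG.get(vision_config_cls, False)
    )


print(composite_supports_sdpa("LlamaConfig", "CLIPVisionConfig"))   # True
print(composite_supports_sdpa("LlamaConfig", "SomeEagerOnlyConfig"))  # False
```

The key point is that the lookup goes through config types rather than initialized model instances, which is what makes the check possible before the LM and vision tower exist.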
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@zucchini-nlp If I get it right, your concern is regarding the situation
Am I right? The question here is how we should specify the attention implementation: should we dispatch it automatically or let the user make a choice? For example, instead of passing
@qubvel my idea is to dispatch automatically, as that is backwards compatible and I think most VLM-related models support SDPA already. We'll raise a warning if either one doesn't support SDPA, so that the user knows what's happening in the background. I already have some code for that, so that we infer the SDPA flag from the config and try to dispatch to whichever implementation is supported. But that means we'll have to slightly change tests, because tests skip based on the uninitialized classes' SDPA flag, which will be False by default for all Llavas. I can make a PR soon; let's see if that works for you.
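The dispatch-with-warning behavior proposed above could look something like the following sketch. The function name and signature are hypothetical; it only illustrates the fallback logic, not the actual implementation in the upcoming PR.

```python
import warnings


def resolve_attn_implementation(
    requested: str, lm_supports_sdpa: bool, vision_supports_sdpa: bool
) -> str:
    """Keep the requested implementation unless SDPA was requested but one
    sub-model lacks it, in which case warn and fall back to eager."""
    if requested != "sdpa":
        return requested
    if lm_supports_sdpa and vision_supports_sdpa:
        return "sdpa"
    warnings.warn(
        "One of the sub-models does not support SDPA; falling back to eager "
        "attention for the whole model."
    )
    return "eager"


print(resolve_attn_implementation("sdpa", True, True))   # sdpa
print(resolve_attn_implementation("sdpa", True, False))  # eager (with warning)
```

This keeps the change backwards compatible: users who never pass `attn_implementation` still get SDPA where possible, and a warning explains any silent fallback.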
Ok, that sounds good, let's wait until your PR is ready to see if we still need this fix!
What does this PR do?
The CLIP vision model now supports SDPA attention and automatically dispatches to SDPA when possible. Llava-like models use `AutoModel.from_config` to initialize the vision tower; however, `attn_implementation` was not propagated, which caused errors in CI: the model was in eager mode while the CLIP vision tower was using SDPA attention.

cc @zucchini-nlp @ydshieh
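A minimal sketch of the propagation bug and its fix, assuming illustrative stand-in classes (the real code path goes through `AutoModel.from_config` inside the Llava model's `__init__`):

```python
class DummyVisionModel:
    """Stand-in for a vision tower that defaults to SDPA when supported."""

    def __init__(self, config, attn_implementation: str = "sdpa"):
        self.config = config
        self.attn_implementation = attn_implementation


def build_vision_tower(config, attn_implementation: str) -> DummyVisionModel:
    # Before the fix: the tower was built as DummyVisionModel(config), so it
    # fell back to its own default ("sdpa") even when the parent model was
    # running in eager mode, producing a mismatch between the two.
    # After the fix: the parent's setting is forwarded explicitly.
    return DummyVisionModel(config, attn_implementation=attn_implementation)


tower = build_vision_tower(config={}, attn_implementation="eager")
print(tower.attn_implementation)  # eager
```

With the parent's `attn_implementation` forwarded, the language model and vision tower always run with the same attention backend, which is what the CI failure was catching.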