[LoRA][Gemma4] Support vision tower LoRA #42662
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 80413b1c2b
        padding_positions: torch.Tensor,
    ) -> torch.Tensor:
        pixel_values = 2 * (pixel_values - 0.5)
        hidden_states = self.input_proj(pixel_values.to(self.input_proj.weight.dtype))
Avoid reading weight on quantized linear layers
When Gemma4 is loaded with a quantization method whose LinearMethod replaces weight (for example GGUF registers qweight/qweight_type instead of weight), image or video requests will fail here before the vision tower runs because self.input_proj.weight does not exist. Since this commit now passes quant_config into the vision tower, the patch embedder should not use the vLLM linear layer's weight attribute to choose the activation dtype.
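One way to avoid the attribute lookup described above is to record the activation dtype when the module is built instead of reading it from the layer's parameters. The sketch below uses plain-Python stand-ins to show the failure mode and that workaround. `FakeLinear`, `PatchEmbedder`, `activation_dtype`, and the string dtypes are all hypothetical and are not vLLM classes; this is not necessarily the fix adopted in the PR.

```python
from types import SimpleNamespace


class FakeLinear:
    """Hypothetical stand-in for a linear layer whose parameter names
    depend on the quantization method."""

    def __init__(self, param_names, dtype):
        # An unquantized layer registers "weight"; a GGUF-style layer
        # registers "qweight"/"qweight_type" instead.
        for name in param_names:
            setattr(self, name, SimpleNamespace(dtype=dtype))


class PatchEmbedder:
    """Hypothetical embedder that stores its activation dtype explicitly."""

    def __init__(self, input_proj, activation_dtype):
        self.input_proj = input_proj
        self.activation_dtype = activation_dtype

    def dtype_via_weight(self):
        # Fragile: raises AttributeError when no "weight" parameter exists.
        return self.input_proj.weight.dtype

    def dtype_via_config(self):
        # Robust: independent of how the layer stores its parameters.
        return self.activation_dtype


dense = PatchEmbedder(FakeLinear(["weight"], "bfloat16"), "bfloat16")
quantized = PatchEmbedder(
    FakeLinear(["qweight", "qweight_type"], "uint8"), "bfloat16"
)
```

Here `dense.dtype_via_weight()` works, while `quantized.dtype_via_weight()` raises `AttributeError`; `dtype_via_config()` works for both.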
Hi @linitra24, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Code Review
This pull request replaces the vision tower in the Gemma 4 multimodal model with native vLLM modules, including custom implementations for patch embedding, pooling, and multidimensional rotary embeddings. This change enables better integration with vLLM features like LoRA. A compatibility issue was identified where the use of the "strict=True" argument in zip() would cause failures on Python 3.9, which is currently supported by vLLM.
        unsqueeze_dim=unsqueeze_dim,
    )
    for hidden_part, cos_part, sin_part in zip(
        hidden_parts, cos_parts, sin_parts, strict=True
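The `strict` keyword of `zip()` was added in Python 3.10, so the call in this excerpt raises `TypeError` on 3.9. If length checking must be kept on older interpreters, a small wrapper along the following lines could be used. This is a sketch only: `zip_strict` is a hypothetical helper name, not something from the PR or from vLLM.

```python
import sys


def zip_strict(*iterables):
    """Zip iterables, raising ValueError if their lengths differ."""
    if sys.version_info >= (3, 10):
        # Built-in check; the error is raised lazily during iteration.
        return zip(*iterables, strict=True)
    # Fallback for Python 3.9: materialize the inputs and compare lengths
    # eagerly, before any pairs are produced.
    materialized = [list(it) for it in iterables]
    lengths = {len(seq) for seq in materialized}
    if len(lengths) > 1:
        raise ValueError(
            f"zip_strict() arguments have different lengths: {sorted(lengths)}"
        )
    return zip(*materialized)
```

The fallback trades laziness for simplicity, which is usually acceptable for short lists such as per-axis tensor chunks.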
Documentation preview: https://vllm--42662.org.readthedocs.build/en/42662/
jeejeelee left a comment
To speed up this feature landing, maybe you can split the vision tower support into another PR.
This pull request has merge conflicts that must be resolved before it can be merged.
This PR adds the remaining LoRA plumbing needed for Gemma4 multimodal LoRA support.
After #43798, Gemma4-MM vision linear layers are already converted through the Transformers backend path, so this PR no longer reimplements the Gemma4 vision tower. Instead, it focuses on the runtime LoRA mapping and token-counting pieces needed by Gemma4 image/video/audio inputs.
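To illustrate why token counting matters for a LoRA mapping: if adapters are selected per token, the number of tokens each multimodal item contributes has to be known before the mapping can be built. The following is a minimal sketch under that assumption. `expand_lora_mapping`, the adapter ids, and the token counts are hypothetical and are not taken from this PR or from vLLM's API.

```python
from typing import List, Tuple


def expand_lora_mapping(items: List[Tuple[int, int]]) -> List[int]:
    """Expand (lora_id, num_tokens) pairs into one adapter id per token."""
    mapping: List[int] = []
    for lora_id, num_tokens in items:
        if num_tokens < 0:
            raise ValueError("num_tokens must be non-negative")
        # Every token produced by this item uses the same adapter.
        mapping.extend([lora_id] * num_tokens)
    return mapping


# Two hypothetical multimodal items: adapter 1 with 4 tokens,
# adapter 2 with 3 tokens.
token_mapping = expand_lora_mapping([(1, 4), (2, 3)])
```

An incorrect token count for any item would shift every later entry, which is why the count and the mapping have to agree.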
Main changes:
Test Plan
Additional end-to-end tests for real Gemma4 vision LoRA adapters should also be added in a follow-up.
Test Result