[Bugfix] Fix Gemma4 BNB prequantized multimodal weight loading#42825
[Bugfix] Fix Gemma4 BNB prequantized multimodal weight loading#42825skyloevil wants to merge 11 commits into
Conversation
a36710b to
8b90c77
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces 4-bit weight dequantization in BitsAndBytesModelLoader for parameters that do not support packed formats and updates the Gemma4 multi-modal model to propagate quantization settings. Review feedback suggests improving the robustness of parameter name resolution using rfind to prevent incorrect substring replacements, simplifying dictionary access with .get(), and leveraging the cached param_dict in other methods to eliminate redundant calls to named_parameters().
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
c696e30 to
ab6bbad
Compare
|
Codex Review: Something went wrong. Try again later by commenting “@codex review”. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. Keep it up! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@Isotr0py @DarkLight1337 PTAL~ |
|
I'm busy with graduation defense these two days, will have a look tomorrow afternoon. Sry for the inconvenience. 😅 |
No worries at all. Good luck with your graduation defense, and hope everything goes smoothly! |
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
|
Updated the PR accordingly. Gemma4 vision/audio tower I also changed the loader fallback to fail fast when a pre-quantized BNB 4-bit weight targets a parameter that was not initialized as a vLLM BNB-packed parameter. This should avoid silent degradation if a module misses The Gemma4 |
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
|
This pull request has merge conflicts that must be resolved before it can be |
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
| f"checkpoint: {weights_not_loaded}" | ||
| ) | ||
|
|
||
| adjust_quant_state_dict = getattr(model, "adjust_bnb_quant_state_dict", None) |
There was a problem hiding this comment.
This hook keeps the generic BNB loader model-agnostic.
If a model has no special BNB quant-state requirements, nothing happens. If a model implements adjust_bnb_quant_state_dict, the loader gives it a chance to adjust quant_state_dict before the quant states are bound to parameters.
For Gemma4, this is used to handle the model-specific attention_k_eq_v case on the model side, instead of hard-coding Gemma4 logic into the generic BitsAndBytes loader.
|
After the review feedback, I updated the PR in three main ways:
I also verified the fix with vLLM serve on the BNB Gemma4 model. The previous red/blue image regression is resolved: @Isotr0py PTAL~ |
| pixel_values: torch.Tensor, | ||
| pixel_position_ids: torch.Tensor, | ||
| padding_positions: torch.Tensor, | ||
| input_dtype: torch.dtype, |
There was a problem hiding this comment.
A subtle issue came from reusing HF Gemma4 vision code after replacing the vision tower nn.Linear modules with vLLM BNB ReplicatedLinear.
HF Gemma4’s patch embedder does this:
hidden_states = self.input_proj(
pixel_values.to(self.input_proj.weight.dtype)
)That is fine when input_proj is a normal nn.Linear, because weight.dtype is a floating compute dtype such as bf16, fp16, or fp32.
After replacing input_proj with a vLLM BNB 4-bit linear, however, input_proj.weight is the packed NF4 storage tensor, so:
self.input_proj.weight.dtype == torch.uint8
As a result, the HF code casts image activations to uint8 before the patch projection. The image is still accepted by the API, but the visual features are corrupted, which explains why different images such as red and blue rectangles can produce the same wrong answer.
So the key lesson is: for BNB packed vLLM parameters, weight.dtype is a storage dtype, not an activation/compute dtype. The vision path must use the model compute dtype for pixel_values, not the packed weight dtype.
|
Let's wait #43440 then we can simplify most of things in this PR :) |
|
This pull request has merge conflicts that must be resolved before it can be |





Summary
Fixes #42813.
This PR fixes loading of pre-quantized BitsAndBytes 4-bit Gemma4 checkpoints where checkpoint weights are stored as packed uint8 tensors with BNB
quant_state.The loader now distinguishes target parameter capabilities before loading a pre-quantized 4-bit weight:
quant_state.This avoids loading packed tensors such as
[3096576, 1]directly into normal parameters such as[5376, 1152].The PR also passes
quant_configand the correct moduleprefixinto Gemma4 multimodal embedders soembed_vision.embedding_projectioncan use the vLLM BNB Linear path when applicable.Finally, it handles Gemma4
attention_k_eq_vfull-attention layers for pre-quantized BNB checkpoints. These layers may havek_projin the checkpoint without a separatev_proj; vLLM's regular weight path duplicates K into V.For BNB pre-quantized weights, this PR mirrors that behavior for
quant_stateso the fusedqkv_projreceives Q, K, and V quant states instead of producing a short Q+K output at runtime.No open PR was found for this issue or the same fix area.
Tests
vLLM Serve Validation
This should be validated with the same model from #42813.
Start the server with a reduced context length for a functional loading check:
The server should start without failing on packed BNB U8 weight shape mismatch for Gemma4 multimodal weights such as
embed_vision.embedding_projection.weight.Then verify the OpenAI-compatible server is healthy:
Verify the model is registered:
Run a minimal completion request to verify the loaded model can serve:
Expected result: the server starts successfully,
/healthreturns success,/v1/modelslistsgemma4-bnb, and the chat completion request returns a normal response instead of a model-loading error.Image + Text Multimodal Check
Create a simple image:
Send an image+text request:
Observed response:

This verifies that the image processor, vision tower,
embed_vision.embedding_projection, language model, and OpenAI-compatible image+text request path still work after the BNB loading changes.