[Bugfix] Fix Gemma4 BNB prequantized multimodal weight loading#42825

Closed
skyloevil wants to merge 11 commits into vllm-project:main from skyloevil:fix-bnb-prequant-mm-loading

@skyloevil skyloevil commented May 16, 2026

Contributor

Summary

Fixes #42813.

This PR fixes loading of pre-quantized BitsAndBytes 4-bit Gemma4 checkpoints where checkpoint weights are stored as packed uint8 tensors with BNB quant_state.

The loader now distinguishes target parameter capabilities before loading a pre-quantized 4-bit weight:

  • vLLM BNB 4-bit parameters keep the packed uint8 tensor and associated quant_state.
  • Normal torch parameters dequantize the packed BNB tensor before loading.

This avoids loading packed tensors such as [3096576, 1] directly into normal parameters such as [5376, 1152].
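
The two shapes above are consistent with 4-bit packing, where two 4-bit values share one uint8 byte. A quick arithmetic check, assuming the packed tensor is stored as a single column (the helper name is illustrative and is not a vLLM or bitsandbytes API):

```python
def packed_4bit_shape(out_features: int, in_features: int) -> tuple[int, int]:
    """Shape of a 4-bit weight packed two values per uint8 byte, as one column."""
    num_values = out_features * in_features
    if num_values % 2:
        raise ValueError("element count must be even to pack two values per byte")
    return (num_values // 2, 1)


# The [5376, 1152] parameter from the description packs into [3096576, 1].
print(packed_4bit_shape(5376, 1152))  # (3096576, 1)
```

Copying the packed tensor into the unpacked parameter therefore fails on shape alone, before any question of dtype arises.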

The PR also passes quant_config and the correct module prefix into Gemma4 multimodal embedders so embed_vision.embedding_projection can use the vLLM BNB Linear path when applicable.

Finally, it handles Gemma4 attention_k_eq_v full-attention layers for pre-quantized BNB checkpoints. These layers may have k_proj in the checkpoint without a separate v_proj; vLLM's regular weight path duplicates K into V.
For BNB pre-quantized weights, this PR mirrors that behavior for quant_state so the fused qkv_proj receives Q, K, and V quant states instead of producing a short Q+K output at runtime.
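
A minimal sketch of that mirroring step, assuming quant states live in a dict keyed by parameter name. The key names are hypothetical, plain dicts stand in for bitsandbytes quant-state objects, and the sketch mirrors every k_proj that lacks a v_proj sibling rather than checking attention_k_eq_v per layer:

```python
import copy


def mirror_k_quant_state_to_v(quant_state_dict: dict) -> dict:
    """Return a copy with a v_proj entry added for each k_proj that lacks one."""
    adjusted = dict(quant_state_dict)
    for name, state in quant_state_dict.items():
        # rpartition rewrites only the last ".k_proj." occurrence in the name.
        head, sep, tail = name.rpartition(".k_proj.")
        if not sep:
            continue
        v_name = f"{head}.v_proj.{tail}"
        if v_name not in adjusted:
            adjusted[v_name] = copy.deepcopy(state)
    return adjusted


states = {
    "layers.5.self_attn.q_proj.weight": {"shape": (4096, 2048)},
    "layers.5.self_attn.k_proj.weight": {"shape": (512, 2048)},
}
print(sorted(mirror_k_quant_state_to_v(states)))
```

With all three entries present, a fused qkv_proj can be given one quant state per shard instead of stopping after Q and K.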

No open PR was found for this issue or the same fix area.

Tests

VLLM_USE_MODELSCOPE=False .venv/bin/python -m pytest tests/model_executor/model_loader/test_bitsandbytes_loader.py -q
6 passed

vLLM Serve Validation

This should be validated with the same model from #42813.

Start the server with a reduced context length for a functional loading check:

CUDA_VISIBLE_DEVICES=0 vllm serve "$MODEL_DIR" \
  --served-model-name gemma4-bnb \
  --host 0.0.0.0 \
  --port 8000 \
  --quantization bitsandbytes \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 1 \
  --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --attention-backend TRITON_ATTN \
  --default-chat-template-kwargs '{"enable_thinking": false}'

The server should start without failing on packed BNB U8 weight shape mismatch for Gemma4 multimodal weights such as embed_vision.embedding_projection.weight.

Then verify the OpenAI-compatible server is healthy:

curl -s http://localhost:8000/health

Verify the model is registered:

curl -s http://localhost:8000/v1/models

Run a minimal completion request to verify the loaded model can serve:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-bnb",
    "messages": [
      {
        "role": "user",
        "content": "Reply with OK."
      }
    ],
    "max_tokens": 8,
    "temperature": 0
  }'

Expected result: the server starts successfully, /health returns success, /v1/models lists gemma4-bnb, and the chat completion request returns a normal response instead of a model-loading error.

Image + Text Multimodal Check

Create a simple image:

.venv/bin/python - <<'PY'
from PIL import Image, ImageDraw

img = Image.new("RGB", (512, 512), "white")
draw = ImageDraw.Draw(img)
draw.rectangle((120, 160, 390, 360), fill="red")
draw.text((150, 380), "RED BOX", fill="black")
img.save("/workspace/test.jpg")
PY

IMG_B64=$(base64 -w 0 /workspace/test.jpg)

Send an image+text request:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gemma4-bnb\",
    \"messages\": [
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"image_url\",
            \"image_url\": {
              \"url\": \"data:image/jpeg;base64,$IMG_B64\"
            }
          },
          {
            \"type\": \"text\",
            \"text\": \"The image is attached above. What color is the main shape? Answer with only the color and object name.\"
          }
        ]
      }
    ],
    \"max_tokens\": 32,
    \"temperature\": 0
  }" | jq

Observed response: [image]

This verifies that the image processor, vision tower, embed_vision.embedding_projection, language model, and OpenAI-compatible image+text request path still work after the BNB loading changes.
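
The image+text request body above can also be built without shell quoting. A minimal sketch in Python; the function name is illustrative, while the model name, field layout, and sampling settings follow the curl command:

```python
import base64
import json


def build_image_chat_payload(image_bytes: bytes, prompt: str,
                             model: str = "gemma4-bnb") -> str:
    """Return the JSON body for an OpenAI-compatible image+text chat request."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 32,
        "temperature": 0,
    }
    return json.dumps(payload)


# Placeholder bytes; in practice read them from /workspace/test.jpg.
body = build_image_chat_payload(b"\xff\xd8\xff", "What color is the main shape?")
print(body[:60])
```

The resulting string can be written to a file and passed to curl with `-d @file`, which avoids the escaped quotes in the inline version.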

@skyloevil skyloevil force-pushed the fix-bnb-prequant-mm-loading branch from a36710b to 8b90c77 Compare May 16, 2026 08:08
@skyloevil skyloevil changed the title Fix BNB prequantized multimodal weight loading [Fix] BNB prequantized multimodal weight loading May 16, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces 4-bit weight dequantization in BitsAndBytesModelLoader for parameters that do not support packed formats and updates the Gemma4 multi-modal model to propagate quantization settings. Review feedback suggests improving the robustness of parameter name resolution using rfind to prevent incorrect substring replacements, simplifying dictionary access with .get(), and leveraging the cached param_dict in other methods to eliminate redundant calls to named_parameters().

Comment thread vllm/model_executor/model_loader/bitsandbytes_loader.py Outdated
Comment thread vllm/model_executor/model_loader/bitsandbytes_loader.py
Comment thread vllm/model_executor/model_loader/bitsandbytes_loader.py
@skyloevil

Contributor Author

@codex

skyloevil and others added 4 commits May 16, 2026 16:42
Co-authored-by: Codex <codex@openai.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
@skyloevil skyloevil force-pushed the fix-bnb-prequant-mm-loading branch from c696e30 to ab6bbad Compare May 16, 2026 08:43
@chatgpt-codex-connector


Codex Review: Something went wrong. Try again later by commenting “@codex review”.

Unknown error
ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Co-authored-by: Codex <codex@openai.com>

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
@skyloevil skyloevil marked this pull request as ready for review May 17, 2026 15:34
@skyloevil skyloevil requested a review from 22quinn as a code owner May 17, 2026 15:34
@skyloevil

Contributor Author

@codex review

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. Keep it up!

@skyloevil skyloevil changed the title [Fix] BNB prequantized multimodal weight loading [Fix] Gemma4 BNB prequantized multimodal weight loading May 17, 2026
@skyloevil skyloevil changed the title [Fix] Gemma4 BNB prequantized multimodal weight loading [Bugfix] Fix Gemma4 BNB prequantized multimodal weight loading May 17, 2026
@mergify mergify Bot added the bug Something isn't working label May 17, 2026
@skyloevil

Contributor Author

@Isotr0py @DarkLight1337 PTAL~

@Isotr0py

Member

I'm busy with graduation defense these two days, will have a look tomorrow afternoon. Sry for the inconvenience. 😅

@Isotr0py Isotr0py self-assigned this May 18, 2026
@skyloevil

Contributor Author

> I'm busy with graduation defense these two days, will have a look tomorrow afternoon. Sry for the inconvenience.

No worries at all. Good luck with your graduation defense, and hope everything goes smoothly!

Comment thread vllm/model_executor/models/gemma4_mm.py
Comment thread vllm/model_executor/model_loader/bitsandbytes_loader.py Outdated
Comment thread vllm/model_executor/models/gemma4_mm.py
skyloevil added 2 commits May 21, 2026 18:02
Co-authored-by: Codex <codex@openai.com>

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Codex <codex@openai.com>

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
@skyloevil

Contributor Author

Updated the PR accordingly.

Gemma4 vision/audio tower nn.Linear modules are now replaced with vLLM ReplicatedLinear under bitsandbytes quantization, so pre-quantized BNB weights can keep the packed representation instead of being silently dequantized.

I also changed the loader fallback to fail fast when a pre-quantized BNB 4-bit weight targets a parameter that was not initialized as a vLLM BNB-packed parameter. This should avoid silent degradation if a module misses quant_config or linear replacement.
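
The fail-fast rule can be sketched as a small check. This is an illustration under assumptions, not the loader's actual code: `TargetParam`, `check_prequant_target`, and the second parameter name below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TargetParam:
    """Hypothetical summary of what a loader knows about a target parameter."""
    name: str
    is_bnb_packed: bool  # True if created as a BNB-packed 4-bit parameter


def check_prequant_target(param: TargetParam, has_quant_state: bool) -> None:
    """Raise instead of dequantizing into a regular parameter."""
    if has_quant_state and not param.is_bnb_packed:
        raise ValueError(
            f"Pre-quantized BNB 4-bit weight targets {param.name!r}, which was "
            "not initialized as a BNB-packed parameter; check that the module "
            "received quant_config and uses a vLLM linear layer."
        )


# A packed target passes silently; a plain target aborts the load.
check_prequant_target(
    TargetParam("embed_vision.embedding_projection.weight", True), True
)
try:
    check_prequant_target(TargetParam("vision_tower.some_linear.weight", False), True)
except ValueError as exc:
    print(f"load aborted: {exc}")
```

Raising here turns a missed quant_config or linear replacement into a visible startup error rather than a quiet dequantization.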

The Gemma4 attention_k_eq_v quant_state duplication has also been moved out of the generic BNB loader into a Gemma4 model-side hook.

@skyloevil

skyloevil commented May 21, 2026

Contributor Author

The PR fixes the startup mismatch, but packed BNB inference for the Gemma4 vision tower still needs correctness validation.

color validation [2 images]

cat validation [2 images]

I tested the tower replacement path with simple image prompts. The server starts and accepts image inputs, but visual semantics are unreliable: red/blue rectangles are both answered as blue, OCR is incorrect, and a simple cat image is answered as None.

Co-authored-by: Codex <codex@openai.com>

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
@mergify

mergify Bot commented May 21, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @skyloevil.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 21, 2026
Co-authored-by: Codex <codex@openai.com>

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
@skyloevil skyloevil marked this pull request as draft May 21, 2026 17:37
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
@mergify mergify Bot removed the needs-rebase label May 22, 2026
Co-authored-by: Codex <codex@openai.com>

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
@skyloevil

skyloevil commented May 22, 2026

Contributor Author

This update keeps the new batched Gemma4 vision path from main and only fixes the BNB activation dtype handling.

The issue is that HF Gemma4’s patch embedder casts image activations with:

pixel_values.to(self.input_proj.weight.dtype)

After replacing input_proj with vLLM BNB ReplicatedLinear, weight.dtype becomes the packed storage dtype uint8, not the compute dtype. That corrupts image activations before patch projection.

So this patch uses the model compute dtype for BNB image/video activations, while leaving the non-BNB path unchanged.

Problem solved.

f"checkpoint: {weights_not_loaded}"
)

adjust_quant_state_dict = getattr(model, "adjust_bnb_quant_state_dict", None)

Contributor Author

This hook keeps the generic BNB loader model-agnostic.

If a model has no special BNB quant-state requirements, nothing happens. If a model implements adjust_bnb_quant_state_dict, the loader gives it a chance to adjust quant_state_dict before the quant states are bound to parameters.

For Gemma4, this is used to handle the model-specific attention_k_eq_v case on the model side, instead of hard-coding Gemma4 logic into the generic BitsAndBytes loader.
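
A minimal sketch of the optional-hook pattern described above. Only the attribute name adjust_bnb_quant_state_dict comes from the diff; the assumption that the hook takes the dict and returns an adjusted dict, plus the class and key names, are illustrative.

```python
from typing import Any


def apply_bnb_quant_state_hook(model: Any, quant_state_dict: dict) -> dict:
    """Let the model adjust quant states if it defines the optional hook."""
    hook = getattr(model, "adjust_bnb_quant_state_dict", None)
    if hook is None:
        return quant_state_dict  # models without the hook are unaffected
    return hook(quant_state_dict)


class PlainModel:
    """Model with no special BNB quant-state requirements."""


class ModelWithHook:
    """Model that adds an entry; the key names here are made up."""

    def adjust_bnb_quant_state_dict(self, quant_state_dict: dict) -> dict:
        adjusted = dict(quant_state_dict)
        adjusted.setdefault("attn.v_proj.weight", adjusted.get("attn.k_proj.weight"))
        return adjusted


states = {"attn.k_proj.weight": "k-state"}
print(apply_bnb_quant_state_hook(PlainModel(), states))
print(apply_bnb_quant_state_hook(ModelWithHook(), states))
```

Looking the hook up with getattr keeps the loader free of model-specific imports: a model opts in simply by defining the method.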

@skyloevil skyloevil marked this pull request as ready for review May 22, 2026 15:43
@skyloevil skyloevil requested a review from Isotr0py May 22, 2026 15:43
@skyloevil

skyloevil commented May 22, 2026

Contributor Author

After the review feedback, I updated the PR in three main ways:

  1. Replaced Gemma4 vision/audio tower nn.Linear modules with vLLM ReplicatedLinear under BitsAndBytes quantization, instead of silently dequantizing pre-quantized BNB weights into regular torch parameters.

  2. Moved the Gemma4-specific attention_k_eq_v BNB quant-state adjustment out of the generic BitsAndBytes loader and into Gemma4 model-side hooks, keeping the loader model-agnostic.

  3. Fixed the BNB multimodal activation dtype path. HF Gemma4’s vision patch embedder casts image activations using input_proj.weight.dtype, but after BNB replacement that weight is packed uint8 storage, not a compute dtype.

I also verified the fix with vLLM serve on the BNB Gemma4 model. The previous red/blue image regression is resolved:

image=red  -> Red
image=blue -> Blue

@Isotr0py PTAL~

pixel_values: torch.Tensor,
pixel_position_ids: torch.Tensor,
padding_positions: torch.Tensor,
input_dtype: torch.dtype,

Contributor Author

A subtle issue came from reusing HF Gemma4 vision code after replacing the vision tower nn.Linear modules with vLLM BNB ReplicatedLinear.

HF Gemma4’s patch embedder does this:

hidden_states = self.input_proj(
    pixel_values.to(self.input_proj.weight.dtype)
)

That is fine when input_proj is a normal nn.Linear, because weight.dtype is a floating compute dtype such as bf16, fp16, or fp32.

After replacing input_proj with a vLLM BNB 4-bit linear, however, input_proj.weight is the packed NF4 storage tensor, so:

self.input_proj.weight.dtype == torch.uint8

As a result, the HF code casts image activations to uint8 before the patch projection. The image is still accepted by the API, but the visual features are corrupted, which explains why different images such as red and blue rectangles can produce the same wrong answer.

So the key lesson is: for BNB packed vLLM parameters, weight.dtype is a storage dtype, not an activation/compute dtype. The vision path must use the model compute dtype for pixel_values, not the packed weight dtype.
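
A small pure-Python model of why the cast is destructive. It assumes a float-to-uint8 cast truncates toward zero, as C-style conversion does for in-range values, and that normalized pixel values lie strictly inside (-1, 1). The pixel values, the helper names, and the use of strings in place of torch dtypes are all illustrative.

```python
def cast_to_uint8(values: list[float]) -> list[int]:
    """Model a float -> uint8 cast for in-range values: truncate toward zero."""
    return [int(v) for v in values]


def activation_dtype(weight_dtype: str, compute_dtype: str) -> str:
    """Pick the dtype for activations; packed uint8 storage is never valid here."""
    return compute_dtype if weight_dtype == "uint8" else weight_dtype


# Hypothetical normalized RGB pixels for a reddish and a bluish image.
reddish = [0.9, -0.8, -0.8]
bluish = [-0.8, -0.8, 0.9]

# Both collapse to the same values, so the two images become indistinguishable.
print(cast_to_uint8(reddish))  # [0, 0, 0]
print(cast_to_uint8(bluish))   # [0, 0, 0]

print(activation_dtype("uint8", "bfloat16"))     # bfloat16
print(activation_dtype("bfloat16", "bfloat16"))  # bfloat16
```

This matches the symptom reported earlier in the thread, where red and blue rectangles received the same answer.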

@Isotr0py

Member

Let's wait for #43440, then we can simplify most of the things in this PR :)

@skyloevil

Contributor Author

> Let's wait for #43440, then we can simplify most of the things in this PR :)

Given #43798 now covers this fix in a simpler and more maintainable way, I think we can close #42825 and focus further on #43798.

@skyloevil skyloevil marked this pull request as draft June 2, 2026 06:13
@mergify

mergify Bot commented Jun 2, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @skyloevil.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 2, 2026
@skyloevil skyloevil closed this Jun 2, 2026
Labels

bug (Something isn't working), needs-rebase

2 participants