[Bugfix] Fix Gemma4 BNB prequantized multimodal weight loading by skyloevil · Pull Request #42825 · vllm-project/vllm

skyloevil · 2026-05-16T08:07:59Z

Summary

This PR fixes loading of pre-quantized BitsAndBytes 4-bit Gemma4 checkpoints where checkpoint weights are stored as packed uint8 tensors with BNB quant_state.

The loader now distinguishes target parameter capabilities before loading a pre-quantized 4-bit weight:

vLLM BNB 4-bit parameters keep the packed uint8 tensor and associated quant_state.
Normal torch parameters dequantize the packed BNB tensor before loading.

This avoids loading packed tensors such as [3096576, 1] directly into normal parameters such as [5376, 1152].
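The two shapes in that example are consistent with 4-bit packing: two 4-bit values fit in one byte, so a [5376, 1152] weight should occupy 5376 × 1152 / 2 bytes. A quick check in plain Python (the helper name is illustrative and not part of vLLM or bitsandbytes):

```python
def packed_4bit_shape(rows: int, cols: int) -> tuple[int, int]:
    """Shape of a flattened uint8 column holding rows * cols 4-bit values."""
    numel = rows * cols
    # Two 4-bit values share one byte; round up in case numel is odd.
    return ((numel + 1) // 2, 1)


print(packed_4bit_shape(5376, 1152))  # (3096576, 1)
```

This matches the [3096576, 1] tensor reported in #42813, which suggests the mismatch is a packed-versus-unpacked layout problem and not a wrong checkpoint.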

The PR also passes quant_config and the correct module prefix into Gemma4 multimodal embedders so embed_vision.embedding_projection can use the vLLM BNB Linear path when applicable.

Finally, it handles Gemma4 attention_k_eq_v full-attention layers for pre-quantized BNB checkpoints. These layers may have k_proj in the checkpoint without a separate v_proj; vLLM's regular weight path duplicates K into V.
For BNB pre-quantized weights, this PR mirrors that behavior for quant_state so the fused qkv_proj receives Q, K, and V quant states instead of producing a short Q+K output at runtime.

No open PR was found for this issue or the same fix area.

Tests

VLLM_USE_MODELSCOPE=False .venv/bin/python -m pytest tests/model_executor/model_loader/test_bitsandbytes_loader.py -q

6 passed

vLLM Serve Validation

This should be validated with the same model from #42813.

Start the server with a reduced context length for a functional loading check:

CUDA_VISIBLE_DEVICES=0 vllm serve "$MODEL_DIR" \
  --served-model-name gemma4-bnb \
  --host 0.0.0.0 \
  --port 8000 \
  --quantization bitsandbytes \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 1 \
  --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --attention-backend TRITON_ATTN \
  --default-chat-template-kwargs '{"enable_thinking": false}'

The server should start without failing on a packed BNB uint8 weight shape mismatch for Gemma4 multimodal weights such as embed_vision.embedding_projection.weight.

Then verify the OpenAI-compatible server is healthy:

curl -s http://localhost:8000/health

Verify the model is registered:

curl -s http://localhost:8000/v1/models

Run a minimal completion request to verify the loaded model can serve:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-bnb",
    "messages": [
      {
        "role": "user",
        "content": "Reply with OK."
      }
    ],
    "max_tokens": 8,
    "temperature": 0
  }'

Expected result: the server starts successfully, /health returns success, /v1/models lists gemma4-bnb, and the chat completion request returns a normal response instead of a model-loading error.

Image + Text Multimodal Check

Create a simple image:

.venv/bin/python - <<'PY'
from PIL import Image, ImageDraw

img = Image.new("RGB", (512, 512), "white")
draw = ImageDraw.Draw(img)
draw.rectangle((120, 160, 390, 360), fill="red")
draw.text((150, 380), "RED BOX", fill="black")
img.save("/workspace/test.jpg")
PY

IMG_B64=$(base64 -w 0 /workspace/test.jpg)

Send an image+text request:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gemma4-bnb\",
    \"messages\": [
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"image_url\",
            \"image_url\": {
              \"url\": \"data:image/jpeg;base64,$IMG_B64\"
            }
          },
          {
            \"type\": \"text\",
            \"text\": \"The image is attached above. What color is the main shape? Answer with only the color and object name.\"
          }
        ]
      }
    ],
    \"max_tokens\": 32,
    \"temperature\": 0
  }" | jq

Observed response:

This verifies that the image processor, vision tower, embed_vision.embedding_projection, language model, and OpenAI-compatible image+text request path still work after the BNB loading changes.
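The shell-escaped JSON in the image+text request can be awkward to edit. As an alternative, here is a minimal sketch that builds the same request body in Python. The function name is hypothetical, and the three placeholder bytes only stand in for a real JPEG file:

```python
import base64
import json


def build_image_chat_payload(image_bytes: bytes, model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat request with one inline JPEG and one text part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 32,
        "temperature": 0,
    }


# Placeholder bytes (JPEG magic number only); read the real test image instead.
payload = build_image_chat_payload(
    b"\xff\xd8\xff", "gemma4-bnb", "What color is the main shape?"
)
body = json.dumps(payload)
```

The resulting `body` string can then be POSTed to /v1/chat/completions with any HTTP client, using the same Content-Type header as the curl command.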

gemini-code-assist

Code Review

This pull request introduces 4-bit weight dequantization in BitsAndBytesModelLoader for parameters that do not support packed formats and updates the Gemma4 multi-modal model to propagate quantization settings. Review feedback suggests improving the robustness of parameter name resolution using rfind to prevent incorrect substring replacements, simplifying dictionary access with .get(), and leveraging the cached param_dict in other methods to eliminate redundant calls to named_parameters().

skyloevil · 2026-05-16T08:39:48Z

@codex

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

chatgpt-codex-connector · 2026-05-16T08:44:46Z

Codex Review: Something went wrong. Try again later by commenting “@codex review”.

Unknown error

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

skyloevil · 2026-05-17T15:36:26Z

@codex review

chatgpt-codex-connector · 2026-05-17T15:42:59Z

Codex Review: Didn't find any major issues. Keep it up!

skyloevil · 2026-05-18T03:12:15Z

@Isotr0py @DarkLight1337 PTAL~

Isotr0py · 2026-05-18T10:30:07Z

I'm busy with graduation defense these two days, will have a look tomorrow afternoon. Sry for the inconvenience. 😅

skyloevil · 2026-05-18T10:40:01Z

> I'm busy with graduation defense these two days, will have a look tomorrow afternoon. Sry for the inconvenience.

No worries at all. Good luck with your graduation defense, and hope everything goes smoothly!

skyloevil · 2026-05-21T16:01:33Z

Updated the PR accordingly.

Gemma4 vision/audio tower nn.Linear modules are now replaced with vLLM ReplicatedLinear under bitsandbytes quantization, so pre-quantized BNB weights can keep the packed representation instead of being silently dequantized.

I also changed the loader fallback to fail fast when a pre-quantized BNB 4-bit weight targets a parameter that was not initialized as a vLLM BNB-packed parameter. This should avoid silent degradation if a module misses quant_config or linear replacement.
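A minimal sketch of what such a fail-fast check could look like. This is not the PR's code: the marker attribute name `use_bitsandbytes_4bit` is an assumption about how packed parameters are tagged, and `FakeParam` is a stand-in for a real torch parameter:

```python
class FakeParam:
    """Stand-in for a torch parameter; only the marker attribute matters here."""

    def __init__(self, packed_4bit: bool):
        if packed_4bit:
            # Assumed marker name; treat it as illustrative.
            self.use_bitsandbytes_4bit = True


def check_prequant_target(name: str, param: object) -> None:
    """Raise instead of silently dequantizing into a regular parameter."""
    if not getattr(param, "use_bitsandbytes_4bit", False):
        raise ValueError(
            f"Pre-quantized BNB 4-bit weight {name!r} targets a parameter that "
            "was not initialized as a BNB-packed parameter; check that the "
            "module received quant_config and was replaced with a vLLM linear."
        )


check_prequant_target("embed_vision.embedding_projection.weight", FakeParam(True))
```

Raising at load time surfaces a missing quant_config or linear replacement immediately, instead of showing up later as a shape mismatch or degraded output.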

The Gemma4 attention_k_eq_v quant_state duplication has also been moved out of the generic BNB loader into a Gemma4 model-side hook.

skyloevil · 2026-05-21T16:52:13Z

The PR fixes the startup mismatch, but packed BNB inference for Gemma4 vision tower still needs correctness validation.

[Image: color validation]

[Image: cat validation]

I tested the tower replacement path with simple image prompts. The server starts and accepts image inputs, but visual semantics are unreliable: red/blue rectangles are both answered as blue, OCR is incorrect, and a simple cat image is answered as None.

mergify · 2026-05-21T17:13:25Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @skyloevil.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

skyloevil · 2026-05-22T14:43:43Z

This update keeps the new batched Gemma4 vision path from main and only fixes the BNB activation dtype handling.

The issue is that HF Gemma4’s patch embedder casts image activations with:

pixel_values.to(self.input_proj.weight.dtype)

After replacing input_proj with vLLM BNB ReplicatedLinear, weight.dtype becomes the packed storage dtype uint8, not the compute dtype. That corrupts image activations before patch projection.

So this patch uses the model compute dtype for BNB image/video activations, while leaving the non-BNB path unchanged.

Problem solved.

skyloevil · 2026-05-22T15:39:06Z

                    f"checkpoint: {weights_not_loaded}"
                )
+
+        adjust_quant_state_dict = getattr(model, "adjust_bnb_quant_state_dict", None)


This hook keeps the generic BNB loader model-agnostic.

If a model has no special BNB quant-state requirements, nothing happens. If a model implements adjust_bnb_quant_state_dict, the loader gives it a chance to adjust quant_state_dict before the quant states are bound to parameters.

For Gemma4, this is used to handle the model-specific attention_k_eq_v case on the model side, instead of hard-coding Gemma4 logic into the generic BitsAndBytes loader.
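The optional-hook pattern described here can be sketched as follows. Only the hook name `adjust_bnb_quant_state_dict` comes from the diff; the hook's signature (takes and returns a dict), the toy model, and the key names are assumptions for illustration:

```python
from typing import Any


def apply_model_quant_state_hook(model: Any, quant_state_dict: dict) -> dict:
    """Let the model adjust quant states, if it opts in via the hook."""
    hook = getattr(model, "adjust_bnb_quant_state_dict", None)
    if hook is None:
        # Models without special requirements are left untouched.
        return quant_state_dict
    return hook(quant_state_dict)


class ToyKEqVModel:
    """Toy model whose checkpoint has k_proj but no separate v_proj."""

    def adjust_bnb_quant_state_dict(self, quant_state_dict: dict) -> dict:
        adjusted = dict(quant_state_dict)
        for name, state in quant_state_dict.items():
            if ".k_proj." in name:
                # Reuse K's quant state for V so a fused QKV sees all three.
                adjusted.setdefault(name.replace(".k_proj.", ".v_proj."), state)
        return adjusted


states = {"layers.0.self_attn.k_proj.weight": "K-state"}
print(apply_model_quant_state_hook(ToyKEqVModel(), states))
```

Because the loader only looks up the attribute by name, it needs no Gemma4-specific branches.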

skyloevil · 2026-05-22T15:51:09Z

After the review feedback, I updated the PR in three main ways:

Replaced Gemma4 vision/audio tower nn.Linear modules with vLLM ReplicatedLinear under BitsAndBytes quantization, instead of silently dequantizing pre-quantized BNB weights into regular torch parameters.
Moved the Gemma4-specific attention_k_eq_v BNB quant-state adjustment out of the generic BitsAndBytes loader and into Gemma4 model-side hooks, keeping the loader model-agnostic.
Fixed the BNB multimodal activation dtype path. HF Gemma4’s vision patch embedder casts image activations using input_proj.weight.dtype, but after BNB replacement that weight is packed uint8 storage, not a compute dtype.

I also verified the fix with vLLM serve on the BNB Gemma4 model. The previous red/blue image regression is resolved:

image=red  -> Red
image=blue -> Blue

@Isotr0py PTAL~

skyloevil · 2026-05-23T01:20:11Z

+    pixel_values: torch.Tensor,
+    pixel_position_ids: torch.Tensor,
+    padding_positions: torch.Tensor,
+    input_dtype: torch.dtype,


A subtle issue came from reusing HF Gemma4 vision code after replacing the vision tower nn.Linear modules with vLLM BNB ReplicatedLinear.

HF Gemma4’s patch embedder does this:

hidden_states = self.input_proj(pixel_values.to(self.input_proj.weight.dtype))

That is fine when input_proj is a normal nn.Linear, because weight.dtype is a floating compute dtype such as bf16, fp16, or fp32.

After replacing input_proj with a vLLM BNB 4-bit linear, however, input_proj.weight is the packed NF4 storage tensor, so:

self.input_proj.weight.dtype == torch.uint8

As a result, the HF code casts image activations to uint8 before the patch projection. The image is still accepted by the API, but the visual features are corrupted, which explains why different images such as red and blue rectangles can produce the same wrong answer.

So the key lesson is: for BNB packed vLLM parameters, weight.dtype is a storage dtype, not an activation/compute dtype. The vision path must use the model compute dtype for pixel_values, not the packed weight dtype.
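To illustrate why a uint8 cast can make different images look the same, here is a small emulation in plain Python. It assumes pixel values have already been rescaled into [0, 1), which is an assumption about preprocessing and not something stated in this PR; `int()` stands in for the truncating float-to-uint8 cast:

```python
def cast_like_uint8(values: list[float]) -> list[int]:
    """Emulate a float -> uint8 cast for non-negative, in-range values."""
    # Truncation toward zero discards the fractional part entirely.
    return [int(v) for v in values]


# Hypothetical rescaled RGB values for a mostly-red and a mostly-blue pixel.
red_pixel = [0.9, 0.1, 0.1]
blue_pixel = [0.1, 0.1, 0.9]

print(cast_like_uint8(red_pixel))   # [0, 0, 0]
print(cast_like_uint8(blue_pixel))  # [0, 0, 0]
```

Under that assumption both pixels collapse to the same values before the patch projection, which is consistent with the red and blue rectangles receiving the same answer.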

Isotr0py · 2026-05-23T15:19:26Z

Let's wait #43440 then we can simplify most of things in this PR :)

skyloevil · 2026-05-30T04:09:41Z

> Let's wait #43440 then we can simplify most of things in this PR :)

Given #43798 now covers this fix in a simpler and more maintainable way, I think we can close #42825 and focus further on #43798.

mergify · 2026-06-02T06:15:48Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @skyloevil.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

skyloevil force-pushed the fix-bnb-prequant-mm-loading branch from a36710b to 8b90c77 Compare May 16, 2026 08:08

skyloevil changed the title ~~Fix BNB prequantized multimodal weight loading~~ [Fix] BNB prequantized multimodal weight loading May 16, 2026

gemini-code-assist Bot reviewed May 16, 2026

Comment thread vllm/model_executor/model_loader/bitsandbytes_loader.py Outdated

Comment thread vllm/model_executor/model_loader/bitsandbytes_loader.py

Comment thread vllm/model_executor/model_loader/bitsandbytes_loader.py

skyloevil and others added 4 commits May 16, 2026 16:42

Fix BNB prequantized multimodal weight loading

8c561c5

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

Add BNB prequantized load path debug logs

3a93ec3

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

Relax BNB loader debug log tests

721c25b

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

Address BNB loader review feedback

ab6bbad

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

skyloevil force-pushed the fix-bnb-prequant-mm-loading branch from c696e30 to ab6bbad Compare May 16, 2026 08:43

Handle Gemma4 k_eq_v BNB quant states

15c2957

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

skyloevil marked this pull request as ready for review May 17, 2026 15:34

skyloevil requested a review from 22quinn as a code owner May 17, 2026 15:34

skyloevil changed the title ~~[Fix] BNB prequantized multimodal weight loading~~ [Fix] Gemma4 BNB prequantized multimodal weight loading May 17, 2026

skyloevil changed the title ~~[Fix] Gemma4 BNB prequantized multimodal weight loading~~ [Bugfix] Fix Gemma4 BNB prequantized multimodal weight loading May 17, 2026

mergify Bot added the bug Something isn't working label May 17, 2026

skyloevil mentioned this pull request May 17, 2026

[Bug]: Gemma under NF4 quantization fails to load with AssertionError: Tried to load weights of size torch.Size([3096576, 1])to a parameter of size torch.Size([5376, 1152]) #42813

Closed

1 task

Isotr0py self-assigned this May 18, 2026

Isotr0py reviewed May 19, 2026

Comment thread vllm/model_executor/models/gemma4_mm.py

Isotr0py reviewed May 21, 2026

Comment thread vllm/model_executor/model_loader/bitsandbytes_loader.py Outdated

Comment thread vllm/model_executor/models/gemma4_mm.py

skyloevil added 2 commits May 21, 2026 18:02

Address Gemma4 BNB quantization review

39216c7

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

Remove BNB loader debug logs

cd06ce9

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

Use model dtype for Gemma4 multimodal projection inputs

d9d9fdb

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

mergify Bot added the needs-rebase label May 21, 2026

Cast Gemma4 BNB vision inputs to model dtype

42b9728

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

skyloevil marked this pull request as draft May 21, 2026 17:37

Merge branch 'main' into fix-bnb-prequant-mm-loading

9bc1993

Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>

mergify Bot removed the needs-rebase label May 22, 2026

Fix Gemma4 BNB vision activation dtype

119d557

Co-authored-by: Codex <codex@openai.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

skyloevil commented May 22, 2026

skyloevil marked this pull request as ready for review May 22, 2026 15:43

skyloevil requested a review from Isotr0py May 22, 2026 15:43

skyloevil commented May 23, 2026

Isotr0py mentioned this pull request May 27, 2026

[Bugfix] Convert Gemma4-MM ViT linear layers to vllm native impl #43798

Merged

4 tasks

skyloevil marked this pull request as draft June 2, 2026 06:13

mergify Bot added the needs-rebase label Jun 2, 2026

skyloevil closed this Jun 2, 2026
