Conversation
@DarkLight1337 @ywang96 Thanks for all the fixes!
Hi @WoosukKwon, can you explain the main difference between the two approaches? Thanks :-)
Do we know when the latest Docker image will be published?
The next release is very soon: https://github.com/vllm-project/vllm/milestone/1 |
How can I run Gemma 3 with a backend other than xFormers? It runs very slowly, and it doesn't start with FlashInfer.
Having the same problem.
Can you try setting
@DarkLight1337 |
The error message shows that your
@DarkLight1337 I understand that perfectly, but the Gemma 3 model has a larger context window: a total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size.
I think the max model len here corresponds to the sliding-window length, not the total context length.
In general, try running the model with any other backend and you'll see that it doesn't work, while xFormers is terribly slow.
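For anyone hitting the backend question above, here is a minimal sketch (my own assumption, not something confirmed in this thread) of forcing a specific attention backend through the VLLM_ATTENTION_BACKEND environment variable; whether a given backend actually supports Gemma 3's interleaved sliding-window attention is a separate issue:

```python
# Sketch: select the attention backend via VLLM_ATTENTION_BACKEND before the
# engine is created (e.g. FLASH_ATTN, FLASHINFER, XFORMERS). The model name and
# backend choice here are illustrative assumptions.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it", max_model_len=8192)
outputs = llm.generate(["Hello from Gemma 3!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```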
When will V1 support Gemma 3?
It's supported if you install from the main branch, but there might be correctness issues because its attention mask is not fully implemented in V1.
--max-model-len is the model's total context size; see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html. There are some problems with Gemma 3 support currently: V0 doesn't support FlashAttention and the context size is huge, and with V1 I wasn't able to load the GPTQ quant. Edit: I was able to load the GPTQ quant, but the context size is very low (<8192) on a 24 GB GPU.
Could you kindly share a code snippet? I am facing a few issues too and that would help a ton! @anunknowperson 🙏
My launch string for the OpenAI-compatible server is CUDA_VISIBLE_DEVICES=0 vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --max-model-len 8192 --max-num-seqs 10 --gpu-memory-utilization=0.99. You can probably get the params from here for code. The engine is V1; the V0 context is too big.
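If a Python snippet is easier to adapt, a rough offline-inference equivalent of that launch string (a sketch built from the parameters above, not something posted in the thread) could look like this:

```python
from vllm import LLM, SamplingParams

# Parameters mirror the serve command above; adjust max_model_len and
# gpu_memory_utilization for your GPU.
llm = LLM(
    model="ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
    max_model_len=8192,
    max_num_seqs=10,
    gpu_memory_utilization=0.99,
)

outputs = llm.generate(
    ["Summarize what Gemma 3 is in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```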
@DarkLight1337 @WoosukKwon |


This PR adds support for Gemma 3, an open-source vision-language model from Google.
NOTE:
Thanks for the help @ywang96 and @DarkLight1337!
FIX #14663
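For reference, a minimal sketch of exercising the new vision-language path through the offline LLM.chat API (the model ID, image URL, and token budget are placeholder assumptions, not values taken from this PR):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it", max_model_len=8192)

# OpenAI-style multimodal message; the image URL is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/some_image.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

outputs = llm.chat(messages, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```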