Conversation
@DarkLight1337 @ywang96 Thanks for all the fixes!
Hi @WoosukKwon, can you explain the main difference between the two approaches? Thanks :-)
Do we know when the latest Docker image will be published?
The next release is very soon: https://github.com/vllm-project/vllm/milestone/1 |
How can I run Gemma 3 with a backend other than xFormers? It runs very slowly, and it doesn't start with FlashInfer.
Having the same problem.
Can you try setting
@DarkLight1337 |
The error message shows that your
@DarkLight1337 I understand that perfectly, but the Gemma 3 model has a larger context window: a total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size.
I think the max model len here corresponds to the sliding-window length, not the total context length.
In general, try running the model with any other backend and you'll see that it doesn't work, while xFormers is terribly slow.
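For anyone hitting the backend question above, here is a minimal sketch (my own assumption, not something confirmed in this thread) of forcing a specific attention backend through the VLLM_ATTENTION_BACKEND environment variable; whether a given backend actually supports Gemma 3's interleaved sliding-window attention is a separate issue:

```python
# Sketch: select the attention backend via VLLM_ATTENTION_BACKEND before the
# engine is created (e.g. FLASH_ATTN, FLASHINFER, XFORMERS). The model name and
# backend choice here are illustrative assumptions.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it", max_model_len=8192)
outputs = llm.generate(["Hello from Gemma 3!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```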
When will V1 support Gemma 3?
It's supported if you install from the main branch, but there might be correctness issues because its attention mask is not fully implemented in V1.
--max-model-len is the model's total context size; see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html. There are some problems with Gemma 3 support currently: V0 doesn't support FlashAttention and the context size is huge, and with V1 I wasn't able to load the GPTQ quant. Edit: I was able to load the GPTQ quant, but the context size is very low (<8192) on a 24 GB GPU.
Could you kindly share a code snippet? I am facing a few issues too and that would help a ton! @anunknowperson 🙏
My launch string for the OpenAI-compatible server is CUDA_VISIBLE_DEVICES=0 vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --max-model-len 8192 --max-num-seqs 10 --gpu-memory-utilization=0.99. You can probably get the params from here for code. The engine is V1; the V0 context is too big.
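If a Python snippet is easier to adapt, a rough offline-inference equivalent of that launch string (a sketch built from the parameters above, not something posted in the thread) could look like this:

```python
from vllm import LLM, SamplingParams

# Parameters mirror the serve command above; adjust max_model_len and
# gpu_memory_utilization for your GPU.
llm = LLM(
    model="ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
    max_model_len=8192,
    max_num_seqs=10,
    gpu_memory_utilization=0.99,
)

outputs = llm.generate(
    ["Summarize what Gemma 3 is in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```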
@DarkLight1337 @WoosukKwon |


This PR adds support for Gemma 3, an open-source vision-language model from Google.
NOTE:
Thanks for the help @ywang96 and @DarkLight1337!
FIX #14663
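For reference, a minimal sketch of exercising the new vision-language path through the offline LLM.chat API (the model ID, image URL, and token budget are placeholder assumptions, not values taken from this PR):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it", max_model_len=8192)

# OpenAI-style multimodal message; the image URL is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/some_image.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

outputs = llm.chat(messages, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```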