
[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s #29217

Merged
vllm-bot merged 1 commit into vllm-project:main from lgeiger:tune-l40s on Nov 22, 2025

Conversation

@lgeiger (Contributor) commented Nov 22, 2025

Purpose

This PR tunes the Triton w8a8 block fp8 kernels on a single L40S GPU for the shapes used in Qwen3-VL-32B-Instruct-FP8, improving throughput by ~11.4% on the text-only ShareGPT benchmark.
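For context, vLLM looks up tuned Triton launch parameters for block-fp8 GEMMs from per-device JSON files named after the GEMM's N and K dimensions (e.g. a file name following the pattern `N=6144,K=5120,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json`; the N/K values here are hypothetical). Per the review below, this PR adds four such files. A sketch of what one of these files typically contains, with illustrative numbers rather than values copied from the PR: each top-level key is a benchmarked M (batched token count), mapping to the Triton config that was fastest for that M.

```json
{
  "1": {
    "BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3
  },
  "64": {
    "BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 4
  },
  "4096": {
    "BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4
  }
}
```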

Test Plan

vllm serve Qwen/Qwen3-VL-32B-Instruct-FP8 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 0 --max-model-len 24000 --no-enable-prefix-caching

vllm bench serve --backend vllm --model Qwen/Qwen3-VL-32B-Instruct-FP8 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
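For readers unfamiliar with how the tuned files are consumed: at runtime the kernel wrapper loads the JSON matching the layer's N/K and the current device, then picks the entry whose M key is closest to the actual number of batched tokens. A minimal sketch of that nearest-M lookup, assuming the file format shown above (the function name and file name are illustrative, not vLLM's exact internal API):

```python
import json


def load_tuned_config(path: str, m: int) -> dict:
    """Pick the tuned Triton config whose M key is nearest to the runtime M.

    Illustrative sketch of the nearest-M selection scheme used for
    per-shape tuned kernel configs; not vLLM's exact internal API.
    """
    with open(path) as f:
        configs = {int(k): v for k, v in json.load(f).items()}
    # Choose the benchmarked M bucket closest to the runtime M.
    best_m = min(configs, key=lambda k: abs(k - m))
    return configs[best_m]


# Example: a step batching 48 tokens maps to the nearest tuned M bucket
# in the file (file name below is hypothetical).
# cfg = load_tuned_config(
#     "N=6144,K=5120,device_name=NVIDIA_L40S,dtype=fp8_w8a8,"
#     "block_shape=[128,128].json", m=48)
```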

Test Result

Before:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  305.45
Total input tokens:                      217393
Total generated tokens:                  189963
Request throughput (req/s):              3.27
Output token throughput (tok/s):         621.91
Peak output token throughput (tok/s):    1514.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          1333.62
---------------Time to First Token----------------
Mean TTFT (ms):                          124986.70
Median TTFT (ms):                        123096.83
P99 TTFT (ms):                           257832.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          130.71
Median TPOT (ms):                        112.19
P99 TPOT (ms):                           554.13
---------------Inter-token Latency----------------
Mean ITL (ms):                           109.80
Median ITL (ms):                         73.75
P99 ITL (ms):                            523.20
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  274.13
Total input tokens:                      217393
Total generated tokens:                  189963
Request throughput (req/s):              3.65
Output token throughput (tok/s):         692.98
Peak output token throughput (tok/s):    1439.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          1486.02
---------------Time to First Token----------------
Mean TTFT (ms):                          118620.50
Median TTFT (ms):                        121301.58
P99 TTFT (ms):                           239239.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          130.03
Median TPOT (ms):                        103.26
P99 TPOT (ms):                           531.54
---------------Inter-token Latency----------------
Mean ITL (ms):                           102.96
Median ITL (ms):                         65.82
P99 ITL (ms):                            489.05
==================================================
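Summarizing the two runs above (deltas computed from the reported numbers):

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| Benchmark duration (s) | 305.45 | 274.13 | -10.3% |
| Request throughput (req/s) | 3.27 | 3.65 | +11.6% |
| Output token throughput (tok/s) | 621.91 | 692.98 | +11.4% |
| Total token throughput (tok/s) | 1333.62 | 1486.02 | +11.4% |
| Median TPOT (ms) | 112.19 | 103.26 | -8.0% |
| Median ITL (ms) | 73.75 | 65.82 | -10.8% |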

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@mergify bot added the qwen (Related to Qwen models) label Nov 22, 2025

@gemini-code-assist bot left a comment


Code Review

This pull request adds tuned Triton kernel configurations for w8a8 block fp8 on NVIDIA L40s GPUs, targeting shapes from the Qwen3-VL-32B model. The changes, consisting of four new JSON configuration files, are backed by benchmark results showing an ~11.4% throughput improvement. The new configurations are well-structured and follow existing conventions, representing a valuable performance optimization for this hardware and model combination.

@vllm-bot vllm-bot merged commit d045e22 into vllm-project:main Nov 22, 2025
7 of 8 checks passed
@lgeiger lgeiger deleted the tune-l40s branch November 22, 2025 01:50
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s (vllm-project#29217)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

Labels

qwen Related to Qwen models
