
[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s #29217

Merged
vllm-bot merged 1 commit into vllm-project:main from lgeiger:tune-l40s on Nov 22, 2025

Conversation

@lgeiger (Contributor) commented Nov 22, 2025

Purpose

This PR tunes the Triton w8a8 block fp8 kernels on a single L40S GPU for the shapes used in Qwen3-VL-32B-Instruct-FP8, improving throughput by ~11.4% on the text-only ShareGPT benchmark.
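For context, vLLM looks up tuned Triton launch parameters for block-fp8 GEMMs from per-device JSON files named after the GEMM's N and K dimensions (e.g. a file name following the pattern `N=6144,K=5120,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json`; the N/K values here are hypothetical). Per the review below, this PR adds four such files. A sketch of what one of these files typically contains, with illustrative numbers rather than values copied from the PR: each top-level key is a benchmarked M (batched token count), mapping to the Triton config that was fastest for that M.

```json
{
  "1": {
    "BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3
  },
  "64": {
    "BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 4
  },
  "4096": {
    "BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4
  }
}
```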

Test Plan

vllm serve Qwen/Qwen3-VL-32B-Instruct-FP8 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 0 --max-model-len 24000 --no-enable-prefix-caching

vllm bench serve --backend vllm --model Qwen/Qwen3-VL-32B-Instruct-FP8 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
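For readers unfamiliar with how the tuned files are consumed: at runtime the kernel wrapper loads the JSON matching the layer's N/K and the current device, then picks the entry whose M key is closest to the actual number of batched tokens. A minimal sketch of that nearest-M lookup, assuming the file format shown above (the function name and file name are illustrative, not vLLM's exact internal API):

```python
import json


def load_tuned_config(path: str, m: int) -> dict:
    """Pick the tuned Triton config whose M key is nearest to the runtime M.

    Illustrative sketch of the nearest-M selection scheme used for
    per-shape tuned kernel configs; not vLLM's exact internal API.
    """
    with open(path) as f:
        configs = {int(k): v for k, v in json.load(f).items()}
    # Choose the benchmarked M bucket closest to the runtime M.
    best_m = min(configs, key=lambda k: abs(k - m))
    return configs[best_m]


# Example: a step batching 48 tokens maps to the nearest tuned M bucket
# in the file (file name below is hypothetical).
# cfg = load_tuned_config(
#     "N=6144,K=5120,device_name=NVIDIA_L40S,dtype=fp8_w8a8,"
#     "block_shape=[128,128].json", m=48)
```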

Test Result

Before:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  305.45
Total input tokens:                      217393
Total generated tokens:                  189963
Request throughput (req/s):              3.27
Output token throughput (tok/s):         621.91
Peak output token throughput (tok/s):    1514.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          1333.62
---------------Time to First Token----------------
Mean TTFT (ms):                          124986.70
Median TTFT (ms):                        123096.83
P99 TTFT (ms):                           257832.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          130.71
Median TPOT (ms):                        112.19
P99 TPOT (ms):                           554.13
---------------Inter-token Latency----------------
Mean ITL (ms):                           109.80
Median ITL (ms):                         73.75
P99 ITL (ms):                            523.20
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  274.13
Total input tokens:                      217393
Total generated tokens:                  189963
Request throughput (req/s):              3.65
Output token throughput (tok/s):         692.98
Peak output token throughput (tok/s):    1439.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          1486.02
---------------Time to First Token----------------
Mean TTFT (ms):                          118620.50
Median TTFT (ms):                        121301.58
P99 TTFT (ms):                           239239.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          130.03
Median TPOT (ms):                        103.26
P99 TPOT (ms):                           531.54
---------------Inter-token Latency----------------
Mean ITL (ms):                           102.96
Median ITL (ms):                         65.82
P99 ITL (ms):                            489.05
==================================================
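Summarizing the two runs above (deltas computed from the reported numbers):

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| Benchmark duration (s) | 305.45 | 274.13 | -10.3% |
| Request throughput (req/s) | 3.27 | 3.65 | +11.6% |
| Output token throughput (tok/s) | 621.91 | 692.98 | +11.4% |
| Total token throughput (tok/s) | 1333.62 | 1486.02 | +11.4% |
| Median TPOT (ms) | 112.19 | 103.26 | -8.0% |
| Median ITL (ms) | 73.75 | 65.82 | -10.8% |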

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@mergify bot added the qwen (Related to Qwen models) label Nov 22, 2025

@gemini-code-assist bot left a comment


Code Review

This pull request adds tuned Triton kernel configurations for w8a8 block fp8 on NVIDIA L40s GPUs, targeting shapes from the Qwen3-VL-32B model. The changes, consisting of four new JSON configuration files, are backed by benchmark results showing an ~11.4% throughput improvement. The new configurations are well-structured and follow existing conventions, representing a valuable performance optimization for this hardware and model combination.

@vllm-bot vllm-bot merged commit d045e22 into vllm-project:main Nov 22, 2025
7 of 8 checks passed
@lgeiger lgeiger deleted the tune-l40s branch November 22, 2025 01:50
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s (vllm-project#29217)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

Labels

qwen Related to Qwen models
