[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s#29217
Merged
vllm-bot merged 1 commit into vllm-project:main on Nov 22, 2025
Conversation
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Contributor
Code Review
This pull request adds tuned Triton kernel configurations for w8a8 block fp8 on NVIDIA L40s GPUs, targeting shapes from the Qwen3-VL-32B model. The changes, consisting of four new JSON configuration files, are backed by benchmark results showing an ~11.4% throughput improvement. The new configurations are well-structured and follow existing conventions, representing a valuable performance optimization for this hardware and model combination.
DarkLight1337 approved these changes on Nov 22, 2025
ywang96 pushed a commit to ywang96/vllm that referenced this pull request on Nov 23, 2025
…ject#29217) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request on Nov 24, 2025
…ject#29217) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request on Nov 29, 2025
…ject#29217) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request on Dec 1, 2025
…ject#29217) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Purpose
This tunes the Triton w8a8 block fp8 kernels on a single L40s GPU for the shapes used in
Qwen3-VL-32B-Instruct-FP8, which improves throughput by ~11.4% in the ShareGPT text-only benchmark.
Test Plan
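The tuned configurations land as JSON files keyed by batch size M, with each entry holding Triton launch parameters for that shape. As a rough sketch of how such a file is consumed (the field names follow vLLM's kernel-config convention, but the numeric values below are illustrative, not the actual tuned L40s numbers, and `pick_config` is a simplified stand-in for vLLM's internal lookup):

```python
import json

# Hypothetical tuned-config file contents, keyed by batch size M.
# Values are placeholders, not the real tuned L40s parameters.
EXAMPLE_CONFIG = json.loads("""
{
  "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
         "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
  "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
         "GROUP_SIZE_M": 8,  "num_warps": 4, "num_stages": 4}
}
""")

def pick_config(configs: dict, m: int) -> dict:
    """Return the tuned entry whose key is closest to the runtime batch size M."""
    best_key = min(configs, key=lambda k: abs(int(k) - m))
    return configs[best_key]

# For M=48 the nearest tuned key is "64".
print(pick_config(EXAMPLE_CONFIG, 48)["BLOCK_SIZE_M"])
```

Because the lookup snaps the runtime M to the nearest tuned key, adding entries for the exact shapes of a given model (here, Qwen3-VL-32B on L40s) is what yields the measured throughput gain.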
Test Result
Before:
After: