[Kernels] Improve Triton fp8 block scaled kernel #29438

Closed

lgeiger wants to merge 3 commits into vllm-project:main from lgeiger:fp8-block-scaled-mm

Conversation

lgeiger (Contributor) commented on Nov 25, 2025

Purpose

This PR aims to improve performance of the Triton fp8 block scaled w8a8 kernel.

It's probably best to review it commit by commit.
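For context, the kernel computes a matmul over fp8-quantized activations and weights whose scales are stored per block rather than per tensor. A minimal NumPy reference of that computation (illustrative only: the function name, the [128, 128] group sizes, and the exact scale layouts are assumptions, not taken from the kernel) looks roughly like this:

import numpy as np

def block_scaled_matmul_ref(A_q, As, B_q, Bs, group_n=128, group_k=128):
    # A_q: (M, K) quantized activations (held as floats here for simplicity)
    # As:  (M, K // group_k) per-token, per-K-group activation scales
    # B_q: (N, K) quantized weights
    # Bs:  (N // group_n, K // group_k) per-block weight scales
    # Assumes K and N are multiples of the group sizes.
    M, K = A_q.shape
    N = B_q.shape[0]
    C = np.zeros((M, N), dtype=np.float32)
    for k0 in range(0, K, group_k):
        a = A_q[:, k0:k0 + group_k].astype(np.float32)
        a_s = As[:, k0 // group_k][:, None]
        for n0 in range(0, N, group_n):
            b = B_q[n0:n0 + group_n, k0:k0 + group_k].astype(np.float32)
            b_s = Bs[n0 // group_n, k0 // group_k]
            # dequantize each tile with its two scales and accumulate
            C[:, n0:n0 + group_n] += (a * a_s) @ b.T * b_s
    return C

The Triton kernel fuses this per-tile rescaling into its K loop, which is where the pointer arithmetic touched by this PR lives.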

Test Plan

Correctness should be covered by tests/kernels/quantization/test_block_fp8.py and I also verified it with lm_eval for Qwen3-VL-2B-Instruct-FP8.
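The exact lm_eval invocation isn't included in the PR; a run along these lines (task and flags are illustrative, not the author's exact command) is the kind of check meant here:

lm_eval --model vllm \
    --model_args pretrained=Qwen/Qwen3-VL-2B-Instruct-FP8 \
    --tasks gsm8k \
    --batch_size auto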

I tested performance on an L40S with Qwen3-VL-32B-Instruct-FP8:

vllm serve Qwen/Qwen3-VL-32B-Instruct-FP8 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 0 --max-model-len 24000 --no-enable-prefix-caching

vllm bench serve --backend vllm --model Qwen/Qwen3-VL-32B-Instruct-FP8 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000

Test Results

Before:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  282.88
Total input tokens:                      217393
Total generated tokens:                  189963
Request throughput (req/s):              3.54
Output token throughput (tok/s):         671.54
Peak output token throughput (tok/s):    1477.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          1440.06
---------------Time to First Token----------------
Mean TTFT (ms):                          115438.24
Median TTFT (ms):                        119080.39
P99 TTFT (ms):                           239007.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          129.52
Median TPOT (ms):                        104.82
P99 TPOT (ms):                           540.94
---------------Inter-token Latency----------------
Mean ITL (ms):                           104.46
Median ITL (ms):                         66.16
P99 ITL (ms):                            438.74
==================================================

After code changes:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  279.72
Total input tokens:                      217393
Total generated tokens:                  189963
Request throughput (req/s):              3.58
Output token throughput (tok/s):         679.12
Peak output token throughput (tok/s):    1707.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          1456.30
---------------Time to First Token----------------
Mean TTFT (ms):                          117722.64
Median TTFT (ms):                        122529.00
P99 TTFT (ms):                           247590.06
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          133.14
Median TPOT (ms):                        106.13
P99 TPOT (ms):                           503.34
---------------Inter-token Latency----------------
Mean ITL (ms):                           106.04
Median ITL (ms):                         66.58
P99 ITL (ms):                            491.64
==================================================

After code changes and re-tuned:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  273.16
Total input tokens:                      217393
Total generated tokens:                  189963
Request throughput (req/s):              3.66
Output token throughput (tok/s):         695.43
Peak output token throughput (tok/s):    1644.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          1491.28
---------------Time to First Token----------------
Mean TTFT (ms):                          117644.34
Median TTFT (ms):                        123531.08
P99 TTFT (ms):                           240034.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          122.76
Median TPOT (ms):                        102.63
P99 TPOT (ms):                           511.67
---------------Inter-token Latency----------------
Mean ITL (ms):                           102.85
Median ITL (ms):                         71.09
P99 ITL (ms):                            518.81
==================================================

Overall this improves total token throughput of Qwen3-VL-32B-Instruct-FP8 on a single L40S by about 3.5%, which is decent.

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
mergify bot added the performance (Performance-related issues) label on Nov 25, 2025
gemini-code-assist bot left a comment:

Code Review

This pull request introduces performance improvements to the Triton fp8 block scaled kernel by simplifying pointer arithmetic and removing unnecessary masked loads. The changes are well-reasoned and backed by benchmark results. However, I've identified a pre-existing critical bug that could lead to out-of-bounds memory access when loading scale tensors. This occurs when the K dimension is not a multiple of BLOCK_SIZE_K. I've provided a comment with a suggested fix to address this correctness issue.

Comment on lines +728 to +729
a_s = tl.load(As_ptrs)
b_s = tl.load(Bs_ptrs)

critical

There is a potential out-of-bounds memory access when loading the scale tensors a_s and b_s. This can happen when K is not perfectly divisible by BLOCK_SIZE_K, causing the loop to have a final iteration that accesses beyond the bounds of the scale tensors. The scale tensors As and Bs have a size of K // group_k along the K dimension. However, the access offset, which is effectively k * scale_step_k, can exceed this limit in the final loop iteration. This was also an issue in the previous implementation. To prevent this, we should add a mask to the scale loads.

Suggested change
a_s = tl.load(As_ptrs)
b_s = tl.load(Bs_ptrs)
scale_mask = k * scale_step_k < (K // group_k)
a_s = tl.load(As_ptrs, mask=scale_mask, other=0.0)
b_s = tl.load(Bs_ptrs, mask=scale_mask, other=0.0)
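To make the reviewer's premise concrete (numbers are illustrative, not from the PR): if the scale tensors really hold K // group_k entries along K, then with K = 192, group_k = 128 and BLOCK_SIZE_K = 128 they hold a single entry, while the loop runs ceil(192 / 128) = 2 iterations; the second iteration's offset k * scale_step_k = 1 lands one element past the end, which is exactly the access the suggested mask guards against.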


chatgpt-codex-connector bot left a comment:


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 732 to +735
a_ptrs += BLOCK_SIZE_K * stride_ak
b_ptrs += BLOCK_SIZE_K * stride_bk
As_ptrs += scale_step_k * stride_As_k
Bs_ptrs += scale_step_k * stride_Bs_k

P1: Scale pointers stuck when BLOCK_SIZE_K < group_k

The new pointer-stepping logic multiplies scale_step_k = BLOCK_SIZE_K // group_k into the scale strides. When the K tile is smaller than the quantization block (e.g., block_shape [128,128] with tuned configs that set BLOCK_SIZE_K to 64), this integer division is zero, so As_ptrs/Bs_ptrs never advance and every K tile reuses the first scale block. That produces wrong scaling for K offsets beyond the first 128 elements, whereas the previous k_start // group_k offset handled the larger block correctly.
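
One way to address this (a sketch based only on the snippet quoted above, not code from this PR) would be to recompute the scale offset from the running K position on each iteration, as the previous k_start // group_k indexing did, instead of accumulating a fixed scale_step_k increment that can round to zero:

# Fragment of the K-loop body; k is the loop induction variable, As_ptrs / Bs_ptrs
# stay pointed at the first scale block, and offs_ks (a name introduced here for
# illustration) supplies the per-iteration scale offset.
offs_ks = (k * BLOCK_SIZE_K) // group_k   # still advances when BLOCK_SIZE_K < group_k
a_s = tl.load(As_ptrs + offs_ks * stride_As_k)
b_s = tl.load(Bs_ptrs + offs_ks * stride_Bs_k)
...
a_ptrs += BLOCK_SIZE_K * stride_ak
b_ptrs += BLOCK_SIZE_K * stride_bk

This trades a couple of extra integer ops per iteration for correct scaling whenever a tuned BLOCK_SIZE_K drops below the quantization group size.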


lgeiger (Contributor, Author) replied:

This sounds sensible. I'll have a look later.

github-actions bot commented:

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale (Over 90 days of inactivity) label on Feb 24, 2026
lgeiger closed this on Mar 18, 2026