[Kernel] Unified attention kernel performance tuning #28497
cagrikymk wants to merge 4 commits into vllm-project:main
Conversation
Signed-off-by: Mehmet Cagri Kaymak <mehmet.kaymak@amd.com>
@tdoublep can you help to review?
The following are benchmark results obtained on an NVIDIA H100 GPU.
This PR:

============ Serving Benchmark Result ============
Successful requests:                     984
Failed requests:                         16
Benchmark duration (s):                  19.56
Total input tokens:                      211284
Total generated tokens:                  194325
Request throughput (req/s):              50.32
Output token throughput (tok/s):         9936.97
Peak output token throughput (tok/s):    20576.00
Peak concurrent requests:                984.00
Total Token throughput (tok/s):          20741.15
---------------Time to First Token----------------
Mean TTFT (ms):                          3336.41
Median TTFT (ms):                        3269.26
P99 TTFT (ms):                           6205.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          79.15
Median TPOT (ms):                        44.26
P99 TPOT (ms):                           210.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.44
Median ITL (ms):                         23.97
P99 ITL (ms):                            215.13
==================================================

Current upstream:

============ Serving Benchmark Result ============
Successful requests:                     984
Failed requests:                         16
Benchmark duration (s):                  19.16
Total input tokens:                      209644
Total generated tokens:                  194099
Request throughput (req/s):              51.36
Output token throughput (tok/s):         10131.23
Peak output token throughput (tok/s):    20148.00
Peak concurrent requests:                984.00
Total Token throughput (tok/s):          21073.84
---------------Time to First Token----------------
Mean TTFT (ms):                          3368.81
Median TTFT (ms):                        3345.50
P99 TTFT (ms):                           6208.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          78.77
Median TPOT (ms):                        43.73
P99 TPOT (ms):                           211.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.38
Median ITL (ms):                         24.00
P99 ITL (ms):                            213.75
==================================================

For these experiments on an NVIDIA H100 GPU, no clear performance differences were observed.
tdoublep
left a comment
I'm totally fine with introducing the AMD-specific configs. This is important for platform portability without introducing auto-tuning overhead.
I have some concerns about whether some of the other changes are really necessary (e.g. log2 stuff, masking changes). On H100 they don't really seem to have any effect.
    if ALL_DECODE or num_query_heads <= BLOCK_M:
        Q_cache_modifier: tl.constexpr = ".cg"
    else:
        Q_cache_modifier: tl.constexpr = ""
Could we add a note explaining why this is expected to help?
    if HEAD_SIZE_PADDED != HEAD_SIZE:
        dim_mask = offs_d < HEAD_SIZE
    else:
        dim_mask = tl.full((1,), 1, dtype=tl.int1)
Is there a significant performance improvement from this specific change?
This one has a negligible benefit; I'll remove it to keep the diff minimal.
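For context, the padded-head-size mask above keeps only the first `HEAD_SIZE` lanes valid when the head dimension is padded up to a power of two for the kernel. A plain-Python sketch of the same idea (illustrative only, not the Triton code; the sizes are made up):

```python
HEAD_SIZE = 80          # actual head dimension (hypothetical value)
HEAD_SIZE_PADDED = 128  # padded up to the next power of two for the kernel

offs_d = list(range(HEAD_SIZE_PADDED))
if HEAD_SIZE_PADDED != HEAD_SIZE:
    dim_mask = [d < HEAD_SIZE for d in offs_d]   # valid lanes only
else:
    dim_mask = [True]                            # broadcastable "all valid" mask

# Only the first HEAD_SIZE lanes survive masking; padded lanes are dropped.
assert sum(dim_mask) == HEAD_SIZE
```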
    if TILE_SIZE == BLOCK_SIZE:
        tile_mask = tl.full((1,), 1, dtype=tl.int1)
    else:
        tile_mask = seq_offset < max_seq_prefix_len
    # softcap here uses exp2 and consumes the RCP_LN2 conversion;
    # multiply by RCP_LN2 again so it can be used in the later exp2
    S = apply_softcap(S, softcap) * RCP_LN2
This log2 change makes the algorithm harder to follow. Could we quantify what this specific change brings to the performance?
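For readers following the log2 discussion: the rewrite relies on the identity exp(x) = 2^(x · log2(e)), so pre-multiplying scores by RCP_LN2 = 1/ln(2) lets later softmax steps use the cheaper `exp2` instead of `exp`. A minimal numerical check in plain Python (not the Triton kernel itself):

```python
import math

RCP_LN2 = 1.0 / math.log(2.0)  # == log2(e), the constant folded into the scores

def exp_via_exp2(x: float) -> float:
    # exp(x) == 2 ** (x * log2(e)); on GPUs, exp2 maps to cheaper hardware ops
    return 2.0 ** (x * RCP_LN2)

for x in (-3.0, 0.0, 0.5, 4.2):
    assert math.isclose(exp_via_exp2(x), math.exp(x), rel_tol=1e-12)
```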
Here is a table summarizing the benefit of the following changes on MI300 (Num KV heads: 8):

I can do (1) and (3) as their effect is minimal, if you prefer to minimize the diff. What do you think @tdoublep?
@cagrikymk Sorry for the slow response on this one. Thank you for the detailed experiments!! If I understand correctly, lower is better in this table, right?
@tdoublep Oh, I forgot to include some important details. These numbers are speedups relative to the current code in the PR. Each column undoes one change and shows how much performance is affected by it. So a low number means the change is needed for good performance; hence, as you said, lower is better, since that indicates the importance of keeping that optimization.
With this PR, Triton attention gets faster but is still slower than using After adding the following changes on top
Test case is
@cagrikymk Is this PR stale?
Purpose
This PR optimizes the unified attention kernel and restructures the config selection process.
It adds functions that provide configs for the 2D and 3D attention kernels, along with the selection logic between them.
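A rough sketch of what such a selection layer could look like (all names, thresholds, and config values below are illustrative assumptions, not the actual vLLM code):

```python
from dataclasses import dataclass

@dataclass
class KernelConfig:
    BLOCK_M: int
    BLOCK_N: int
    num_warps: int
    num_stages: int

def get_2d_config(is_rocm: bool) -> KernelConfig:
    # Platform-specific static configs avoid runtime auto-tuning overhead
    if is_rocm:
        return KernelConfig(BLOCK_M=32, BLOCK_N=64, num_warps=4, num_stages=1)
    return KernelConfig(BLOCK_M=64, BLOCK_N=64, num_warps=4, num_stages=3)

def get_3d_config(is_rocm: bool) -> KernelConfig:
    if is_rocm:
        return KernelConfig(BLOCK_M=16, BLOCK_N=64, num_warps=2, num_stages=1)
    return KernelConfig(BLOCK_M=32, BLOCK_N=64, num_warps=4, num_stages=3)

def select_kernel(num_seqs: int, max_seq_len: int, is_rocm: bool) -> KernelConfig:
    # Prefer the 3D (split-KV) kernel when there are few requests but long
    # sequences, where extra parallelism over the KV dimension pays off.
    if num_seqs < 64 and max_seq_len > 4096:
        return get_3d_config(is_rocm)
    return get_2d_config(is_rocm)
```

The point of centralizing this is that each platform (CUDA, ROCm) can get tuned defaults from one place rather than scattering heuristics through the kernel launch path.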
Besides adding AMD-specific configs, it also changes:

- `tl.exp` to `tl.math.exp2`, adding the related scaling

Evaluation
Server:
GPT-OSS-120b Performance
TP1, ISL=8K, OSL=128, run on MI300.
GPT-OSS-120b Accuracy
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8000/v1/completions,max_gen_toks=2048 --tasks gsm8k --num_fewshot 5 --batch_size 64 --apply_chat_template

This PR:
Main:
Llama-3.1-8B-Instruct-FP8-KV Performance
TP1, ISL=8K, OSL=128, run on MI300.
Llama-3.1-8B-Instruct-FP8-KV Accuracy
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8000/v1/completions --tasks gsm8k --num_fewshot 5 --batch_size 64

This PR:
Main: