[Kernel] Unified attention kernel performance tuning #28497
cagrikymk wants to merge 4 commits into vllm-project:main
Conversation
Signed-off-by: Mehmet Cagri Kaymak <mehmet.kaymak@amd.com>
@tdoublep can you help to review?
The following are benchmark results obtained on an NVIDIA H100 GPU.
This PR:

============ Serving Benchmark Result ============
Successful requests:                     984
Failed requests:                         16
Benchmark duration (s):                  19.56
Total input tokens:                      211284
Total generated tokens:                  194325
Request throughput (req/s):              50.32
Output token throughput (tok/s):         9936.97
Peak output token throughput (tok/s):    20576.00
Peak concurrent requests:                984.00
Total Token throughput (tok/s):          20741.15
---------------Time to First Token----------------
Mean TTFT (ms):                          3336.41
Median TTFT (ms):                        3269.26
P99 TTFT (ms):                           6205.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          79.15
Median TPOT (ms):                        44.26
P99 TPOT (ms):                           210.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.44
Median ITL (ms):                         23.97
P99 ITL (ms):                            215.13
==================================================

Current upstream:

============ Serving Benchmark Result ============
Successful requests:                     984
Failed requests:                         16
Benchmark duration (s):                  19.16
Total input tokens:                      209644
Total generated tokens:                  194099
Request throughput (req/s):              51.36
Output token throughput (tok/s):         10131.23
Peak output token throughput (tok/s):    20148.00
Peak concurrent requests:                984.00
Total Token throughput (tok/s):          21073.84
---------------Time to First Token----------------
Mean TTFT (ms):                          3368.81
Median TTFT (ms):                        3345.50
P99 TTFT (ms):                           6208.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          78.77
Median TPOT (ms):                        43.73
P99 TPOT (ms):                           211.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.38
Median ITL (ms):                         24.00
P99 ITL (ms):                            213.75
==================================================

For these experiments on an NVIDIA H100 GPU, no clear performance differences were observed.
tdoublep
left a comment
I'm totally fine with introducing the AMD-specific configs. This is important for platform portability without introducing auto-tuning overhead.
I have some concerns about whether some of the other changes are really necessary (e.g. log2 stuff, masking changes). On H100 they don't really seem to have any effect.
    if ALL_DECODE or num_query_heads <= BLOCK_M:
        Q_cache_modifier: tl.constexpr = ".cg"
    else:
        Q_cache_modifier: tl.constexpr = ""
Could we add a note explaining why this is expected to help?
    if HEAD_SIZE_PADDED != HEAD_SIZE:
        dim_mask = offs_d < HEAD_SIZE
    else:
        dim_mask = tl.full((1,), 1, dtype=tl.int1)
Is there a significant performance improvement from this specific change?
This one has a negligible benefit; I'll remove it to keep the diff minimal.
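For context, the padded-head-size mask above keeps only the first `HEAD_SIZE` lanes valid when the head dimension is padded up to a power of two for the kernel. A plain-Python sketch of the same idea (illustrative only, not the Triton code; the sizes are made up):

```python
HEAD_SIZE = 80          # actual head dimension (hypothetical value)
HEAD_SIZE_PADDED = 128  # padded up to the next power of two for the kernel

offs_d = list(range(HEAD_SIZE_PADDED))
if HEAD_SIZE_PADDED != HEAD_SIZE:
    dim_mask = [d < HEAD_SIZE for d in offs_d]   # valid lanes only
else:
    dim_mask = [True]                            # broadcastable "all valid" mask

# Only the first HEAD_SIZE lanes survive masking; padded lanes are dropped.
assert sum(dim_mask) == HEAD_SIZE
```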
    if TILE_SIZE == BLOCK_SIZE:
        tile_mask = tl.full((1,), 1, dtype=tl.int1)
    else:
        tile_mask = seq_offset < max_seq_prefix_len
    # softcap here uses exp2 and consumes the RCP_LN2 conversion;
    # multiply by RCP_LN2 again so it can be used in the later exp2
    S = apply_softcap(S, softcap) * RCP_LN2
This log2 change makes the algorithm harder to follow. Could we quantify what this specific change brings to the performance?
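For readers following the log2 discussion: the rewrite relies on the identity exp(x) = 2^(x · log2(e)), so pre-multiplying scores by RCP_LN2 = 1/ln(2) lets later softmax steps use the cheaper `exp2` instead of `exp`. A minimal numerical check in plain Python (not the Triton kernel itself):

```python
import math

RCP_LN2 = 1.0 / math.log(2.0)  # == log2(e), the constant folded into the scores

def exp_via_exp2(x: float) -> float:
    # exp(x) == 2 ** (x * log2(e)); on GPUs, exp2 maps to cheaper hardware ops
    return 2.0 ** (x * RCP_LN2)

for x in (-3.0, 0.0, 0.5, 4.2):
    assert math.isclose(exp_via_exp2(x), math.exp(x), rel_tol=1e-12)
```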
Here is a table summarizing the benefit of the following changes on MI300 (Num KV heads: 8):

I can do (1) and (3) as their effect is minimal, if you prefer to minimize the diff. What do you think @tdoublep?
@cagrikymk Sorry for the slow response on this one. Thank you for the detailed experiments!! If I understand correctly, lower is better in this table, right?
@tdoublep Oh, I forgot to include some important details. These numbers are speedups relative to the current code in the PR. Each column undoes one change and shows how much performance is affected by it. So a low number means the change is needed for good performance; hence, as you said, lower is better, since that indicates the importance of keeping that optimization.
With this PR, Triton attention gets faster but is still slower than using After adding the following changes on top
Test case is
@cagrikymk Is this PR stale?
Purpose
This PR optimizes the unified attention kernel and restructures the config selection process.
It adds functions that provide configs for the 2D and 3D attention kernels, along with the selection logic between them.
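A rough sketch of what such a selection layer could look like (all names, thresholds, and config values below are illustrative assumptions, not the actual vLLM code):

```python
from dataclasses import dataclass

@dataclass
class KernelConfig:
    BLOCK_M: int
    BLOCK_N: int
    num_warps: int
    num_stages: int

def get_2d_config(is_rocm: bool) -> KernelConfig:
    # Platform-specific static configs avoid runtime auto-tuning overhead
    if is_rocm:
        return KernelConfig(BLOCK_M=32, BLOCK_N=64, num_warps=4, num_stages=1)
    return KernelConfig(BLOCK_M=64, BLOCK_N=64, num_warps=4, num_stages=3)

def get_3d_config(is_rocm: bool) -> KernelConfig:
    if is_rocm:
        return KernelConfig(BLOCK_M=16, BLOCK_N=64, num_warps=2, num_stages=1)
    return KernelConfig(BLOCK_M=32, BLOCK_N=64, num_warps=4, num_stages=3)

def select_kernel(num_seqs: int, max_seq_len: int, is_rocm: bool) -> KernelConfig:
    # Prefer the 3D (split-KV) kernel when there are few requests but long
    # sequences, where extra parallelism over the KV dimension pays off.
    if num_seqs < 64 and max_seq_len > 4096:
        return get_3d_config(is_rocm)
    return get_2d_config(is_rocm)
```

The point of centralizing this is that each platform (CUDA, ROCm) can get tuned defaults from one place rather than scattering heuristics through the kernel launch path.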
Besides adding AMD-specific configs, it also changes:

- `tl.exp` to `tl.math.exp2`, adding the related scaling

Evaluation
Server:
GPT-OSS-120b Performance
TP1, ISL=8K, OSL=128, run on MI300.
GPT-OSS-120b Accuracy
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8000/v1/completions,max_gen_toks=2048 --tasks gsm8k --num_fewshot 5 --batch_size 64 --apply_chat_template

This PR:
Main:
Llama-3.1-8B-Instruct-FP8-KV Performance
TP1, ISL=8K, OSL=128, run on MI300.
Llama-3.1-8B-Instruct-FP8-KV Accuracy
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8000/v1/completions --tasks gsm8k --num_fewshot 5 --batch_size 64

This PR:
Main: