Fix kernel cache miss and add RDNA configs #246
Conversation
hyoon1 commented on Oct 25, 2024
- added Navi configurations (related PR: ROCm/triton#640, "add RDNA Config")
- resolved a kernel cache miss during flash attention calls by fixing max_seqlen_q/k to 0
What is the reason to zero the seq lens?
The attn_fwd attention kernel is called when we run a model with vLLM. However, MAX_SEQLENS_Q/K differ every step, which produces a different cache key and a new compilation of the Triton kernel on each step, leading to performance degradation:
https://github.com/triton-lang/triton/blob/cf34004b8a67d290a962da166f5aa2fc66751326/python/triton/runtime/jit.py#L620
https://github.com/triton-lang/triton/blob/cf34004b8a67d290a962da166f5aa2fc66751326/python/triton/runtime/jit.py#L660
Currently, VARLEN is always set, and if you look at the kernel in vLLM, MAX_SEQLENS_Q/K are not used in that case. Therefore, as a workaround, we simply set MAX_SEQLENS_Q/K to a fixed value when calling the kernel.
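For reference, a minimal sketch of what the workaround looks like at the call site (the function name, grid, and argument list are illustrative, not the exact vLLM code): since the varlen path reads per-sequence lengths from cu_seqlens_q/k, pinning MAX_SEQLENS_Q/K to a constant keeps the Triton cache key built in the jit.py code linked above stable across steps.

```python
# Illustrative sketch only: attn_fwd and grid stand in for the real Triton
# kernel and launch grid used by vLLM's flash-attention path.
def launch_varlen_attention(attn_fwd, grid, q, k, v, out,
                            cu_seqlens_q, cu_seqlens_k,
                            max_seqlen_q, max_seqlen_k):
    # Previously the per-step max_seqlen_q/k were forwarded directly, so each
    # new value produced a new cache key and a fresh kernel compilation.
    # On the varlen path the kernel never reads MAX_SEQLENS_Q/K, so they are
    # pinned to 0 and the compiled kernel is reused across steps.
    # (max_seqlen_q/k stay in the signature only to mirror the original caller.)
    attn_fwd[grid](
        q, k, v, out,
        cu_seqlens_q, cu_seqlens_k,
        MAX_SEQLENS_Q=0,  # was max_seqlen_q, changing every step
        MAX_SEQLENS_K=0,  # was max_seqlen_k, changing every step
        VARLEN=True,
    )
```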
Probably worth using what is already there at line 1620 in 8f3bf8b.
As far as I know, AMD has two hardware lines for vLLM: MI and Navi. So a "not Navi" check should work better for future generations of MIs.
As per @gshtras, we need to merge into the develop branch instead of main for now. Please correct the target branch.
All this functionality is implemented in a cross-architecture fashion in platform/rocm.py and its superclasses.
@maleksan85 @gshtras Secondly, our team is using the v0.6.2+rocm release, and I understand that functions like is_navi() are not supported in that version; implementing them would require significant modifications, so maintaining backward compatibility is also a concern. Given these considerations, I would greatly appreciate your advice on how to proceed with the modifications.
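For context on what such a helper would involve, here is a rough sketch of an is_navi()-style check, assuming a ROCm build of PyTorch. The helper name and the gfx11 heuristic are assumptions for illustration, not the actual implementation in vLLM.

```python
import torch


def is_navi() -> bool:
    """Rough sketch of a Navi (RDNA) detection helper; illustrative only."""
    if not torch.cuda.is_available() or torch.version.hip is None:
        return False
    # On ROCm builds, device properties expose the gfx architecture name;
    # Navi 3x (RDNA3) GPUs report gfx11xx.
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return "gfx11" in arch
```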
As for your last point, whatever changes are made here will not have any effect on previous tags, so v0.6.2+rocm will not be affected.
Force-pushed from 3f81ad2 to 4cc77c2
Are you sure those new commits will not decrease performance on MI? If so, which models did you test?
cc @gshtras
I have tested on Navi31. I assumed it was tested by the Triton team for other models, since they modified the configs for better performance: https://github.com/ROCm/triton/blob/db2ca015159c6592c30a6bfcd77b9cc540063a8e/python/perf-kernels/flash-attention.py#L334
Besides those autotune configs, I believe fixing MAX_SEQLENS_Q/K to 0 will improve performance on MI as well.
We have tested chatglm2-6b, qwen-14b-chat, baichuan2-13b, llama-2-70b-chat, glm-4-9b-chat, qwen1.5-72b-chat-gptq, etc. on Navi31. Without this change, Triton-based FA2 shows no positive perf lift; with this change, Triton-based FA2 shows a 2-5% gain (and debugging confirmed that the Triton FA2 kernel cache is indeed missed). We believe this should also have a positive impact on MI, especially during the early Triton kernel cache build-up period.
As discussed in chat, the MI-related changes will be kept separate from this PR.
Restored the autotune configs for the MI series.
@hyoon1, could you please make this change applicable only to Navi? I will ask engineers in China to confirm the perf gain on Navi32 (although such a cache-miss issue does not depend on which GPU is used). Thanks.
Updated. MI will use the original autotune configs.
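As a rough illustration of the split (tile sizes and config values below are made up, not the exact values merged here), the autotune config list can be chosen per architecture while the MI list stays exactly as before; the architecture check mirrors the is_navi() sketch above.

```python
import torch
import triton


def _is_navi() -> bool:
    # Assumed helper (see the is_navi() sketch above); illustrative only.
    if not torch.cuda.is_available() or torch.version.hip is None:
        return False
    return "gfx11" in torch.cuda.get_device_properties(0).gcnArchName


def get_attn_fwd_autotune_configs():
    """Pick the autotune config list per architecture (values are illustrative)."""
    if _is_navi():
        # Navi (RDNA3): smaller tiles, fewer warps.
        return [
            triton.Config({"BLOCK_M": 32, "BLOCK_N": 32, "PRE_LOAD_V": False},
                          num_stages=1, num_warps=2),
            triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "PRE_LOAD_V": False},
                          num_stages=1, num_warps=4),
        ]
    # MI (CDNA): keep the original config list untouched.
    return [
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "PRE_LOAD_V": False},
                      num_stages=1, num_warps=4),
    ]


# The returned list would then feed @triton.autotune(configs=..., key=[...])
# on the existing attn_fwd kernel, which stays unchanged for MI.
```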
Additional chatglm3-6b throughput test results on Navi32 (16 GB), using Triton, num-prompts 512, max-model-len 512:
original: input 1234.33 toks/s, output 921.11 toks/s; Throughput: 5.31 requests/s, 2544.00 tokens/s
w/ update: input 1386.34 toks/s, output 1034.54 toks/s; Throughput: 5.96 requests/s, 2856.15 tokens/s
- added Navi configurations (Related PR: ROCm/triton#640)
- resolved cache miss issue during flash attention calls by fixing max_seqlen_q/k to 0
Please try to avoid force pushes after the initial reviews; they make it impossible to see the new changes.