Fix kernel cache miss and add RDNA configs #246
Conversation
hyoon1 commented on Oct 25, 2024
- added Navi configurations (related PR: ROCm/triton#640, "add RDNA Config")
- resolved a kernel cache miss during flash attention calls by fixing max_seqlen_q/k to 0
What is the reason to zero the seq lens?
The attn_fwd attention kernel is called when we run a model with vLLM. However, MAX_SEQLENS_Q/K differ every step, which produces a different cache key and a new compilation of the Triton kernel on each step, leading to performance degradation:
https://github.com/triton-lang/triton/blob/cf34004b8a67d290a962da166f5aa2fc66751326/python/triton/runtime/jit.py#L620
https://github.com/triton-lang/triton/blob/cf34004b8a67d290a962da166f5aa2fc66751326/python/triton/runtime/jit.py#L660
Currently, VARLEN is always set, and if you look at the kernel in vLLM, MAX_SEQLENS_Q/K are not used in that case. Therefore, as a workaround, we simply set MAX_SEQLENS_Q/K to a fixed value when calling the kernel.
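For reference, a minimal sketch of what the workaround looks like at the call site (the function name, grid, and argument list are illustrative, not the exact vLLM code): since the varlen path reads per-sequence lengths from cu_seqlens_q/k, pinning MAX_SEQLENS_Q/K to a constant keeps the Triton cache key built in the jit.py code linked above stable across steps.

```python
# Illustrative sketch only: attn_fwd and grid stand in for the real Triton
# kernel and launch grid used by vLLM's flash-attention path.
def launch_varlen_attention(attn_fwd, grid, q, k, v, out,
                            cu_seqlens_q, cu_seqlens_k,
                            max_seqlen_q, max_seqlen_k):
    # Previously the per-step max_seqlen_q/k were forwarded directly, so each
    # new value produced a new cache key and a fresh kernel compilation.
    # On the varlen path the kernel never reads MAX_SEQLENS_Q/K, so they are
    # pinned to 0 and the compiled kernel is reused across steps.
    # (max_seqlen_q/k stay in the signature only to mirror the original caller.)
    attn_fwd[grid](
        q, k, v, out,
        cu_seqlens_q, cu_seqlens_k,
        MAX_SEQLENS_Q=0,  # was max_seqlen_q, changing every step
        MAX_SEQLENS_K=0,  # was max_seqlen_k, changing every step
        VARLEN=True,
    )
```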
Probably worth using what is already there at line 1620 in 8f3bf8b.
As far as I know, AMD has two hardware lines for vLLM: MI and Navi. So a "not Navi" check should work better for future generations of MIs.
As per @gshtras, we need to merge into the develop branch instead of main for now. Please correct the target branch.
All this functionality is implemented in a cross-architecture fashion in platform/rocm.py and its superclasses.
@maleksan85 @gshtras Secondly, our team is using the v0.6.2+rocm release, and I understand that functions like is_navi() are not supported in that version; implementing them would require significant modifications, so maintaining backward compatibility is also a concern. Given these considerations, I would greatly appreciate your advice on how to proceed with the modifications.
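For context on what such a helper would involve, here is a rough sketch of an is_navi()-style check, assuming a ROCm build of PyTorch. The helper name and the gfx11 heuristic are assumptions for illustration, not the actual implementation in vLLM.

```python
import torch


def is_navi() -> bool:
    """Rough sketch of a Navi (RDNA) detection helper; illustrative only."""
    if not torch.cuda.is_available() or torch.version.hip is None:
        return False
    # On ROCm builds, device properties expose the gfx architecture name;
    # Navi 3x (RDNA3) GPUs report gfx11xx.
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return "gfx11" in arch
```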
As for your last point, whatever changes are made here will not have any effect on previous tags, so v0.6.2+rocm will not be affected.
Force-pushed from 3f81ad2 to 4cc77c2
Are you sure those new commits will not decrease performance on MI? If so, which models did you test?
cc @gshtras
I have tested on Navi31. I assumed it was tested by the Triton team for other models, since they modified the configs for better performance: https://github.com/ROCm/triton/blob/db2ca015159c6592c30a6bfcd77b9cc540063a8e/python/perf-kernels/flash-attention.py#L334
Besides those autotune configs, I believe fixing MAX_SEQLENS_Q/K to 0 will improve performance on MI as well.
We have tested chatglm2-6b, qwen-14b-chat, baichuan2-13b, llama-2-70b-chat, glm-4-9b-chat, qwen1.5-72b-chat-gptq, etc. on Navi31. Without this change, Triton-based FA2 shows no positive perf lift; with this change, Triton-based FA2 shows a 2-5% gain (and debugging confirmed that the Triton FA2 kernel cache is indeed missed). We believe this should also have a positive impact on MI, especially during the early Triton kernel cache build-up period.
As discussed in chat, the MI-related changes will be kept separate from this PR.
Restored the autotune configs for the MI series.
@hyoon1, could you please make this change applicable only to Navi? I will ask engineers in China to confirm the perf gain on Navi32 (although such a cache-miss issue does not depend on which GPU is used). Thanks.
Updated. MI will use the original autotune configs.
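As a rough illustration of the split (tile sizes and config values below are made up, not the exact values merged here), the autotune config list can be chosen per architecture while the MI list stays exactly as before; the architecture check mirrors the is_navi() sketch above.

```python
import torch
import triton


def _is_navi() -> bool:
    # Assumed helper (see the is_navi() sketch above); illustrative only.
    if not torch.cuda.is_available() or torch.version.hip is None:
        return False
    return "gfx11" in torch.cuda.get_device_properties(0).gcnArchName


def get_attn_fwd_autotune_configs():
    """Pick the autotune config list per architecture (values are illustrative)."""
    if _is_navi():
        # Navi (RDNA3): smaller tiles, fewer warps.
        return [
            triton.Config({"BLOCK_M": 32, "BLOCK_N": 32, "PRE_LOAD_V": False},
                          num_stages=1, num_warps=2),
            triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "PRE_LOAD_V": False},
                          num_stages=1, num_warps=4),
        ]
    # MI (CDNA): keep the original config list untouched.
    return [
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "PRE_LOAD_V": False},
                      num_stages=1, num_warps=4),
    ]


# The returned list would then feed @triton.autotune(configs=..., key=[...])
# on the existing attn_fwd kernel, which stays unchanged for MI.
```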
Additional chatglm3-6b throughput test results on Navi32 (16 GB), using Triton, num-prompts 512, max-model-len 512:
original: input 1234.33 toks/s, output 921.11 toks/s; Throughput: 5.31 requests/s, 2544.00 tokens/s
w/ update: input 1386.34 toks/s, output 1034.54 toks/s; Throughput: 5.96 requests/s, 2856.15 tokens/s
- added Navi configurations (Related PR: ROCm/triton#640)
- resolved cache miss issue during flash attention calls by fixing max_seqlen_q/k to 0
Please try to avoid force pushes after the initial reviews; they make it impossible to see the new changes.