[Feature] Faster Custom Paged Attention kernels by tjtanaa · Pull Request #385 · ROCm/vllm

tjtanaa · 2025-01-24T01:23:30Z

Description

This PR implements a faster Custom Paged Attention (CPA) kernel based on mfma16x16x16 instructions.
This feature is from ROCm/vllm (#372).
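
For readers unfamiliar with paged attention, here is a minimal pure-Python sketch of the block-table lookup that a paged attention decode step performs for one query and one head. It is not the kernel from this PR, which is a GPU kernel built on mfma16x16x16 instructions. The function name `paged_attention_decode`, its parameters, and the cache layout are hypothetical and chosen only for illustration.

```python
import math


def paged_attention_decode(query, key_cache, value_cache, block_table, seq_len, scale):
    """Reference single-head, single-query paged attention (illustrative only).

    query:       list of floats, length head_size
    key_cache:   key_cache[physical_block][slot] is a list of head_size floats
    value_cache: same layout as key_cache
    block_table: maps logical block index -> physical block index
    seq_len:     number of valid tokens in the sequence
    scale:       softmax scaling factor, typically 1 / sqrt(head_size)
    """
    block_size = len(key_cache[0])
    scores, values = [], []
    for pos in range(seq_len):
        # Translate the logical token position into a physical cache location.
        physical_block = block_table[pos // block_size]
        slot = pos % block_size
        key = key_cache[physical_block][slot]
        scores.append(scale * sum(q * k for q, k in zip(query, key)))
        values.append(value_cache[physical_block][slot])

    # Numerically stable softmax over the attention scores.
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)

    # Weighted sum of the value vectors.
    head_size = len(query)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(head_size)
    ]
```

Because tokens are located through `block_table`, the physical blocks of a sequence do not need to be contiguous or ordered in the cache; only the logical order of the table matters.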

maleksan85 and others added 18 commits November 13, 2024 10:22

corrected types for strides in triton FA (ROCm#274) (ROCm#276)

efb0432

Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> (cherry picked from commit 9a46e97)

Update test-template.j2 (ROCm#283)

d291770

Adding build only k8s node and queue names update

Fix max_seqlens_q/k initialization for Navi GPUs (ROCm#310)

679a15c

- max_seqlens_q/k variables were not correctly initialized for Navi GPUs, leading to incorrect outputs.
- Ensure that the correct values are passed to the attn_fwd kernel based on the GPU type.

Merge remote-tracking branch 'origin/develop'

fb82bf1

Setting the value for the speculative decoding worker class on rocm platform (ROCm#313)

22f9066

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

fp8 cpa

f57aa62

load nt optional for kv

4145bae

enable alibi; fix gfx90a compile

7bb4a48

checkpoint with head size 64 supported

bca5077

block size 32 fix

e2e7d19

Merge remote-tracking branch 'origin/shsanyal_develop_cpa_fp8' into pg_attn_to_llama_fp8

3c5b26e

clean up

2a669e4

further clean up and comments

51623be

kernel bug fixes and code cleaning.

e52eb1f

adjusted attention kernel unit test.

Merge remote-tracking branch 'origin/shsanyal_develop_cpa_fp8' into pg_attn_to_llama_fp8

3a6f752

fix unit test for rocm custom attention kernel

7267062

fix benchmark paged attention

b8e66a9

[Bugfix]: fix v1/v2 paged attention kernel unit test.

0f6ff75

This was referenced Jan 24, 2025

[FEAT] Improved PagedAttention FP8 (faster kvcache dequant v1) #346

Closed

[FEAT] Improved PagedAttention FP8 (faster kvcache dequant v2) #347

Closed

tjtanaa closed this Feb 10, 2025

tjtanaa deleted the pg_attn_to_llama_fp8 branch February 10, 2025 06:51

tjtanaa wants to merge 18 commits into ROCm:llama_fp8_12062024 from EmbeddedLLM:pg_attn_to_llama_fp8
