Open source TPU-friendly ragged paged attention kernel. by copybara-service[bot] · Pull Request #26920 · jax-ml/jax

copybara-service · 2025-03-04T21:38:58Z

Open source TPU-friendly ragged paged attention kernel.

Key features:

Support mixed prefill and decode to increase throughput for inference (e.g., 5x speedup compared to the padded Multi-Queries Paged Attention implementation for llama-3-8b).
No explicit swapaxes for seq_len and num_head before or after the kernel. The kernel takes num_head in the second-minor dimension, as it naturally is. We fold swapaxes into strided loads/stores in the kernel and apply the transpose on the fly.
No GMM (Grouped Matmul) metadata required! We calculate the metadata on the fly in the kernel. This can yield a 10% speedup!
Increase MXU utilization 8x in GQA by grouping shared q heads for the MXU in decode.
Minimize recompilation: the only factors that can cause recompilation are model specs, max_num_batched_tokens, and max_num_seqs in the mixed-engine setting.
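
To illustrate the GQA grouping idea from the list above, here is a minimal JAX sketch. It is not the kernel's code (the kernel itself is not shown in this description). The function names, tensor layouts, and the assumption that q head `h` shares KV head `h // group` are all hypothetical and chosen only for illustration. In decode there is one query token per sequence, so a per-head formulation multiplies a single vector by the keys. Grouping the q heads that share a KV head turns this into one `[group, head_dim] x [head_dim, kv_len]` matmul per KV head, which gives the matrix unit more rows to work on.

```python
import jax
import jax.numpy as jnp


def decode_scores_per_head(q, k):
    """Baseline: each q head does its own vector-matrix product.

    q: [num_q_heads, head_dim], a single decode token (assumed layout).
    k: [kv_len, num_kv_heads, head_dim] (assumed layout).
    """
    num_q_heads, _ = q.shape
    num_kv_heads = k.shape[1]
    group = num_q_heads // num_kv_heads
    # Assumption: q head h reads KV head h // group.
    kv_index = jnp.arange(num_q_heads) // group
    k_per_q = k[:, kv_index, :]  # [kv_len, num_q_heads, head_dim]
    return jnp.einsum("hd,khd->hk", q, k_per_q)


def decode_scores_grouped(q, k):
    """Grouped: q heads sharing a KV head form one matmul operand."""
    num_q_heads, head_dim = q.shape
    kv_len, num_kv_heads, _ = k.shape
    group = num_q_heads // num_kv_heads
    q_grouped = q.reshape(num_kv_heads, group, head_dim)
    # One [group, head_dim] x [head_dim, kv_len] matmul per KV head.
    scores = jnp.einsum("ngd,knd->ngk", q_grouped, k)
    return scores.reshape(num_q_heads, kv_len)


# Illustrative sizes only: 8 q heads sharing 2 KV heads (group of 4).
q_key, k_key = jax.random.split(jax.random.PRNGKey(0))
q = jax.random.normal(q_key, (8, 16))
k = jax.random.normal(k_key, (5, 2, 16))
per_head = decode_scores_per_head(q, k)
grouped = decode_scores_grouped(q, k)
```

Both functions compute the same attention scores. Only the shape of the work handed to the matmul unit differs, which is where the utilization gain described above would come from.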

PiperOrigin-RevId: 734269519

bythew3i mentioned this pull request Mar 5, 2025

Integrate ragged paged attention v2 pytorch/xla#8791

Merged

copybara-service Bot force-pushed the test_732369481 branch from be91ed1 to a82aec2 Compare March 6, 2025 21:21

copybara-service Bot force-pushed the test_732369481 branch from a82aec2 to 4b49c03 Compare March 6, 2025 21:36

copybara-service Bot merged commit 4b49c03 into main Mar 6, 2025

copybara-service Bot deleted the test_732369481 branch March 6, 2025 21:36
