Add heuristic default block sizes for different cases in ragged attention kernel #8922
yaochengji merged 8 commits into master
Conversation
@@ -856,6 +867,22 @@ def test_ragged_paged_attention_wrapper_without_dynamo(
        use_dynamo=False,
    )
Remove the old _test_ragged_paged_attention?
We'd better also test the non-None block size parameter; that's why it's still there.
@@ -817,6 +812,22 @@ def test_ragged_paged_attention_wrapper_with_dynamo(
        use_dynamo=True,
Remove the old _test_ragged_paged_attention?
We'd better also test the non-None block size parameter; that's why it's still there.
    raise NotImplementedError("TPU version must be 4 or higher.")
  # NOTE: the TPU v4's vmem capacity is 16MB
  if tpu_version == 4:
    vmem_limit_bytes = 16 * 1024 * 1024
It is fine; even if we set more than 16 MB, it will still use 16 MB.
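As a rough illustration of the version check in the hunk above, here is a minimal sketch of how the vmem limit could be picked per TPU generation. Only the 16 MB value for TPU v4 comes from this diff; the value for newer generations is a placeholder assumption.

```python
def get_vmem_limit_bytes(tpu_version: int) -> int:
  # Sketch only: mirrors the branch shown in the diff above.
  if tpu_version < 4:
    raise NotImplementedError("TPU version must be 4 or higher.")
  if tpu_version == 4:
    # NOTE: the TPU v4's vmem capacity is 16MB; asking for more is still
    # capped at 16MB, per the review comment above.
    return 16 * 1024 * 1024
  # Placeholder assumption for newer TPU generations with larger vmem.
  return 64 * 1024 * 1024
```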
            soft_cap=soft_cap,
            pad_tokens_and_seqs=pad_tokens_and_seqs,
            use_dynamo=True,
            num_kv_pages_per_block=None,
nit: consider making the block sizes a parameter, e.g. num_kv_pages_per_block=[16, None]; similarly for num_queries_per_block.
Thanks for the suggestion, done.
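A hypothetical sketch of the suggested parameterization; the real test helper and its full argument list are not reproduced here, and the non-None values are illustrative. A None value exercises the heuristic defaults added by this PR, a non-None value the explicit path.

```python
import itertools

for num_kv_pages_per_block, num_queries_per_block in itertools.product(
    [16, None], [128, None]):
  print(f"case: num_kv_pages_per_block={num_kv_pages_per_block}, "
        f"num_queries_per_block={num_queries_per_block}")
  # In the actual test these values would be forwarded to
  # _test_ragged_paged_attention(..., num_kv_pages_per_block=...,
  # num_queries_per_block=...).
```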
  # This heuristic is based on the initial kernel micro-benchmarking:
  # when token_num is small, there is no long prefill request;
  # when it is larger, the block size is adjusted for it.
  if token_num <= 128:
I wonder if we should choose the block sizes in vLLM. If we do it in torch_xla, we need to change it there and wait for the next wheel tomorrow; if we do it in vLLM, it would be more convenient. wdyt?
If vLLM passes a non-None block size, the default value will not be used.
I wonder why we couldn't do _get_default_ragged_paged_attention_block_size in vLLM...
Usually it's a good idea to put the tuned-parameter table in the kernel lib, not in the app lib.
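To make that concrete, here is a hedged sketch of what a kernel-side default helper of this shape could look like. The branch on token_num comes from the hunk above, but the returned block sizes are illustrative placeholders rather than the tuned values this PR actually adds.

```python
def _get_default_ragged_paged_attention_block_size(token_num: int):
  """Return (num_kv_pages_per_block, num_queries_per_block).

  Sketch only: the token_num branch mirrors the diff above; the returned
  values are placeholders, not the micro-benchmarked defaults from this PR.
  """
  if token_num <= 128:
    # Small token counts mean no long prefill request, so small blocks
    # keep vmem pressure low.
    return 16, 32  # placeholder values
  # Larger token counts benefit from bigger blocks that amortize
  # per-block overhead.
  return 128, 64  # placeholder values
```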
64d9bad to 8286d56
No description provided.