[ragged-paged-attn] Use hidden states in kv cache and support any num_kv_head #8851
vanbasten23 merged 4 commits into pytorch:master from
Conversation
mask_value = DEFAULT_MASK_VALUE
validate_ragged_paged_attention_inputs(q, k_pages, v_pages, kv_lens,
                                       page_indices, cu_q_lens, num_seqs)
Why did we stop calling validate_ragged_paged_attention_inputs?
Because we already have these static shape checks on the JAX side.
q_packing = get_dtype_packing(q_dtype)
max_q_tiling = 8 * q_packing
min_q_heads = lcm(max_q_tiling, num_q_heads_per_kv_head)
I am not sure I follow. If dtype is bf16, then max_q_tiling is 16. For Qwen, where num_q_heads=12, num_kv_head=2, and num_q_heads_per_kv_head=6, min_q_heads (= lcm(max_q_tiling, num_q_heads_per_kv_head)) will be 48. What does min_q_heads mean?
It tries to find the smallest number that is divisible by both max_q_tiling and num_q_heads_per_kv_head. If this number divides the total num_q_heads evenly, we use it as num_q_heads_per_blk; if not, we fall back to the total num_q_heads.
Checking divisibility by max_q_tiling makes sure the block can be fully tiled by XLA.
Checking divisibility by num_q_heads_per_kv_head makes sure we never need an inner split within num_q_heads_per_kv_head.
Thanks! Could you add what you said as a comment in the code?
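For reference, the heuristic described above can be sketched as follows. This is an illustrative reconstruction, not the kernel's actual code: `get_dtype_packing` and `pick_num_q_heads_per_blk` are hypothetical names, and packing is assumed to mean how many values fit in a 32-bit word.

```python
from math import lcm

def get_dtype_packing(dtype_bits: int) -> int:
    # Hypothetical helper: how many values of this bit width pack into 32 bits.
    return 32 // dtype_bits

def pick_num_q_heads_per_blk(num_q_heads: int, num_kv_heads: int,
                             q_dtype_bits: int) -> int:
    # min_q_heads is the smallest count divisible by both the XLA tiling
    # requirement (max_q_tiling) and the per-kv-head group size, so no
    # inner split of num_q_heads_per_kv_head is ever needed.
    num_q_heads_per_kv_head = num_q_heads // num_kv_heads
    max_q_tiling = 8 * get_dtype_packing(q_dtype_bits)
    min_q_heads = lcm(max_q_tiling, num_q_heads_per_kv_head)
    # Use it only if it divides the total head count evenly; otherwise
    # fall back to processing all q heads in one block.
    return min_q_heads if num_q_heads % min_q_heads == 0 else num_q_heads
```

In the Qwen-like bf16 example from the discussion, lcm(16, 6) = 48 does not divide 12, so the fallback (all 12 heads per block) is used.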
  raise ValueError(f"{num_seqs[0]=} must be less or equal to {max_num_seqs=}")
  max_kv_len = jnp.max(kv_lens)
- min_pages_per_seq = ceil_div(max_kv_len, page_size)
+ min_pages_per_seq = cdiv(max_kv_len, page_size)
Why is it min? Shouldn't it be max_pages_per_seq, since you used cdiv(jnp.max(kv_lens), page_size)?
That is the lower bound for pages_per_seq: the page table must have room for at least enough pages to hold the longest sequence.
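As a small illustration of that lower bound (a pure-Python sketch; `cdiv` here stands in for the kernel's ceiling-division helper, and the lengths are made up):

```python
def cdiv(a: int, b: int) -> int:
    # Ceiling division: smallest integer n with n * b >= a.
    return -(-a // b)

# The longest sequence dictates the minimum page-table width: with
# kv lengths [512, 1000, 64] and 128-token pages, every sequence's
# page list must have room for at least cdiv(1000, 128) = 8 entries.
kv_lens = [512, 1000, 64]
page_size = 128
min_pages_per_seq = cdiv(max(kv_lens), page_size)
```

A configured pages_per_seq smaller than this bound could not address all of the longest sequence's kv pages, hence the check.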
_, page_size, kv_model_dim = k_pages.shape
kv_packing = get_dtype_packing(k_pages.dtype)
if page_size % kv_packing != 0:
    raise ValueError(f"Expected {page_size=} is divisible by {kv_packing=}")
page_size % kv_packing != 0 indicates there will be padding, so we may waste some memory. Can we emit a warning instead of raising an exception?
The page size is chosen by the serving config; the error indicates we should choose a better one. Otherwise, when people use bf16 or quantized types (fp8, int8, int4), there will be no bandwidth savings. We should prevent this.
I see. I guess it's the same reason why, before this PR, we would raise an exception when num_kv_head == 1 and dtype == bfloat16:
xla/torch_xla/experimental/pallas_kernels/ragged_paged_attention_v2.py, lines 534 to 536 in c7d0b1e
("Previously, if num_kv_head == 1 and dtype == bfloat16, we would have implicit padding on TPU.") The point is the code may still run fine, but there will be no bandwidth savings.
Yes, the point of quantization is to save more memory and bandwidth
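A minimal sketch of the check being discussed, assuming values pack into 32-bit words (the helper name mirrors the kernel's `get_dtype_packing`, but this implementation and `check_page_size` are illustrative):

```python
def get_dtype_packing(dtype_bits: int) -> int:
    # How many values of this bit width fit into one 32-bit word.
    return 32 // dtype_bits

def check_page_size(page_size: int, dtype_bits: int) -> None:
    # If page_size is not a multiple of the packing factor, each page
    # ends with implicit padding on TPU, and the bandwidth savings of
    # narrow dtypes (bf16, fp8, int8, int4) are lost.
    kv_packing = get_dtype_packing(dtype_bits)
    if page_size % kv_packing != 0:
        raise ValueError(
            f"Expected {page_size=} to be divisible by {kv_packing=}")

check_page_size(16, 16)      # bf16: packing 2, 16 % 2 == 0, accepted
try:
    check_page_size(18, 8)   # int8: packing 4, 18 % 4 != 0
except ValueError:
    pass                     # rejected: the page would carry padding
```

Raising instead of warning forces the serving config to pick a page size that actually realizes the narrow-dtype bandwidth savings.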
vanbasten23 left a comment:
Thanks Jevin. LGTM pending on CI.
@bythew3i I assume you have run the tests in tests/pallas/tpu_ragged_paged_attention_test.py and they all pass?
Yes, I tested the kernel.
This PR uses hidden states (num_kv_head * head_dim) in the kv cache. This change unblocks us for any num_kv_head. Previously, if num_kv_head == 1 and dtype == bfloat16, we would have implicit padding on TPU. Now, by using hidden states directly from the projection, we no longer need strided loads; we can load by slice directly.
This PR should help us support multi-chip sharding, which shards num_kv_head down to 1 for llama-3-70B.
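To illustrate the layout change with NumPy (the shapes below are made up for the example): storing the projection output as hidden states of size num_kv_head * head_dim makes each head a contiguous slice of the last axis, so reading one head no longer requires a strided load.

```python
import numpy as np

# Illustrative shapes, not the PR's: 4 pages, page size 8,
# 2 kv heads, head_dim 128.
num_pages, page_size, num_kv_heads, head_dim = 4, 8, 2, 128

# Old layout: a separate head axis, so reading one head strides memory.
k_pages_old = np.zeros((num_pages, page_size, num_kv_heads, head_dim))

# New layout: the hidden-states dimension is stored as-is.
kv_model_dim = num_kv_heads * head_dim
k_pages = k_pages_old.reshape(num_pages, page_size, kv_model_dim)

# Loading head h is now a plain contiguous slice along the last axis.
h = 1
k_head = k_pages[..., h * head_dim:(h + 1) * head_dim]
assert k_head.shape == (num_pages, page_size, head_dim)
```

With num_kv_head sharded to 1 (as for llama-3-70B multi-chip sharding), the slice covers the whole last axis and no padding is introduced.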
Tested: