Enable PagedAttention through Pallas #6912
Conversation
cc @WoosukKwon to take a look
Locally, the tests are succeeding on my v4. I also just triggered the TPU CI on this PR.
The CPU CI is failing with an unrelated test. The rest of the CI, including the TPU CI, is passing, so this PR should be good to review. Thanks!
    torch.allclose(
        output.cpu()[seq_lens > 0],
        expected_output.cpu()[seq_lens > 0],
        atol=1e-1,
wdyt we use a tighter bound for atol and rtol? e.g. 1e-3
Sg, updated to 1e-5 for both tests.
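For reference, roughly what the tightened check looks like (a sketch, assuming the same `output`, `expected_output`, and `seq_lens` tensors from the diff above and that the call is wrapped in an assertion):

```python
# Compare only the rows for sequences that actually hold tokens, with the
# tighter 1e-5 tolerances agreed on above.
self.assertTrue(
    torch.allclose(
        output.cpu()[seq_lens > 0],
        expected_output.cpu()[seq_lens > 0],
        rtol=1e-5,
        atol=1e-5))
```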
miladm
left a comment
Thanks @wonjoolee95 - left a comment for you to eval and address - approving to unblock you
alanwaketan
left a comment
In general, it looks good to me. Left a few comments.
  return FlashAttention.apply(q, k, v, causal)


def paged_attention(q, k_pages, v_pages, lengths, page_indices,
The original kernel has this thing called q_dtype_for_kernel_launch. What does it do? Should we copy that as well?
In the original kernel, the q_dtype_for_kernel_launch is always either jnp.float32 or q's dtype. In our case, I'm expecting the passed-in q's dtype to be torch.float32, so the q_dtype_for_kernel_launch will always be float32.
No, I don't think that will be the case for the actual workflow. It could be bf16 or even int8, etc.
I see, makes sense. Just updated to handle q_dtype_for_kernel_launch, following JAX's kernel -- https://github.com/google/jax/blob/main/jax/experimental/pallas/ops/tpu/paged_attention/paged_attention_kernel.py#L393. I can follow up in another PR to add some more unit tests for different dtypes for q.
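For context, a rough paraphrase of how the linked JAX kernel picks the launch dtype (the names below are illustrative, and the real kernel also adjusts its block specs in the same branch):

```python
import jax.numpy as jnp

def _q_dtype_for_kernel_launch(q, num_kv_heads):
  # Paraphrased from JAX's paged_attention kernel: when the query-head to
  # kv-head ratio is not a multiple of 8, reshape q to hint XLA toward a
  # <1x128> layout and launch the kernel in float32; otherwise keep q's dtype.
  batch_size, num_heads, head_dim = q.shape
  if (num_heads // num_kv_heads) % 8 != 0:
    return q.reshape(batch_size, num_heads, 1, head_dim), jnp.float32
  return q, q.dtype
```

The kernel then casts q to that dtype right before the launch, so a bf16 or int8 query only falls back to float32 in the reshape case.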
                    pages_per_compute_block: int):
  # This will be called when dynamo uses fake tensors to construct the fake output.
  # We need to make sure the output tensor's shape is correct.
  if k.device != torch.device("meta"):
It feels like this part can be consolidated with the flash attention one.
Sg, refactored these into a helper function.
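A hypothetical shape of that helper (the name, signature, and empty-tensor return are assumptions about the refactor, not the exact merged code):

```python
import warnings
import torch

def _fake_attention_output(k, output_shape):
  # Both the flash attention and paged attention wrappers can call this when
  # dynamo traces with fake tensors: no kernel is launched, so only the output
  # tensor's shape needs to be right.
  if k.device != torch.device("meta"):
    warnings.warn(
        'XLA flash attention should only be applied to tensors on XLA device')
  return torch.empty(output_shape, dtype=k.dtype, device=k.device)
```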
  @unittest.skipIf(xr.device_type() != 'TPU' or tpu.version() < 4,
                   "This test only works on TPUv4+.")
  def test_paged_attention_wrapper(self):
    jax.config.update('jax_default_matmul_precision', jax.lax.Precision.HIGHEST)
It's interesting that you use JAX as the reference. I guess that works too. Wondering if we can just use the eager attention helper in the class instead? Or does that not work? Anyway, if you are using JAX as the reference, you can drop this.
Sg, yeah I saw that we're dependent on JAX Pallas anyways, so I thought it may be easier to just test against JAX's outputs.
Ah, makes sense. Just removed the jax.config updates.
        q_xla,
        k_pages_xla,
        v_pages_xla,
        seq_lens_xla,
Can you explain what these seq_lens are? Are these the previous tokens for each batch in k, v?
Yep, that is my understanding -- seq_lens here is the number of tokens that have already been processed for each sequence in the batch. Reference: https://docs.vllm.ai/en/latest/dev/kernel/paged_attention.html#concepts.
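As a toy illustration (values made up), seq_lens has one entry per batch slot:

```python
import torch

# Number of tokens already processed (and cached in k_pages / v_pages) for
# each of the 4 sequences in the batch; a 0 means the slot holds no tokens.
seq_lens = torch.tensor([128, 37, 0, 512], dtype=torch.int32)  # shape (batch_size,)
```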
  # We need to make sure the output tensor's shape is correct.
  if k.device != torch.device("meta"):
    warnings.warn(
        'XLA flash attention should only be applied to tensors on XLA device')
nit, paged attention instead of flash attention
Actually, it is not even specific to paged attention; you can just make this warning message more general.
Good catch, updated to use an f string.
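Something along these lines (the `attention_type` parameter name is an assumption, not necessarily what the final code uses):

```python
import warnings

def _warn_non_xla(attention_type: str):
  # e.g. attention_type = "paged attention" or "flash attention"
  warnings.warn(
      f'XLA {attention_type} should only be applied to tensors on XLA device')
```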
  step = torch.zeros((1,), dtype=torch.int32).to("xla")
  output_shape = torch.Size(list(q.shape[:-1]) + [1])
  q_output_dtype = torch.float32
  if (num_heads // num_kv_heads) % 8 != 0:
I guess you can combine this with the above L396 code.
Good catch! Updated.
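A rough sketch of the combined branch (variable names follow the diff above; this is not the exact merged code):

```python
import torch

def _prepare_q(q, num_kv_heads):
  # Fold the layout hint and the output/launch dtype decision into one branch,
  # mirroring the JAX kernel: fall back to float32 only when the query-head to
  # kv-head ratio is not a multiple of 8.
  batch_size, num_heads, head_dim = q.shape
  q_output_dtype = torch.float32
  if (num_heads // num_kv_heads) % 8 != 0:
    q = q.reshape(batch_size, num_heads, 1, head_dim)
  else:
    q_output_dtype = q.dtype
  return q, q_output_dtype
```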
Thanks all for the reviews. After addressing all the comments, the two unit tests are still passing locally on my v4. I'll let the TPU CI verify one more time before merging.
      ], payload, [q.shape, output_shape, output_shape],
      [q_output_dtype, torch.float32, torch.float32])

  return output.reshape(batch_size, num_heads, head_dim)
You probably want to use .to to cast the output back to the original dtype here.
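So the final return would become something like this (a sketch; `q.dtype` here stands in for the caller's original query dtype):

```python
  # Cast back in case the kernel was launched (and produced output) in float32.
  return output.reshape(batch_size, num_heads, head_dim).to(q.dtype)
```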
Merging as all CI is green.
Enable PagedAttention through Pallas
Test plan:
Todo as follow-ups: