[Pallas] Make FlashAttention a torch.autograd.Function #6886
alanwaketan merged 9 commits into master from
Conversation
```python
# Import JAX within the function such that we don't need to call the jax_import_guard()
# in the global scope which could cause problems for xmp.spawn.
jax_import_guard()
```
We shouldn't need this here, right? The forward should have been called at this point, and fwd/bwd happen in the same process.
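For reference, a minimal sketch of the guard pattern under discussion. It assumes `jax_import_guard` lives in `torch_xla.experimental.custom_kernel` and that JAX exposes the Pallas kernel at the path below (both version-dependent); the helper name is illustrative:

```python
from torch_xla.experimental.custom_kernel import jax_import_guard

# Importing JAX at module scope would initialize the TPU runtime in the
# parent process, which can hang children created later by xmp.spawn.
# Instead, the import happens at call time, after the guard has let
# PyTorch/XLA grab the TPU before JAX locks it.


def _load_flash_attention_kernel():
  jax_import_guard()
  from jax.experimental.pallas.ops.tpu.flash_attention import flash_attention
  return flash_attention
```

On the reviewer's point: by the time `backward` runs, `forward` has already executed this dance in the same process, so a second guard call there should be redundant.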
| "block_q_major", "block_k_major", "block_k", "sm_scale", "causal", | ||
| "mask_value", "debug" | ||
| ]) | ||
| grad_q = torch.empty(q.shape, dtype=q.dtype).to(xm.xla_device()) |
It should be fine for the most part, but I think it is better to do .to(q.device) instead of moving to xm.xla_device(). Should we add a check somewhere to make sure all tensors are on the XLA device?
Fixed the .to(). For the second question, I guess Mosaic already guards it?
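As a sketch of the two allocation styles being compared (the helper name and the assert are illustrative, not from the PR):

```python
import torch


def _alloc_grad(q: torch.Tensor) -> torch.Tensor:
  # Before: torch.empty(q.shape, dtype=q.dtype).to(xm.xla_device())
  # always lands on the process-default XLA device.
  # After (reviewer's suggestion): follow the input tensor, which also
  # behaves correctly when q lives on a non-default XLA device.
  assert q.device.type == 'xla', 'expected q on an XLA device'
  return torch.empty(q.shape, dtype=q.dtype).to(q.device)
```

`torch.empty_like(q)` would express the same allocation (shape, dtype, and device inherited from q) more compactly.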
```python
    o.to(torch.float32) * grad_output.to(torch.float32),
    axis=-1)  # [batch_size, num_heads, q_seq_len]

expanded_l = l.unsqueeze(-1).expand(3, 2, 128,
```
Oops, I shouldn't hardcode the shape...
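A hypothetical fix that reads the sizes off `l` itself instead of spelling out the test tensor's dimensions (`expand_l` and `head_dim` are illustrative names, not from the PR):

```python
import torch


def expand_l(l: torch.Tensor, head_dim: int) -> torch.Tensor:
  # l: [batch_size, num_heads, q_seq_len] -> [..., head_dim], with the
  # sizes taken from l rather than hardcoded literals like (3, 2, 128, ...).
  return l.unsqueeze(-1).expand(*l.shape, head_dim)
```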
```python
mse = torch.nn.MSELoss()
for i in [(q, q_grad), (k, k_grad), (v, v_grad)]:
  self.assertTrue(mse(i[0].grad.cpu(), i[1].cpu()) < 1e-4)
```
I am not sure what this part is checking; do you mind explaining a bit?
Since the gradients are slightly off, it's hard to use torch.allclose here. I'm using MSE to measure the difference and check that it's close to zero.
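Restated as a hypothetical helper, in case the intent helps (`grads_close` is an illustrative name):

```python
import torch


def grads_close(a: torch.Tensor, b: torch.Tensor, tol: float = 1e-4) -> bool:
  # torch.allclose can fail on a single element-wise outlier even when the
  # overall error is tiny; a mean-squared-error threshold tolerates small
  # per-element noise while still catching a genuinely wrong gradient.
  return torch.nn.functional.mse_loss(a.cpu(), b.cpu()).item() < tol
```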
Summary:
This pull request makes the flash attention kernel a torch.autograd.Function so that we can enable backward on the kernel.
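A rough sketch of what that wrapping looks like, assuming the forward kernel returns the output plus the softmax residuals (l, m) that backward consumes; the `_flash_attention_*` helpers are hypothetical stand-ins for the Pallas kernel launches, not the PR's exact code:

```python
import torch
from torch_xla.experimental.custom_kernel import jax_import_guard


class FlashAttention(torch.autograd.Function):

  @staticmethod
  def forward(ctx, q, k, v):
    jax_import_guard()
    # Hypothetical helper that traces and launches the Pallas forward
    # kernel, returning the output o and the softmax residuals l and m.
    o, l, m = _flash_attention_forward(q, k, v)
    ctx.save_for_backward(q, k, v, o, l, m)
    return o

  @staticmethod
  def backward(ctx, grad_output):
    q, k, v, o, l, m = ctx.saved_tensors
    # Hypothetical helper wrapping the Pallas backward kernel; it returns
    # one gradient per forward input.
    grad_q, grad_k, grad_v = _flash_attention_backward(
        q, k, v, o, l, m, grad_output)
    return grad_q, grad_k, grad_v
```

Callers would then invoke `FlashAttention.apply(q, k, v)`, and autograd routes grad_output into `backward` automatically.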
Test Plan:
```
PJRT_DEVICE=TPU python test/test_pallas.py -v -k test_flash_attention_backward
```