Integrate dlpack to dynamo. #7173

Merged: vanbasten23 merged 14 commits into master from xiowei/integrate_dlpack_with_dynamo_fallback on Jun 14, 2024

Conversation

@vanbasten23 (Collaborator) commented on Jun 3, 2024

This PR integrates the DLPack API into dynamo so that when we move a tensor between CUDA and XLA, we no longer have to go through the CPU.

Test:
PJRT_DEVICE=CUDA python pytorch/xla/test/dynamo/test_dynamo.py -k test_simple_model_automoves_tensors
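
For context, a minimal sketch of the zero-copy conversion this enables (assuming the helpers in torch_xla/utils/dlpack.py; illustrative, not the PR's exact code):

import torch
from torch.utils import dlpack as torch_dlpack
import torch_xla.utils.dlpack as xla_dlpack

cuda_t = torch.randn(4, device="cuda:0")

# CUDA -> XLA: export the CUDA buffer as a DLPack capsule and import it
# into XLA, sharing the memory instead of staging through the CPU.
xla_t = xla_dlpack.from_dlpack(torch_dlpack.to_dlpack(cuda_t))

# XLA -> CUDA: the reverse direction, again without a CPU copy.
cuda_back = torch_dlpack.from_dlpack(xla_dlpack.to_dlpack(xla_t))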

@vanbasten23 vanbasten23 changed the title from "[DO NOT REVIEW YET] Integrate dlpack to dynamo." to "Integrate dlpack to dynamo." on Jun 6, 2024
@vanbasten23 vanbasten23 requested review from JackCaoG and ysiraichi June 6, 2024 17:42
@vanbasten23 vanbasten23 marked this pull request as ready for review June 6, 2024 17:43
@ysiraichi (Collaborator) left a comment:

Overall, LGTM.
I do think we need better testing for this use case, though. What do you think about running all dynamo tests with this flag set?

Comment thread: torch_xla/core/dynamo_bridge.py (outdated)
Comment thread: torch_xla/core/dynamo_bridge.py, on lines +712 to +714:
<< "The device currently being used : " << pjrt_device->DebugString()
<< " is different from the device where the buffer resides: "
Collaborator:

Cool, better error message!

@vanbasten23 (Collaborator, Author):

> Overall, LGTM. I do think we need better testing for this use case, though. What do you think about running all dynamo tests with this flag set?

Currently it works for inference but not for training. If we ran all dynamo tests, we would need to change all of them. OTOH, all this PR does is move CUDA tensors to the XLA device at the beginning of the dynamo bridge; the rest should remain the same. So do we really need to run all dynamo tests with the flag?

@vanbasten23 vanbasten23 requested a review from ysiraichi June 10, 2024 18:59
@vanbasten23 vanbasten23 force-pushed the xiowei/integrate_dlpack_with_dynamo_fallback branch from 5b47510 to 0f3218c on June 11, 2024 17:15
Comment thread: test/dynamo/test_dynamo.py (outdated):
xenv.ZERO_COPY_ENABLED: zero_copy_enabled,
})
x = torch.tensor(100.0).to(device="cuda:0")
y = torch.tensor(200.0).to(device="cuda:0")
Collaborator:

Shouldn't you check that the output is also on cuda:0?

Collaborator:

And somehow verify that the computation is run using dynamo, not the fallback.

vanbasten23 (Author):

> Shouldn't you check that the output is also on cuda:0?

The test already checks that:

self.assertTrue(res_xla_dynamo.device == original_device)

> And somehow verify that the computation is run using dynamo, not the fallback.

The test checks that tracing is skipped in subsequent runs:

# verify that tracing is skipped in following runs
met.clear_counters()
res_xla_dynamo_reused = fn_simple_dynamo(x, y)
self.assertNotIn('xla::add', met.counter_names())
Do you think that's enough?

Collaborator:

Since we are looking for fallbacks, what about using torch_xla._XLAC._get_executed_fallback_ops?

vanbasten23 (Author):

Sounds good. I added a check: self.assertEqual(torch_xla._XLAC._get_executed_fallback_ops(), [])
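
(For illustration, a hedged sketch of how such a check might be wrapped in a test helper; assert_no_fallback is a hypothetical name, not part of this PR:)

import torch_xla
import torch_xla.debug.metrics as met

def assert_no_fallback(test_case, fn, *args):
  # Hypothetical helper: clear counters, run the compiled function, then
  # assert that no op was executed through the fallback path.
  met.clear_all()
  result = fn(*args)
  test_case.assertEqual(torch_xla._XLAC._get_executed_fallback_ops(), [])
  return result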

# Have to move to CPU before moving it to target device.
moved_tensor = tensor.to(cpu_device)
moved_tensor = moved_tensor.to(target_device)
zero_copy_enabled = xu.getenv_as(xenv.ZERO_COPY_ENABLED, bool, defval=False)
Collaborator:

Is there a reason this has to be an env var? The biggest reason we use env vars is that we need to communicate something between the Python and C++ layers, or the master process needs to pass some information (like rank) to the child processes. In your case this is really just a config and should not be set as an env var.

Collaborator:

For your case I think you can just always use dlpack. Is there any reason we don't want to use dlpack to convert between XLA:GPU and CUDA?

vanbasten23 (Author):

The flag is only temporary. It's used to do an A/B test: how much performance we get when moving tensors through the CPU vs. how much we get by using dlpack.

The flag will be removed once the A/B testing is done.
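
(A rough sketch of the gated move being discussed; move_cuda_to_xla and the exact control flow are assumptions, not the PR's code:)

import torch
import torch_xla.core.xla_env_vars as xenv
import torch_xla.utils.utils as xu

def move_cuda_to_xla(tensor, target_device, cpu_device=torch.device("cpu")):
  # Temporary A/B switch: CPU round trip vs. DLPack zero copy.
  zero_copy_enabled = xu.getenv_as(xenv.ZERO_COPY_ENABLED, bool, defval=False)
  if zero_copy_enabled and tensor.device.type == "cuda":
    from torch.utils import dlpack as torch_dlpack
    import torch_xla.utils.dlpack as xla_dlpack
    # Share the CUDA buffer with XLA directly.
    return xla_dlpack.from_dlpack(torch_dlpack.to_dlpack(tensor))
  # Old path: have to move to CPU before moving to the target device.
  return tensor.to(cpu_device).to(target_device)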

@vanbasten23 vanbasten23 requested a review from JackCaoG June 12, 2024 15:55
@ysiraichi (Collaborator):

> Currently it works for inference but not for training. If we ran all dynamo tests, we would need to change all of them. OTOH, all this PR does is move CUDA tensors to the XLA device at the beginning of the dynamo bridge; the rest should remain the same. So do we really need to run all dynamo tests with the flag?

In this case, we could have a decorator or something like that for selecting a few tests to run with zero-copy. For example, there is DynamoInferenceBasicTest at test/dynamo/test_dynamo.py.

While this is a small change (quantity), it may end up changing the execution behavior in ways that we might not think of. I think it would make this PR more robust.
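
(A hypothetical sketch of such a decorator; the env var name is assumed to match the flag this PR adds, and the real tests may set it differently:)

import functools
import os

def with_and_without_zero_copy(test_fn):
  # Hypothetical decorator: run the wrapped test once per flag value so a
  # selected test exercises both the CPU path and the DLPack path.
  @functools.wraps(test_fn)
  def wrapper(self, *args, **kwargs):
    for flag in ("0", "1"):
      os.environ["ZERO_COPY_ENABLED"] = flag
      test_fn(self, *args, **kwargs)
  return wrapper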

@vanbasten23 (Collaborator, Author):

> Currently it works for inference but not for training. If we ran all dynamo tests, we would need to change all of them. OTOH, all this PR does is move CUDA tensors to the XLA device at the beginning of the dynamo bridge; the rest should remain the same. So do we really need to run all dynamo tests with the flag?

> In this case, we could have a decorator or something like that for selecting a few tests to run with zero-copy. For example, there is DynamoInferenceBasicTest at test/dynamo/test_dynamo.py.

Sure, I've added tests for DynamoInferenceBasicTest.

@ysiraichi (Collaborator) left a comment:

LGTM.

I left a few minor comments.
Thank you for taking the time to adapt the existing dynamo tests. They look great!

Comment thread: test/dynamo/test_dynamo.py (outdated), on lines +307 to +309:
# We need to make `dim` depend on `initialize_on_cuda` because the compilation cache
# does not clean itself between the parameterized tests.
dim = 5 + int(initialize_on_cuda)
Collaborator:

Do you mean (i) dynamo's cache or (ii) the XLA cache?

  • If it's (i), we can just reset it.
  • If it's (ii), can't we just reset it, too? If not, I would argue that we don't need to worry about it, since that's not what's being tested here.

What do you think?

vanbasten23 (Author):

It's (ii), and AFAIK there is no way to reset the XLA compilation cache.

> I would argue that we don't need to worry about it, since that's not what's being tested here.

Actually, the existing check

self.assertEqual(met.metric_data('CompileTime')[0], compile_count + 1)

tests the compilation cache. Without the change, the test would fail for the reason given in the comment. That's why I had to change it.
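
(For illustration, a hedged sketch of the pattern under discussion; the helper and assertions are assumptions, not the test's exact code:)

import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

def check_single_recompile(initialize_on_cuda):
  # Parameterized runs share one process, so the XLA compilation cache
  # survives between them; a distinct shape per run forces a fresh compile.
  dim = 5 + int(initialize_on_cuda)
  data = met.metric_data('CompileTime')
  compile_count = data[0] if data else 0
  t = torch.randn(dim, dim, device=xm.xla_device())
  res = t + t
  xm.mark_step()  # materialize the graph: triggers a compile and execution
  assert met.metric_data('CompileTime')[0] == compile_count + 1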

Collaborator:

Ah, I see. I guess that works. You could also run this check in only one of the runs (e.g., if initialize_on_cuda == True). But I think this is also OK.

Comment thread: torch_xla/core/dynamo_bridge.py
Comment thread: torch_xla/utils/dlpack.py
@vanbasten23 (Collaborator, Author):

Thanks for the review!

@vanbasten23 vanbasten23 merged commit c216d26 into master Jun 14, 2024
yitongh pushed a commit to AlibabaPAI/xla that referenced this pull request Oct 11, 2024
@miladm miladm added the xla:gpu label Nov 22, 2024
yitongh pushed a commit to AlibabaPAI/xla that referenced this pull request Dec 11, 2024
yitongh pushed a commit to AlibabaPAI/xla that referenced this pull request Dec 11, 2024