Integrate dlpack to dynamo. #7173
Conversation
ysiraichi
left a comment
Overall, LGTM.
I do think we need better testing for this use case, though. What do you think about running all dynamo tests with this flag set?
| << "The device currently being used : " << pjrt_device->DebugString() | ||
| << " is different from the device where the buffer resides: " |
Cool, better error message!
Currently it works for inference but not for training. If we ran all dynamo tests, we would need to change all of them. OTOH, all this PR does is move CUDA tensors to the XLA device at the beginning of the dynamo bridge; all the rest should remain the same. So do we really need to run all dynamo tests with the flag?
…o avoid null pjrt buffer when xla's default cuda device is different from the input cuda tensor's cuda device
force-pushed from 5b47510 to 0f3218c
xenv.ZERO_COPY_ENABLED: zero_copy_enabled,
})
x = torch.tensor(100.0).to(device="cuda:0")
y = torch.tensor(200.0).to(device="cuda:0")
shouldn't you check that output is also on cuda:0?
and somehow verify that the computation is run using dynamo, not a fallback or something
> shouldn't you check that output is also on cuda:0?
The test already checks that: xla/test/dynamo/test_dynamo.py, line 169 in 3f54fa2
> and somehow verify that the computation is run using dynamo, not a fallback or something
The test checks that tracing is skipped in the following runs: xla/test/dynamo/test_dynamo.py, lines 172 to 175 in 3f54fa2
Since we are looking for fallbacks, what about using torch_xla._XLAC._get_executed_fallback_ops?
SG. I added a check self.assertEqual(torch_xla._XLAC._get_executed_fallback_ops(), [])
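For reference, the checks being discussed would look roughly like this in standalone form (a sketch, not the test file verbatim; the model function and the openxla backend invocation are assumptions here, while _get_executed_fallback_ops is the API mentioned above):

```python
import torch
import torch_xla

def fn(a, b):
    return a * 2 + b

# Compile through dynamo with the XLA backend (backend name assumed).
compiled_fn = torch.compile(fn, backend="openxla")

x = torch.tensor(100.0).to(device="cuda:0")
y = torch.tensor(200.0).to(device="cuda:0")
res = compiled_fn(x, y)

# The output should land back on the original CUDA device...
assert res.device == x.device
# ...and no op should have gone through the aten fallback path.
assert torch_xla._XLAC._get_executed_fallback_ops() == []
```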
# Have to move to CPU before moving it to target device.
moved_tensor = tensor.to(cpu_device)
moved_tensor = moved_tensor.to(target_device)
zero_copy_enabled = xu.getenv_as(xenv.ZERO_COPY_ENABLED, bool, defval=False)
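Roughly, the branch under discussion would look like this (a minimal sketch, not the PR's exact code; the from_dlpack helper and its module path are assumptions, while xu.getenv_as / xenv.ZERO_COPY_ENABLED come from the diff above):

```python
import torch
import torch_xla.utils.utils as xu
import torch_xla.core.xla_env_vars as xenv


def move_cuda_tensor_to_xla(tensor, target_device, cpu_device=torch.device("cpu")):
    # Sketch only: function and helper names are illustrative.
    zero_copy_enabled = xu.getenv_as(xenv.ZERO_COPY_ENABLED, bool, defval=False)
    if zero_copy_enabled and tensor.device.type == "cuda":
        # Zero-copy path: hand the CUDA buffer to XLA:GPU via DLPack
        # (assumed helper; the actual PR wires this through the dynamo bridge).
        from torch_xla.utils.dlpack import from_dlpack
        return from_dlpack(tensor)
    # Default path: stage the tensor through CPU before moving it to XLA.
    return tensor.to(cpu_device).to(target_device)
```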
Is there a reason this has to be an env var? The biggest reason we use env vars is that we need to communicate something between the Python and C++ layers, e.g. the master process needs to pass some information (like rank) to the child process. In your case this is really just a config and should not be set as an env var.
For your case I think you can just always use dlpack; is there any reason we wouldn't want to use dlpack to convert between XLA:GPU and CUDA?
The flag is only temporary. It's used for A/B testing: comparing the performance we get when moving tensors through CPU vs. the performance we get by using dlpack. The flag will be removed once the A/B testing is done.
…ed to call .detach()
…pjrt buffer is valid.
In this case, we could have a decorator or something like that for selecting a few tests to run with zero-copy; for example, something along the lines of the sketch below. While this is a small change (in quantity), it may end up changing the execution behavior in ways that we might not think of. I think it would make this PR more robust.
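Just a sketch: the decorator name and the raw env-var string are illustrative; the real code reads the flag via xenv.ZERO_COPY_ENABLED.

```python
import functools
import os
import unittest


def with_and_without_zero_copy(test_fn):
    """Run the wrapped test twice: once with the default CPU staging path and
    once with the DLPack zero-copy path enabled."""
    @functools.wraps(test_fn)
    def wrapper(self, *args, **kwargs):
        for flag in ("0", "1"):
            # Caveat: if the flag is read once at import time, it would have to
            # be set before importing torch_xla instead of per test.
            os.environ["ZERO_COPY_ENABLED"] = flag
            with self.subTest(zero_copy=flag):
                test_fn(self, *args, **kwargs)
    return wrapper


class DynamoZeroCopyTest(unittest.TestCase):
    @with_and_without_zero_copy
    def test_simple_model_automoves_tensors(self):
        ...  # existing test body
```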
Sure, I've added tests for it.
ysiraichi
left a comment
LGTM.
I left a few minor comments.
Thank you for taking the time to adapt the existing dynamo tests. They look great!
# We need to make `dim` depend on `initialize_on_cuda` because the compilation cache
# does not clean itself between the parameterized tests.
dim = 5 + int(initialize_on_cuda)
Do you mean (i) dynamo's cache or (ii) XLA cache?
- If it's (i), we can just reset it
- If it's (ii), can't we just reset it, too? If not, I would argue that we don't need to worry about it, since that's not what's being tested here
What do you think?
It's (ii), and I don't think there is a way to reset the XLA compilation cache, afaik.
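To make the cache concern concrete, a rough illustration (not code from the test; metric_data comes from torch_xla.debug.metrics, and the exact behavior depends on the runtime):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()

# Two runs that trace an identically-shaped graph: the second one is served from
# the in-process XLA compilation cache, so the compile counter does not move.
# That is why each parameterization gets its own `dim` in the test above.
a = torch.randn(5, 5, device=device) + 1
xm.mark_step()
compiles_before = met.metric_data('CompileTime')[0]

b = torch.randn(5, 5, device=device) + 1
xm.mark_step()
compiles_after = met.metric_data('CompileTime')[0]
assert compiles_after == compiles_before  # cached executable was reused
```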
> I would argue that we don't need to worry about it, since that's not what's being tested here
Actually, there is the existing test at xla/test/dynamo/test_dynamo.py, line 276 in ea2a6f7.
Ah, I see. I guess that works. You could also run this check in only one of the runs (e.g. if initialize_on_cuda == True). But I think this is also ok.
Thanks for the review!
This PR integrates the DLPack API into dynamo so that when we move a tensor between CUDA and XLA we don't have to go through CPU anymore.
Test:
PJRT_DEVICE=CUDA python pytorch/xla/test/dynamo/test_dynamo.py -k test_simple_model_automoves_tensors