Support stream capture of event record and wait nodes in cuda graphs #155372
galv wants to merge 13 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155372
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 6aaeb54 with merge base bf7e290. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for taking this over! This will unlock some very interesting possibilities for autotuning.
Force-pushed from 5237974 to 744f7b2.
… right stream. Correctly pass stream argument to cuLaunchKernel. Previously, all kernels launched via _CudaKernel would use the NULL stream. Whoops.
These are created by the user passing cudaEventRecordExternal and cudaEventWaitExternal to cudaEventRecordWithFlags() and cudaStreamWaitEvent() respectively. We do this by allowing the user to specify external=True when constructing a torch.cuda.Event(). If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs have a different meaning, described here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events In short, they will be used to express fork and join operations in the graph if external=False. External events can be used for expressing a fine-grained dependency on the outcome of some nodes in a cuda graph (rather than all nodes). They can also be used for timing parts of a cuda graph's execution, rather than timing the entire graph's execution.
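The fine-grained dependency use case described above can be sketched as follows. This is a minimal, hedged illustration assuming the external=True flag this PR adds to torch.cuda.Event(); the function and variable names are illustrative, not part of the PR.

```python
# Sketch of the fine-grained dependency use case: an event recorded with
# external semantics inside graph capture marks an intermediate point in
# the graph, rather than expressing a fork/join edge.
# Assumption: torch.cuda.Event(external=True) as added by this PR.
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:
    HAVE_CUDA = False

def capture_with_external_event():
    """Capture a graph in which an external event marks an intermediate point."""
    if not HAVE_CUDA:
        return None  # no PyTorch/GPU available: nothing to demonstrate
    x = torch.ones(1 << 20, device="cuda")
    ev = torch.cuda.Event(external=True)  # external record/wait semantics
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = x * 2      # first part of the captured work
        ev.record()    # captured as an event-record node, not a fork/join edge
        z = y + 1      # rest of the captured work
    return g, ev, z

result = capture_with_external_event()
```

After `g.replay()`, a side stream could call `side_stream.wait_event(ev)` to depend only on the work recorded before `ev.record()`, rather than on the whole graph's completion.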
This reverts commit 5237974a9bc3ec105b46fe2a18904a95ca3a414e. CI fails with errors like: `error: use of undeclared identifier 'hipEventWaitExternal'` Apparently the rocm version used in CI is not new enough.
Force-pushed from ed53211 to 0f15a29.
@unittest.skipIf(
    not TEST_CUDA_GRAPH or TEST_WITH_ROCM,
    "CUDA >= 11.0 required for external events in cuda graphs. rocm does not support external events",
)
Delete this? cuda is always >=11.0 and you are skipping ROCM already
Sure. I ask that we defer removal of the CUDA >= 11.0 checks in the rest of the cuda graphs test code and implementation code to another PR, though. I'm scared of making PRs touch too much separate code.
Yeah sure but in this case it seems like a no brainer, and it's a new test.
I made a small commit addressing #155372 (comment), but CI was passing, so this PR should be good to go once it passes again!
Lint error is real; when fixed, feel free to merge.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.
We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning, described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, they will be used to express fork and join operations in
the graph if external=False.
External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
Finishes #146145
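The timing use case from the description above can be sketched like this. It is a hedged illustration only: it assumes external=True combines with enable_timing=True (the PR description mentions timing parts of a graph but shows no code), and the function and variable names are illustrative.

```python
# Sketch: two external events bracket part of a captured CUDA graph so that
# only that region is timed, rather than the entire graph's execution.
# Assumptions: torch.cuda.Event(external=True) as added by this PR, and that
# it can be combined with enable_timing=True.
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:
    HAVE_CUDA = False

def time_middle_of_graph():
    """Replay a graph and return elapsed milliseconds for its middle region."""
    if not HAVE_CUDA:
        return None  # no PyTorch/GPU available: nothing to demonstrate
    start = torch.cuda.Event(enable_timing=True, external=True)
    stop = torch.cuda.Event(enable_timing=True, external=True)
    x = torch.randn(1 << 20, device="cuda")
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = x + 1        # untimed prologue
        start.record()   # external record node: timed region begins
        z = y * y        # only this part of the graph is timed
        stop.record()    # external record node: timed region ends
        w = z - 1        # untimed epilogue
    g.replay()
    torch.cuda.synchronize()
    return start.elapsed_time(stop)  # milliseconds for the middle region
```

Without external=True, the record calls during capture would instead be interpreted as the fork/join cross-stream dependencies described in the CUDA programming guide link above.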
I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.
cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng