Skip to content

Support stream capture of event record and wait nodes in cuda graphs#155372

Closed
galv wants to merge 13 commits intopytorch:mainfrom
galv:dgalvez/finish-external-events-3
Closed

Support stream capture of event record and wait nodes in cuda graphs#155372
galv wants to merge 13 commits intopytorch:mainfrom
galv:dgalvez/finish-external-events-3

Conversation

@galv
Copy link
Collaborator

@galv galv commented Jun 6, 2025

These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.

We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().

If external=False, the cudaEventRecord and cudaStreamWaitEvent API's
have a different meaning described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

In short, they will be used to experess fork and join operations in
the graph if external=False.

External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.

Finishes #146145

I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.

cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng

@galv galv requested review from ngimel and nmacchioni June 6, 2025 21:58
@galv galv requested review from eqy and syed-ahmed as code owners June 6, 2025 21:58
@pytorch-bot
Copy link

pytorch-bot bot commented Jun 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155372

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6aaeb54 with merge base bf7e290 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@galv galv added module: cuda graphs Ability to capture and then replay streams of CUDA kernels release notes: cuda release notes category labels Jun 6, 2025
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 9, 2025
@nmacchioni
Copy link
Contributor

Thanks for taking this over! This will unlock some very interesting possibilities for autotuning

@galv galv force-pushed the dgalvez/finish-external-events-3 branch from 5237974 to 744f7b2 Compare June 12, 2025 19:53
Copy link
Collaborator

@ngimel ngimel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks so much!

galv added 7 commits June 16, 2025 09:04
… right stream.

Correctly pass stream argument to cuLaunchKernel.

Previously, all kernels launched via _CudaKernel would use the NULL
stream. Whoops.
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.

We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().

If external=False, the cudaEventRecord and cudaStreamWaitEvent API's
have a different meaning described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

In short, they will be used to experess fork and join operations in
the graph if external=False.

External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
This reverts commit 5237974a9bc3ec105b46fe2a18904a95ca3a414e.

CI fails with errors like:

`error: use of undeclared identifier 'hipEventWaitExternal'`

Apparently the rocm version usd in CI is not new enough.
@galv galv force-pushed the dgalvez/finish-external-events-3 branch from ed53211 to 0f15a29 Compare June 16, 2025 16:07
Comment on lines +6507 to +6510
@unittest.skipIf(
not TEST_CUDA_GRAPH or TEST_WITH_ROCM,
"CUDA >= 11.0 required for external events in cuda graphs. rocm does not support external events",
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete this? cuda is always >=11.0 and you are skipping ROCM already

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I ask that we defer removal of the CUDA >= 11.0 checks in the rest of the cuda graphs test code and implementation code for another PR, though. I'm scared of making PR's touch too much separate code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sure but in this case it seems like a no brainer, and it's a new test.

@galv
Copy link
Collaborator Author

galv commented Jun 17, 2025

I made a small commit addressing #155372 (comment), but CI was passing, so this PR should be good to go once it passes again!

@ngimel
Copy link
Collaborator

ngimel commented Jun 17, 2025

Lint error is real, when fixed feel free to merge

@galv
Copy link
Collaborator Author

galv commented Jun 17, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 17, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged module: cuda graphs Ability to capture and then replay streams of CUDA kernels open source release notes: cuda release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants