[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 by nWEIdia · Pull Request #154097 · pytorch/pytorch

nWEIdia · 2025-05-22T03:30:46Z

Local test results on T4:

root@ab7cb4533ac9:/my_workspace/wei-pytorch# python test/distributed/test_symmetric_memory.py SymmMemSingleProcTest.test_stream_write_value32 -v
test_stream_write_value32 (__main__.SymmMemSingleProcTest.test_stream_write_value32) ... ok

Ran 1 test in 0.186s

OK

The implementation is copied from https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/util/cuda_driver.cpp and NVIDIA/TransformerEngine#1835 (cc @ptrendx).

This PR modifies PyTorch to use the runtime driver API for the cuStreamWriteValue32 function.
Here are the key changes:

- Added get_symbol function (c10/cuda/driver_api.cpp & .h). This function:
  - Uses cudaGetDriverEntryPoint and cudaGetDriverEntryPointByVersion to dynamically load CUDA driver functions
  - Searches for driver entry points in the current runtime context
  - Provides version-aware symbol loading for better compatibility
- Updated cuStreamWriteValue32 implementation (CUDASymmetricMemoryOps.cu)
- Removed test skip condition (test_symmetric_memory.py)

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @eqy @tinglvv @atalman @malfet @huydhn @wujingyue

pytorch-bot · 2025-05-22T03:30:49Z

❌ 1 New Failure

As of commit c10c9a7 with merge base 3580b8d:

NEW FAILURE - The following job has failed:

Lint / lintrunner-clang / linux-job (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wujingyue · 2025-05-22T03:48:26Z

torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu

Why not fix this in c10/cuda/driver_api.h for all driver APIs?

wujingyue · 2025-05-22T03:49:48Z

torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu

See NVIDIA/Fuser#4344

wujingyue · 2025-05-22T03:50:03Z

torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu

See NVIDIA/Fuser#4349

malfet

Please don't use std::once_flag; with C++11 there is no need for it anymore.
Instead of fixing this for just one call, please update the entire mechanism for c10::cuda::DriverAPI if using this API is better than doing dlopen (which seems to be the case).

malfet · 2025-05-22T14:14:46Z

torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu

Please don't use std::once_flag; instead, rely on the C++11 built-in mechanism by using something like:

static auto pfn_cuStreamWriteValue32 = []() {
  // Move all the initialization logic here; it is guaranteed to run exactly once, the first time the function is called.
}();
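
For readers unfamiliar with the pattern suggested here, the following is a minimal, CUDA-free sketch of a function-local static initialized by an immediately invoked lambda. Since C++11, the initialization of such a static runs exactly once, even if several threads make the first call concurrently, so no std::once_flag is required. The names WriteValueFn, resolve_write_value, fake_write_value, and lookup_count are hypothetical stand-ins chosen for illustration; they are not PyTorch or CUDA APIs, and real code would resolve the driver entry point inside the lambda instead.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a driver function pointer type such as
// PFN_cuStreamWriteValue32_v2; the signature here is illustrative only.
using WriteValueFn = int (*)(std::uint32_t* dst, std::uint32_t value);

// Counts how often the (potentially expensive) symbol lookup runs.
int& lookup_count() {
  static int count = 0;
  return count;
}

// Fake "driver" function so the sketch runs without CUDA.
int fake_write_value(std::uint32_t* dst, std::uint32_t value) {
  assert(dst != nullptr);
  *dst = value;
  return 0;  // 0 plays the role of a success code
}

// Assumed lookup helper; real code would query the driver here.
WriteValueFn resolve_write_value() {
  ++lookup_count();
  return &fake_write_value;
}

int write_value(std::uint32_t* dst, std::uint32_t value) {
  // Initialized once, on the first call; thread-safe since C++11.
  static const WriteValueFn fn = []() {
    WriteValueFn resolved = resolve_write_value();
    if (resolved == nullptr) {
      throw std::runtime_error("failed to resolve write-value symbol");
    }
    return resolved;
  }();
  return fn(dst, value);
}
```

Calling write_value repeatedly reuses the cached pointer, so the lookup helper runs only on the first call. If the lambda throws, the static is left uninitialized and initialization is attempted again on the next call.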

ngimel

This should be fixed in c10::cuda::DriverAPI::get(). There is a reason we have a common wrapper for the driver API: to avoid this kind of spaghetti code at callsites.
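
To illustrate the design being asked for, here is a rough, CUDA-free sketch of a centralized wrapper: one table of function pointers, resolved once behind a get() accessor, so callsites never perform symbol lookup themselves. Everything in it (the sketch namespace, DriverTable, resolve, resolve_or_throw, and the fake_* functions) is hypothetical and only loosely modeled on the idea of c10::cuda::DriverAPI::get(); it is not the actual PyTorch implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

namespace sketch {

// Illustrative function pointer types; real driver signatures differ.
using Write32Fn = int (*)(std::uint32_t*, std::uint32_t);
using Wait32Fn = int (*)(const std::uint32_t*, std::uint32_t);

// Fake "driver" functions so the sketch runs without CUDA.
int fake_write32(std::uint32_t* addr, std::uint32_t value) {
  assert(addr != nullptr);
  *addr = value;
  return 0;
}
int fake_wait32(const std::uint32_t* addr, std::uint32_t value) {
  return *addr == value ? 0 : 1;
}

// Assumed resolver: the single place where the lookup strategy lives.
// Swapping dlopen/dlsym for a runtime entry-point query would only
// require changing this function.
void* resolve(const std::string& name) {
  if (name == "write32") return reinterpret_cast<void*>(&fake_write32);
  if (name == "wait32") return reinterpret_cast<void*>(&fake_wait32);
  return nullptr;
}

template <typename Fn>
Fn resolve_or_throw(const char* name) {
  void* ptr = resolve(name);
  if (ptr == nullptr) {
    throw std::runtime_error(std::string("missing driver symbol: ") + name);
  }
  return reinterpret_cast<Fn>(ptr);
}

struct DriverTable {
  Write32Fn write32_;
  Wait32Fn wait32_;

  // All symbols are resolved together, once, on first use.
  static DriverTable* get() {
    static DriverTable table = {
        resolve_or_throw<Write32Fn>("write32"),
        resolve_or_throw<Wait32Fn>("wait32"),
    };
    return &table;
  }
};

}  // namespace sketch
```

A callsite then reduces to something like `sketch::DriverTable::get()->write32_(&flag, 5)`, and a change in how symbols are acquired touches only resolve, not every caller.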

ngimel · 2025-06-04T18:11:06Z

Any update on this one? What is the current status of distributed job on 12.6? cc @atalman

nWEIdia · 2025-06-04T19:18:56Z

> Any update on this one? What is the current status of distributed job on 12.6? cc @atalman

I was mostly working on #154448 and am planning to revisit this.

#154119 (comment) still represents the latest status (i.e., distributed tests are now running with CUDA 12.6, but with these 3 skips).

ngimel · 2025-06-11T21:29:47Z

torch/csrc/distributed/c10d/symm_mem/CUDASymmetricMemoryOps.cu

-  // stream rather than a CUDA thread.
-  C10_CUDA_DRIVER_CHECK(driver_api->cuStreamWriteValue32_(
+  static PFN_cuStreamWriteValue32_v2 pfn_cuStreamWriteValue32 = []() {
+    void *driver_ptr = c10::cuda::get_symbol("cuStreamWriteValue32");


The ask was to change the way we acquire driver symbols throughout the codebase, not just to fix a single driver API callsite and wait for the others to break.

@nWEIdia

Fixes #154073
Reference: NVIDIA/Fuser#4197
See PR #154097
@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.
Pull Request resolved: #156097
Approved by: https://github.com/ngimel, https://github.com/cyyever
Co-authored-by: Wei Wang <weiwan@nvidia.com>

@nWEIdia

Fixes #154073
Reference: NVIDIA/Fuser#4197
See PR #154097
@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.
Pull Request resolved: #156097
Approved by: https://github.com/ngimel
Co-authored-by: Wei Wang <weiwan@nvidia.com>

nWEIdia · 2025-06-22T21:11:15Z

This PR is no longer needed, thanks to @eee4017 for helping out and landing #156097

@nWEIdia

Fixes #154073
Reference: NVIDIA/Fuser#4197
See PR #154097
@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.
Pull Request resolved: #156097
Approved by: https://github.com/syed-ahmed, https://github.com/wujingyue, https://github.com/atalman
Co-authored-by: Wei Wang <weiwan@nvidia.com>

Reopen #156097
Fixes #154073
Reference: NVIDIA/Fuser#4197
See PR #156097 and #154097
Pull Request resolved: #158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn
Co-authored-by: Wei Wang <weiwan@nvidia.com>

Reopen #156097
Fixes #154073
Reference: NVIDIA/Fuser#4197
See PR #156097 and #154097
Pull Request resolved: #158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn
Co-authored-by: Wei Wang <weiwan@nvidia.com>
(cherry picked from commit a9f902a)

[CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295)

Reopen #156097
Fixes #154073
Reference: NVIDIA/Fuser#4197
See PR #156097 and #154097
Pull Request resolved: #158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn
(cherry picked from commit a9f902a)
Co-authored-by: Frank Lin <eee4017@gmail.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>

[CUDA] Use runtime driver API for cuStreamWriteValue32 (pytorch#158295)

Reopen pytorch#156097
Fixes pytorch#154073
Reference: NVIDIA/Fuser#4197
See PR pytorch#156097 and pytorch#154097
Pull Request resolved: pytorch#158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn
(cherry picked from commit a9f902a)
Co-authored-by: Frank Lin <eee4017@gmail.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels May 22, 2025

pytorchbot added the open source label May 22, 2025

eqy approved these changes May 22, 2025

wujingyue reviewed May 22, 2025

malfet requested changes May 22, 2025

ngimel requested changes May 22, 2025

albanD added the triaged label May 22, 2025

nWEIdia force-pushed the main-fix-symmetric-memory-api-call branch from 45ac822 to e7c04a0 June 11, 2025 00:34

nWEIdia requested a review from syed-ahmed as a code owner June 11, 2025 00:34

nWEIdia changed the title from "[CUDA] Use runtime driver API for cuStreamWriteValue32" to "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32" Jun 11, 2025

nWEIdia added 2 commits June 10, 2025 17:38

[CUDA] Use runtime driver API for cuStreamWriteValue32

beb7fdc

Refactoring driver api

f510f60

nWEIdia force-pushed the main-fix-symmetric-memory-api-call branch from e7c04a0 to f510f60 June 11, 2025 00:42

nWEIdia added 5 commits June 11, 2025 07:28

Revert "Refactoring driver api"

668c8cc

This reverts commit f510f60.

Copy TE

29f7913

Update

export get_symbol

0f36acf

Add TORCH_CHECK

9f2a996

Remove unnecessary mutex

c10c9a7

ngimel reviewed Jun 11, 2025

eee4017 mentioned this pull request Jun 16, 2025

[CUDA] Use runtime driver API for cuStreamWriteValue32 #156097

Closed

nWEIdia closed this Jun 22, 2025

eee4017 mentioned this pull request Jul 14, 2025

[CUDA] Use runtime driver API for cuStreamWriteValue32 #158295

Closed

pytorchbot mentioned this pull request Jul 17, 2025

[CUDA] Use runtime driver API for cuStreamWriteValue32 #158585

Merged

nWEIdia wants to merge 7 commits into pytorch:main from nWEIdia:main-fix-symmetric-memory-api-call

7 participants
