[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32#154097

Closed
nWEIdia wants to merge 7 commits into pytorch:main from nWEIdia:main-fix-symmetric-memory-api-call

@nWEIdia
Collaborator

@nWEIdia nWEIdia commented May 22, 2025

Fixes #154073

Reference: NVIDIA/Fuser#4197

Local test results on T4:

root@ab7cb4533ac9:/my_workspace/wei-pytorch# python test/distributed/test_symmetric_memory.py SymmMemSingleProcTest.test_stream_write_value32 -v
test_stream_write_value32 (main.SymmMemSingleProcTest.test_stream_write_value32) ... ok


Ran 1 test in 0.186s

OK

Copying the implementation from: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/util/cuda_driver.cpp and NVIDIA/TransformerEngine#1835 (cc @ptrendx)

This PR modifies PyTorch to use the runtime driver API for the cuStreamWriteValue32 function.
Here are the key changes:

  1. Added get_symbol Function (c10/cuda/driver_api.cpp & .h)
    This function:
    Uses cudaGetDriverEntryPoint and cudaGetDriverEntryPointByVersion to dynamically load CUDA driver functions
    Searches for driver entry points in the current runtime context
    Provides version-aware symbol loading for better compatibility
  2. Updated cuStreamWriteValue32 Implementation (CUDASymmetricMemoryOps.cu)
  3. Removed Test Skip Condition (test_symmetric_memory.py)
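
The version-aware lookup described in item 1 can be sketched without a GPU. The following is a minimal, hypothetical model, not the PR's code: `fake_driver_exports`, `get_symbol_sketch`, the two stand-in functions, and the version numbers are all invented for illustration, and the table lookup only stands in for what `cudaGetDriverEntryPointByVersion` is assumed to do inside the real driver.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// Hypothetical model of a driver export table. Each symbol name maps to the
// entry points it has shipped, keyed by an illustrative (made-up) version
// number at which that entry point became the default.
using EntryPoint = int (*)();
using VersionedEntryPoints = std::map<unsigned, EntryPoint>;

int legacy_write_value32() { return 1; }  // stand-in for an older ABI
int modern_write_value32() { return 2; }  // stand-in for a newer ABI

const std::map<std::string, VersionedEntryPoints>& fake_driver_exports() {
  static const std::map<std::string, VersionedEntryPoints> table = {
      {"cuStreamWriteValue32",
       {{8000u, &legacy_write_value32}, {11070u, &modern_write_value32}}}};
  return table;
}

// Returns the newest entry point for `name` that is not newer than
// `cuda_version`, or nullptr if the symbol is unknown or too new.
EntryPoint get_symbol_sketch(const std::string& name, unsigned cuda_version) {
  assert(!name.empty());
  const auto& table = fake_driver_exports();
  const auto symbol = table.find(name);
  if (symbol == table.end()) {
    return nullptr;
  }
  // upper_bound finds the first entry strictly newer than the request;
  // the entry just before it is the best match.
  const auto newer = symbol->second.upper_bound(cuda_version);
  if (newer == symbol->second.begin()) {
    return nullptr;
  }
  return std::prev(newer)->second;
}
```

Under these assumptions, a request for version 12060 resolves to the newer entry point, while a request for 9000 resolves to the legacy one, which is the kind of mismatch a version-aware lookup is meant to avoid.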

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @eqy @tinglvv @atalman @malfet @huydhn @wujingyue

@pytorch-bot

pytorch-bot bot commented May 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154097

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit c10c9a7 with merge base 3580b8d (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels May 22, 2025

Why not fix this in c10/cuda/driver_api.h for all driver APIs?

Contributor

@malfet malfet left a comment

  • Please don't use std::once_flag; with C++11 there is no need for it anymore.
  • Instead of fixing it for just one call, please update the entire mechanism for c10::cuda::DriverAPI if using this API is better than doing dlopen (which seems to be the case).

Contributor

Please don't use std::once_flag; instead, rely on the C++11 built-in mechanism by using something like:

static auto pfn_cuStreamWriteValue32 = []() {
   // Move all the initialization logic here, which is guaranteed to be called once the first time function is called
}();
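
A minimal, self-contained sketch of the pattern suggested above, assuming a hypothetical `resolve_write_fn` in place of the real driver lookup (every name here is illustrative and none is a PyTorch API). Since C++11, a function-local static is initialized exactly once, and concurrent callers wait for that initialization to finish, so no `std::once_flag` is required:

```cpp
#include <cassert>

using WriteFn = int (*)(int);

// Counts how often the (pretend) expensive symbol lookup runs.
int g_lookup_calls = 0;

// Hypothetical stand-in for the resolved driver function.
int fake_stream_write(int value) { return value + 1; }

// Hypothetical stand-in for resolving a driver symbol at runtime.
WriteFn resolve_write_fn() {
  ++g_lookup_calls;
  return &fake_stream_write;
}

int write_value(int value) {
  // The immediately invoked lambda runs only on the first call; later calls
  // reuse the cached pointer.
  static const WriteFn fn = []() {
    WriteFn resolved = resolve_write_fn();
    assert(resolved != nullptr);
    return resolved;
  }();
  return fn(value);
}
```

Calling `write_value` repeatedly leaves `g_lookup_calls` at 1, because the lambda body is only evaluated during the first call.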

Collaborator

@ngimel ngimel left a comment

This should be fixed in c10::cuda::DriverAPI::get(), there is a reason we have a common wrapper for driver API to not have this spaghetti code at callsites.
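
One way to read this request is a single table of driver entry points that is resolved in one place and handed out through `get()`. The sketch below assumes that shape; `DriverApiSketch`, `SKETCH_DRIVER_API`, `lookup_symbol`, and the two `fake...` functions are hypothetical stand-ins, and it is not the actual `c10::cuda::DriverAPI` implementation.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for driver functions.
int fakeStreamWriteValue32(int value) { return value; }
int fakeStreamWaitValue32(int value) { return -value; }

// Single list of every symbol the wrapper exposes (an X-macro).
#define SKETCH_DRIVER_API(_) \
  _(fakeStreamWriteValue32)  \
  _(fakeStreamWaitValue32)

// Hypothetical resolver. A real wrapper would ask the driver or runtime for
// the entry point here, so changing the lookup strategy touches one function.
void* lookup_symbol(const char* name) {
#define SKETCH_MATCH(sym)                 \
  if (std::strcmp(name, #sym) == 0) {     \
    return reinterpret_cast<void*>(&sym); \
  }
  SKETCH_DRIVER_API(SKETCH_MATCH)
#undef SKETCH_MATCH
  return nullptr;
}

struct DriverApiSketch {
  // One function-pointer member per listed symbol, e.g. fakeStreamWriteValue32_.
#define SKETCH_MEMBER(sym) decltype(&sym) sym##_ = nullptr;
  SKETCH_DRIVER_API(SKETCH_MEMBER)
#undef SKETCH_MEMBER

  // Resolves all symbols once, on first use, and returns the shared table.
  static DriverApiSketch* get() {
    static DriverApiSketch instance = [] {
      DriverApiSketch api;
#define SKETCH_RESOLVE(sym)                                           \
  api.sym##_ = reinterpret_cast<decltype(&sym)>(lookup_symbol(#sym)); \
  assert(api.sym##_ != nullptr);
      SKETCH_DRIVER_API(SKETCH_RESOLVE)
#undef SKETCH_RESOLVE
      return api;
    }();
    return &instance;
  }
};
```

With this layout, call sites only ever write `DriverApiSketch::get()->fakeStreamWriteValue32_(...)`, and adding a symbol means adding one line to the list.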

@albanD albanD added the triaged label May 22, 2025
@ngimel
Collaborator

ngimel commented Jun 4, 2025

Any update on this one? What is the current status of distributed job on 12.6? cc @atalman

@nWEIdia
Collaborator Author

nWEIdia commented Jun 4, 2025

> Any update on this one? What is the current status of distributed job on 12.6? cc @atalman

I was mostly working on #154448 and am planning to revisit this.

#154119 (comment) still represents the latest status (i.e. distributed tests are now running with cuda 12.6, but with these 3 skips).

@nWEIdia nWEIdia force-pushed the main-fix-symmetric-memory-api-call branch from 45ac822 to e7c04a0 Compare June 11, 2025 00:34
@nWEIdia nWEIdia requested a review from syed-ahmed as a code owner June 11, 2025 00:34
@nWEIdia nWEIdia changed the title from "[CUDA] Use runtime driver API for cuStreamWriteValue32" to "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32" Jun 11, 2025
@nWEIdia nWEIdia force-pushed the main-fix-symmetric-memory-api-call branch from e7c04a0 to f510f60 Compare June 11, 2025 00:42
// stream rather than a CUDA thread.
C10_CUDA_DRIVER_CHECK(driver_api->cuStreamWriteValue32_(
static PFN_cuStreamWriteValue32_v2 pfn_cuStreamWriteValue32 = []() {
void *driver_ptr = c10::cuda::get_symbol("cuStreamWriteValue32");
Collaborator

The ask was to change the way we acquire driver symbols throughout the codebase, not to fix a single driver API call site and wait for the others to break.

pytorchmergebot pushed a commit that referenced this pull request Jun 17, 2025
Fixes  #154073

Reference: NVIDIA/Fuser#4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: #156097
Approved by: https://github.com/ngimel, https://github.com/cyyever

Co-authored-by: Wei Wang <weiwan@nvidia.com>
pytorchmergebot pushed a commit that referenced this pull request Jun 21, 2025
Fixes  #154073

Reference: NVIDIA/Fuser#4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: #156097
Approved by: https://github.com/ngimel

Co-authored-by: Wei Wang <weiwan@nvidia.com>
@nWEIdia
Collaborator Author

nWEIdia commented Jun 22, 2025

This PR is no longer needed, thanks to @eee4017 for helping out and landing #156097

@nWEIdia nWEIdia closed this Jun 22, 2025
pytorchmergebot pushed a commit that referenced this pull request Jul 10, 2025
Fixes  #154073

Reference: NVIDIA/Fuser#4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: #156097
Approved by: https://github.com/syed-ahmed, https://github.com/wujingyue, https://github.com/atalman

Co-authored-by: Wei Wang <weiwan@nvidia.com>
pytorchmergebot pushed a commit that referenced this pull request Jul 16, 2025
pytorchbot pushed a commit that referenced this pull request Jul 17, 2025
Reopen #156097

Fixes #154073

Reference: NVIDIA/Fuser#4197

See PR #156097 and #154097

Pull Request resolved: #158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn

Co-authored-by: Wei Wang <weiwan@nvidia.com>
(cherry picked from commit a9f902a)
atalman pushed a commit that referenced this pull request Jul 18, 2025
[CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295)

Reopen #156097

Fixes #154073

Reference: NVIDIA/Fuser#4197

See PR #156097 and #154097

Pull Request resolved: #158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn


(cherry picked from commit a9f902a)

Co-authored-by: Frank Lin <eee4017@gmail.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>
tvukovic-amd pushed a commit to ROCm/pytorch that referenced this pull request Aug 20, 2025
[CUDA] Use runtime driver API for cuStreamWriteValue32 (pytorch#158295)

Reopen pytorch#156097

Fixes pytorch#154073

Reference: NVIDIA/Fuser#4197

See PR pytorch#156097 and pytorch#154097

Pull Request resolved: pytorch#158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn


(cherry picked from commit a9f902a)

Co-authored-by: Frank Lin <eee4017@gmail.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>

Labels

oncall: distributed, open source, release notes: distributed (c10d), triaged

Development

Successfully merging this pull request may close these issues.

RuntimeError: CUDA driver error: operation not supported with test_stream_write_value32 and cuStreamWriteValue32

7 participants