[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 #154097
nWEIdia wants to merge 7 commits into pytorch:main
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/154097
1 new failure as of commit c10c9a7 with merge base 3580b8d. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Why not fix this in c10/cuda/driver_api.h for all driver APIs?
- Please don't use `std::once_flag`; with C++11 there is no need for it anymore.
- Instead of fixing it for just one call, please update the entire mechanism for `c10::cuda::DriverAPI` if using this API is better than doing dlopen (which seems to be the case).
Please don't use std::once_flag; instead rely on the C++11 built-in mechanism (thread-safe, one-time initialization of function-local statics) by using something like:
static auto pfn_cuStreamWriteValue32 = []() {
// Move all the initialization logic here, which is guaranteed to be called once the first time function is called
}();
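A minimal, self-contained sketch of the pattern suggested above. `WriteFn`, `fake_write_value32`, `resolve_symbol`, and `g_resolve_calls` are hypothetical stand-ins introduced only for illustration; they are not PyTorch or CUDA names. The point is that the initializer of a function-local static runs exactly once (and, since C++11, safely under concurrent first calls), so no `std::once_flag` is needed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical function-pointer type standing in for a driver entry point.
using WriteFn = int (*)(std::uint32_t* dst, std::uint32_t value);

// Counts how many times the (potentially expensive) lookup runs.
static std::atomic<int> g_resolve_calls{0};

// Hypothetical stand-in for the real driver function.
static int fake_write_value32(std::uint32_t* dst, std::uint32_t value) {
  *dst = value;
  return 0;  // 0 plays the role of a success code
}

// Hypothetical stand-in for a symbol lookup.
static WriteFn resolve_symbol() {
  g_resolve_calls.fetch_add(1);
  return &fake_write_value32;
}

int write_value32(std::uint32_t* dst, std::uint32_t value) {
  // The lambda runs once, on the first call that reaches this line.
  // Later calls reuse the cached pointer.
  static WriteFn fn = []() { return resolve_symbol(); }();
  assert(fn != nullptr);
  return fn(dst, value);
}
```

Calling `write_value32` repeatedly performs the lookup only on the first call; every later call goes straight through the cached pointer.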
ngimel left a comment
This should be fixed in c10::cuda::DriverAPI::get(), there is a reason we have a common wrapper for driver API to not have this spaghetti code at callsites.
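The idea of a common wrapper can be sketched roughly as follows. This is not the actual `c10::cuda::DriverAPI` implementation: `DriverTable`, `lookup_symbol`, and the two fake functions are hypothetical names used only for illustration, and plain integers replace the real CUDA types. The sketch shows all symbol resolution happening in one place, so call sites only ever go through `get()`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical driver-style signatures (plain integers instead of CUDA types).
using WriteValue32Fn = int (*)(std::uint32_t* dst, std::uint32_t value);
using WaitValue32Fn = int (*)(const std::uint32_t* src, std::uint32_t expected);

// Hypothetical stand-ins for functions that would live in the driver library.
static int fake_write(std::uint32_t* dst, std::uint32_t value) {
  *dst = value;
  return 0;
}
static int fake_wait(const std::uint32_t* src, std::uint32_t expected) {
  return *src == expected ? 0 : 1;
}

// Hypothetical single lookup point. Swapping the acquisition strategy
// (for example dlopen versus a runtime entry-point query) would only
// require changing this one function.
static void* lookup_symbol(const char* name) {
  if (std::strcmp(name, "write_value32") == 0) {
    return reinterpret_cast<void*>(&fake_write);
  }
  if (std::strcmp(name, "wait_value32") == 0) {
    return reinterpret_cast<void*>(&fake_wait);
  }
  return nullptr;
}

struct DriverTable {
  WriteValue32Fn write_value32;
  WaitValue32Fn wait_value32;

  // Every call site goes through get(); the table is filled exactly once.
  static DriverTable* get() {
    static DriverTable table = []() {
      DriverTable t{};
      t.write_value32 =
          reinterpret_cast<WriteValue32Fn>(lookup_symbol("write_value32"));
      t.wait_value32 =
          reinterpret_cast<WaitValue32Fn>(lookup_symbol("wait_value32"));
      assert(t.write_value32 != nullptr && t.wait_value32 != nullptr);
      return t;
    }();
    return &table;
  }
};
```

With this shape, a call site reads `DriverTable::get()->write_value32(&flag, 42u)` and never needs to know how the pointer was obtained.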
Any update on this one? What is the current status of distributed job on 12.6? cc @atalman
I was mostly working on #154448 and am planning to revisit this. #154119 (comment) still represents the latest status (i.e. distributed tests are now running with cuda 12.6, but with these 3 skips).
45ac822 to e7c04a0
e7c04a0 to f510f60
This reverts commit f510f60.
// stream rather than a CUDA thread.
C10_CUDA_DRIVER_CHECK(driver_api->cuStreamWriteValue32_(
static PFN_cuStreamWriteValue32_v2 pfn_cuStreamWriteValue32 = []() {
  void *driver_ptr = c10::cuda::get_symbol("cuStreamWriteValue32");
The ask was to change the way we acquire driver symbols throughout the codebase, not just fix a single callsite for the driver API and wait for the others to break.
Fixes #154073 Reference: NVIDIA/Fuser#4197 See PR #154097 @nWEIdia is currently out of the office, so I’ve temporarily taken over his work. Pull Request resolved: #156097 Approved by: https://github.com/ngimel, https://github.com/cyyever Co-authored-by: Wei Wang <weiwan@nvidia.com>
Fixes #154073 Reference: NVIDIA/Fuser#4197 See PR #154097 @nWEIdia is currently out of the office, so I’ve temporarily taken over his work. Pull Request resolved: #156097 Approved by: https://github.com/ngimel Co-authored-by: Wei Wang <weiwan@nvidia.com>
Fixes #154073 Reference: NVIDIA/Fuser#4197 See PR #154097 @nWEIdia is currently out of the office, so I’ve temporarily taken over his work. Pull Request resolved: #156097 Approved by: https://github.com/syed-ahmed, https://github.com/wujingyue, https://github.com/atalman Co-authored-by: Wei Wang <weiwan@nvidia.com>
Reopen #156097 Fixes #154073 Reference: NVIDIA/Fuser#4197 See PR #156097 and #154097 Pull Request resolved: #158295 Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn Co-authored-by: Wei Wang <weiwan@nvidia.com>
Reopen #156097 Fixes #154073 Reference: NVIDIA/Fuser#4197 See PR #156097 and #154097 Pull Request resolved: #158295 Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn Co-authored-by: Wei Wang <weiwan@nvidia.com> (cherry picked from commit a9f902a)
[CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295) Reopen #156097 Fixes #154073 Reference: NVIDIA/Fuser#4197 See PR #156097 and #154097 Pull Request resolved: #158295 Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn (cherry picked from commit a9f902a) Co-authored-by: Frank Lin <eee4017@gmail.com> Co-authored-by: Wei Wang <weiwan@nvidia.com>
[CUDA] Use runtime driver API for cuStreamWriteValue32 (pytorch#158295) Reopen pytorch#156097 Fixes pytorch#154073 Reference: NVIDIA/Fuser#4197 See PR pytorch#156097 and pytorch#154097 Pull Request resolved: pytorch#158295 Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn (cherry picked from commit a9f902a) Co-authored-by: Frank Lin <eee4017@gmail.com> Co-authored-by: Wei Wang <weiwan@nvidia.com>
Fixes #154073
Reference: NVIDIA/Fuser#4197
Local test results on T4:
root@ab7cb4533ac9:/my_workspace/wei-pytorch# python test/distributed/test_symmetric_memory.py SymmMemSingleProcTest.test_stream_write_value32 -v
test_stream_write_value32 (__main__.SymmMemSingleProcTest.test_stream_write_value32) ... ok
Ran 1 test in 0.186s
OK
The implementation is copied from https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/util/cuda_driver.cpp and NVIDIA/TransformerEngine#1835 (cc @ptrendx).
This PR modifies PyTorch to use the runtime driver API for the cuStreamWriteValue32 function.
Here are the key changes:
This function:
- Uses cudaGetDriverEntryPoint and cudaGetDriverEntryPointByVersion to dynamically load CUDA driver functions
- Searches for driver entry points in the current runtime context
- Provides version-aware symbol loading for better compatibility
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @eqy @tinglvv @atalman @malfet @huydhn @wujingyue