Error with driver API's lazy load of cuStream ops #3907

@samnordmann

Description

nvBug

related Team's thread
Other related Team's thread

I am trying to use cuStreamWriteValue32, which is part of the CUDA driver API (context: #3894). Although the code builds, I get the runtime error CUDA_ERROR_NOT_SUPPORTED. The call should be supported, since I am using a DGX H100 node with CUDA 12.8 inside the latest pjnl Docker container.
Repro:

The source of the problem can be narrowed down to lazily loading /usr/local/cuda/compat/lib.real/libcuda.so.1 in the pjnl container: the bug comes either from the lazy loading or from the library itself.

As evidence, note that the following patch (which links against cuda explicitly instead of loading it lazily) makes the error go away:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9d7d7b32..3e51bce8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -334,6 +334,7 @@ target_link_libraries(codegen_internal PUBLIC
   ${LIBCUPTI}
   ${TORCH_LIBRARIES}
   dl
+  cuda
)

add_library(nvfuser_codegen SHARED $<TARGET_OBJECTS:codegen_internal>)
diff --git a/csrc/driver_api.h b/csrc/driver_api.h
index 41072a22..b8c413a4 100644
--- a/csrc/driver_api.h
+++ b/csrc/driver_api.h
@@ -37,7 +37,6 @@ namespace nvfuser {
#if (CUDA_VERSION >= 12000)
#define ALL_DRIVER_API_WRAPPER(fn)   \
   ALL_DRIVER_API_WRAPPER_CUDA11(fn); \
-  fn(cuStreamWriteValue32);          \
   fn(cuTensorMapEncodeTiled)
#else
#define ALL_DRIVER_API_WRAPPER ALL_DRIVER_API_WRAPPER_CUDA11
diff --git a/tests/cpp/test_gpu3.cpp b/tests/cpp/test_gpu3.cpp
index 9570bb9b..a7236c6a 100644
--- a/tests/cpp/test_gpu3.cpp
+++ b/tests/cpp/test_gpu3.cpp
@@ -56,7 +56,8 @@
#include <sstream>
#include "parallel_dimension_map.h"

-#include <driver_api.h>
+// #include <driver_api.h>
+#include <cuda.h>
namespace nvfuser {

using namespace at::indexing;

Note also that cuda-gdb gives the following backtrace for the error:

#0  0x00007fff37f740f0 in cudbgReportDriverApiError () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#1  0x00007fff381e312b in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#2  0x00007fff2f4c0d47 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#3  0x00007fff2f49c29e in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#4  0x00007fff2f4af56d in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#5  0x00007fff2f5aebd6 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#6  0x00007fff380c05d0 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x0000555555a67b3e in lazilyLoadAndInvoke (args#0=0x7fff2ad0d618, args#1=140724802682880, args#2=3, args#3=0) at /opt/pytorch/Fuser2/csrc/driver_api.cpp:95
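
For readers unfamiliar with the pattern behind frame #7 (lazilyLoadAndInvoke), here is a minimal sketch of lazily loading a function by name with dlopen/dlsym and then invoking it. It is not the code in csrc/driver_api.cpp: the names LazySymbol and lazyAtoi are hypothetical, and atoi from the already-loaded C library stands in for a driver entry point so the sketch runs without a GPU. In the setting of this issue, the library would be libcuda.so.1 and the symbol cuStreamWriteValue32.

```cpp
#include <dlfcn.h>

#include <stdexcept>
#include <string>

// Hypothetical helper (not nvFuser's implementation): resolves `symbol`
// on the first call, caches the pointer, and forwards the arguments.
// Fn must be a function pointer type, e.g. int (*)(const char*).
template <typename Fn>
class LazySymbol {
 public:
  // A null `library` means "search the symbols already loaded into the process".
  LazySymbol(const char* library, const char* symbol)
      : use_process_(library == nullptr),
        library_(library ? library : ""),
        symbol_(symbol) {}

  template <typename... Args>
  auto operator()(Args... args) {
    if (fn_ == nullptr) {
      load();  // only happens on the first invocation
    }
    return fn_(args...);
  }

  bool loaded() const { return fn_ != nullptr; }

 private:
  void load() {
    void* handle = dlopen(use_process_ ? nullptr : library_.c_str(), RTLD_LAZY);
    if (handle == nullptr) {
      const char* err = dlerror();
      throw std::runtime_error(err ? err : "dlopen failed");
    }
    // dlsym looks up exactly this exported name; no header macros are applied.
    void* sym = dlsym(handle, symbol_.c_str());
    if (sym == nullptr) {
      throw std::runtime_error("symbol not found: " + symbol_);
    }
    fn_ = reinterpret_cast<Fn>(sym);
  }

  bool use_process_;
  std::string library_;
  std::string symbol_;
  Fn fn_ = nullptr;
};

// Stand-in usage: lazily bind the C library's atoi instead of a CUDA symbol.
inline int lazyAtoi(const char* text) {
  static LazySymbol<int (*)(const char*)> fn(nullptr, "atoi");
  return fn(text);
}
```

One difference from direct linking that may be worth checking (an assumption, not something this issue confirms): a by-name dlsym lookup resolves exactly the string it is given, whereas code compiled against cuda.h can have API names remapped by header macros to versioned symbols, so the two paths are not guaranteed to reach the same entry point.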
