Let CMake handle NCCL detection instead of our handcrafted Python script.#22818
Let CMake handle NCCL detection instead of our handcrafted Python script.#22818xuhdev wants to merge 8 commits intogh/xuhdev/13/basefrom
Conversation
…ipt. - USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build options that depend on other options. - Detections using folders such as lib64, lib64/<arch> are automatically handled in find_library in CMake. - NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the standard CMake variable that specifies the root dir of a library (and therefore supports a wide range of automatic detection). E.g., see https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html
|
Hi! I'm wondering if this PR will fix my issue. I'm having problems compiling with libtorch on a Linux installation where CUDA is not in Do you think this PR handles my case? |
This PR is about NCCL detection, not CUDA. You can try to set |
… Python script." Let CMake handle NCCL detection instead of our handcrafted Python script. - USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build options that depend on other options. - Detections using folders such as lib64, lib64/<arch> are automatically handled in find_library in CMake. - NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the standard CMake variable that specifies the root dir of a library (and therefore supports a wide range of automatic detection). E.g., see https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script." Let CMake handle NCCL detection instead of our handcrafted Python script. - USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build options that depend on other options. - Detections using folders such as lib64, lib64/<arch> are automatically handled in find_library in CMake. - NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the standard CMake variable that specifies the root dir of a library (and therefore supports a wide range of automatic detection). E.g., see https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
|
After this PR the environment variable |
| SET(NCCL_LIBNAME "nccl") | ||
| ENDIF() | ||
| if (NCCL_VERSION) # Prefer the versioned library if a specific NCCL version is specified | ||
| set(CMAKE_FIND_LIBRARY_SUFFIXES ".so.${NCCL_VERSION}" ${CMAKE_FIND_LIBRARY_SUFFIXES}) |
There was a problem hiding this comment.
I'm confused about the placement of this line. It occurs inside the USE_STATIC_NCCL block. Doesn't that mean that it will never actually usefully trigger (because we'll be linking against static nccl?)
There was a problem hiding this comment.
I think this is under the else-clause? This line will be triggered only if USE_STATIC_NCCL is false.
| ${NCCL_ROOT_DIR}/lib/x86_64-linux-gnu | ||
| ${NCCL_ROOT_DIR}/lib64 | ||
| ${CUDA_TOOLKIT_ROOT_DIR}/lib64) | ||
| HINTS ${NCCL_LIB_DIR}) |
There was a problem hiding this comment.
I'm actually a little confused by what the manual says here; I think in practice this means that we will stop looking in lib64 location because cmake will only look in that directory if FIND_LIBRARY_USE_LIB64_PATHS is set (and I don't think it is set on ordinary platforms). I don't think this matters though, because I checked some of my system nccl installs and they appear to be in non-lib64 directory.
There was a problem hiding this comment.
I think FIND_LIBRARY_USE_LIB64_PATHS is set when it is proper (such as a multiarch system) to do so. From the document of find_library:
If the FIND_LIBRARY_USE_LIB64_PATHS global property is set all search paths will be tested as normal, with 64/ appended, and with all matches of lib/ replaced with lib64/. This property is automatically set for the platforms that are known to need it if at least one of the languages supported by the project() command is enabled.
ezyang
left a comment
There was a problem hiding this comment.
This looks very good but I am concerned about the loss of NCCL_ROOT_DIR envvar support. (The dynamic library version pinning is also probably broken, but that error seems pretty benign.)
… Python script." Let CMake handle NCCL detection instead of our handcrafted Python script. - USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build options that depend on other options. - Detections using folders such as lib64, lib64/<arch> are automatically handled in find_library in CMake. - NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the standard CMake variable that specifies the root dir of a library (and therefore supports a wide range of automatic detection). E.g., see https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script." Let CMake handle NCCL detection instead of our handcrafted Python script. - USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build options that depend on other options. - Detections using folders such as lib64, lib64/<arch> are automatically handled in find_library in CMake. - NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the standard CMake variable that specifies the root dir of a library (and therefore supports a wide range of automatic detection). E.g., see https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script." Let CMake handle NCCL detection instead of our handcrafted Python script. - USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build options that depend on other options. - Detections using folders such as lib64, lib64/<arch> are automatically handled in find_library in CMake. - NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the standard CMake variable that specifies the root dir of a library (and therefore supports a wide range of automatic detection). E.g., see https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script." Let CMake handle NCCL detection instead of our handcrafted Python script. - USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build options that depend on other options. - Detections using folders such as lib64, lib64/<arch> are automatically handled in find_library in CMake. - NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the standard CMake variable that specifies the root dir of a library (and therefore supports a wide range of automatic detection). E.g., see https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script." Let CMake handle NCCL detection instead of our handcrafted Python script. - USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build options that depend on other options. - Detections using folders such as lib64, lib64/<arch> are automatically handled in find_library in CMake. - NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the standard CMake variable that specifies the root dir of a library (and therefore supports a wide range of automatic detection). E.g., see https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
Do you also wanna use this findNCCL once review is finished? This copy of findNCCL is actually pretty generic. |
I think the gloo bits live in another repository, so we can't easily replace it. |
|
@ezyang could you land this when you get a chance? |
|
@ezyang Same request as Richard. Thanks. |
|
I reopened this is #22930 . For some reason this actually has already been landed but was reverted later. The stack here is actually meaningless now. |
|
Thanks, I'll land that one. |
Stack from ghstack:
options that depend on other options.
handled in find_library in CMake.
standard CMake variable that specifies the root dir of a library (and
therefore supports a wide range of automatic detection). E.g., see
https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html
How does the current code subsume all detections in the deleted
nccl.py?The dependency of
USE_NCCLon the OS andUSE_CUDAis handled as dependency options inCMakeLists.txt.The main NCCL detection happens in FindNCCL.cmake, which is called by nccl.cmake. When
USE_SYSTEM_NCCLis false, the previous Python code defer the detection tofind_package(NCCL). The change innccl.cmakeretains this.USE_STATIC_NCCLin the previous Python code simply changes the name of the detected library. This is done inIF (USE_STATIC_NCCL).Now we only need to look at how the lines below line 20 in
nccl.cmakeare subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done byfind_pathfor headers andfind_libraryfor the library files inFindNCCL.cmake.The call of find_path (Search for
NO_DEFAULT_PATHin the link) by default searches for headers in<prefix>/includefor each<prefix>inCMAKE_PREFIX_PATHandCMAKE_SYSTEM_PREFIX_PATH. Like the Python code, this commit setsCMAKE_PREFIX_PATHto search for<prefix>inNCCL_ROOT_DIRand home to CUDA.CMAKE_SYSTEM_PREFIX_PATHincludes the standard directories such as/usr/localand/usr.NCCL_INCLUDE_DIRis also specifically handled.Similarly, the call of find_library (Search for
NO_DEFAULT_PATHin the link) by default searches for libraries in directories including<prefix>/libfor each<prefix>inCMAKE_PREFIX_PATHandCMAKE_SYSTEM_PREFIX_PATH. But it also handles the edge cases intended to be solved in the Python code more properly:<prefix>/lib64(and<prefix>/lib32) if it is appropriate on the system.<prefix>/lib/<arch>for the right<arch>, unlike the Python code searches forlib/<arch>in a generic way (e.g., the Python code searches for/usr/lib/x86_64-linux-gnubut in reality systems have/usr/lib/x86_64-some-customized-name-linux-gnu, see https://unix.stackexchange.com/a/226180/38242 ).Regarding for relevant issues:
Differential Revision: D16281714