Let CMake handle NCCL detection instead of our handcrafted Python script.#22930
Let CMake handle NCCL detection instead of our handcrafted Python script.#22930xuhdev wants to merge 3 commits intopytorch:masterfrom
Conversation
|
Waiting for CI... This is identical to #22818 |
|
@ezyang This PR works quite well on PR CIs, but fails on those that are only run on master. Is there a way to see what's happening in those environments? |
|
@xuhdev There is only a janky way, which is to edit |
8c162a3 to
7ac63e7
Compare
d998fda to
f9ae21f
Compare
…ipt. --- How does the current code subsume all detections in the deleted `nccl.py`? - The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`. - The main NCCL detection happens in [FindNCCL.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/Modules/FindNCCL.cmake), which is called by [nccl.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/External/nccl.cmake). When `USE_SYSTEM_NCCL` is false, the previous Python code defer the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this. - `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`. - Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`. * The call of [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and home to CUDA. `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled. * Similarly, the call of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. But it also handles the edge cases intended to be solved in the Python code more properly: - It only searches for `<prefix>/lib64` (and `<prefix>/lib32`) if it is appropriate on the system. - It only searches for `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code searches for `lib/<arch>` in a generic way (e.g., the Python code searches for `/usr/lib/x86_64-linux-gnu` but in reality systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242 ). --- Regarding for relevant issues: - pytorch#12063 and pytorch#2877: These are properly handled, as explained in the updated comment. - pytorch#2941 does not changes NCCL detection specifically for Windows (it changed CUDA detection). - b7e258f A versioned library detection is added, but the order is reversed: The unversioned library becomes preferred. This is because normally unversioned libraries are linked to versioned libraries and preferred by users, and local installation by users are often unversioned. Like the document of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests: > When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.
[ci all]
f9ae21f to
ceb772a
Compare
|
@ezyang The PR should be ready now. The error before was caused by the default value of |
|
There are currently two commits in this PR. The first one is the real content and the second one is CI tweak. |
|
This LGTM. I'll go ahead and remove the CI twiddle and land. |
This reverts commit ceb772a.
facebook-github-bot
left a comment
There was a problem hiding this comment.
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…ipt. (#22930) Summary: --- How does the current code subsume all detections in the deleted `nccl.py`? - The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`. - The main NCCL detection happens in [FindNCCL.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/Modules/FindNCCL.cmake), which is called by [nccl.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/External/nccl.cmake). When `USE_SYSTEM_NCCL` is false, the previous Python code defer the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this. - `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`. - Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`. * The call of [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and home to CUDA. `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled. * Similarly, the call of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. But it also handles the edge cases intended to be solved in the Python code more properly: - It only searches for `<prefix>/lib64` (and `<prefix>/lib32`) if it is appropriate on the system. - It only searches for `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code searches for `lib/<arch>` in a generic way (e.g., the Python code searches for `/usr/lib/x86_64-linux-gnu` but in reality systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242 ). --- Regarding for relevant issues: - pytorch/pytorch#12063 and pytorch/pytorch#2877: These are properly handled, as explained in the updated comment. - pytorch/pytorch#2941 does not changes NCCL detection specifically for Windows (it changed CUDA detection). - b7e258f81ef61d19b884194cdbcd6c7089636d46 A versioned library detection is added, but the order is reversed: The unversioned library becomes preferred. This is because normally unversioned libraries are linked to versioned libraries and preferred by users, and local installation by users are often unversioned. Like the document of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests: > When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions. Pull Request resolved: pytorch/pytorch#22930 Differential Revision: D16440275 Pulled By: ezyang fbshipit-source-id: 11fe80743d4fe89b1ed6f96d5d996496e8ec01aa
|
@xuhdev Do I interpret the below change correctly to mean that USE_SYSTEM_NCCL will always be set to OFF by cmake, unless the user explicitly sets it to ON during the build using the USE_SYSTEM_NCCL=1 env variable? |
|
@jithunnair-amd What you said is right, but both for before and after this change. The changed lines you have quoted also make sure that the option |
|
@xuhdev Need some cmake expertise: in the file CMakeLists.txt, I need to ensure that USE_NCCL can be set for either CUDA or ROCM builds (for which the option will enable the RCCL library instead). How do I change the below to enable that? |
|
@jithunnair-amd The cleaning way I think is to add a if (USE_CUDA OR USE_ROCM)
set(USE_GPU TRUE)
else()
set(USE_GPU FALSE)
endif()Then replace |
How does the current code subsume all detections in the deleted
nccl.py?The dependency of
USE_NCCLon the OS andUSE_CUDAis handled as dependency options inCMakeLists.txt.The main NCCL detection happens in FindNCCL.cmake, which is called by nccl.cmake. When
USE_SYSTEM_NCCLis false, the previous Python code defer the detection tofind_package(NCCL). The change innccl.cmakeretains this.USE_STATIC_NCCLin the previous Python code simply changes the name of the detected library. This is done inIF (USE_STATIC_NCCL).Now we only need to look at how the lines below line 20 in
nccl.cmakeare subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done byfind_pathfor headers andfind_libraryfor the library files inFindNCCL.cmake.The call of find_path (Search for
NO_DEFAULT_PATHin the link) by default searches for headers in<prefix>/includefor each<prefix>inCMAKE_PREFIX_PATHandCMAKE_SYSTEM_PREFIX_PATH. Like the Python code, this commit setsCMAKE_PREFIX_PATHto search for<prefix>inNCCL_ROOT_DIRand home to CUDA.CMAKE_SYSTEM_PREFIX_PATHincludes the standard directories such as/usr/localand/usr.NCCL_INCLUDE_DIRis also specifically handled.Similarly, the call of find_library (Search for
NO_DEFAULT_PATHin the link) by default searches for libraries in directories including<prefix>/libfor each<prefix>inCMAKE_PREFIX_PATHandCMAKE_SYSTEM_PREFIX_PATH. But it also handles the edge cases intended to be solved in the Python code more properly:<prefix>/lib64(and<prefix>/lib32) if it is appropriate on the system.<prefix>/lib/<arch>for the right<arch>, unlike the Python code searches forlib/<arch>in a generic way (e.g., the Python code searches for/usr/lib/x86_64-linux-gnubut in reality systems have/usr/lib/x86_64-some-customized-name-linux-gnu, see https://unix.stackexchange.com/a/226180/38242 ).Regarding for relevant issues: