Fix detection of NCCL_INCLUDE_DIR by ducksoup · Pull Request #2877 · pytorch/pytorch

ducksoup · 2017-09-27T13:20:01Z

NCCL_INCLUDE_DIR is now correctly detected when libnccl* is installed in /usr/lib/x86_64-linux-gnu/, /usr/lib/powerpc64le-linux-gnu/ or /usr/lib/aarch64-linux-gnu/.

The new code allows the headers to reside in any include folder which is a sub-folder of any parent of NCCL_LIB_DIR.
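For readers who want to see the idea rather than the diff, the search described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `find_include_dir` and the choice to stop at the first match are assumptions.

```python
import os


def find_include_dir(lib_dir, header="nccl.h"):
    """Return the first <ancestor>/include that contains `header`, or None.

    Hypothetical helper: walks upwards from the parent of `lib_dir` and
    checks each ancestor for an `include` sub-folder holding the header.
    """
    current = os.path.dirname(os.path.abspath(lib_dir))
    while True:
        candidate = os.path.join(current, "include")
        if os.path.isfile(os.path.join(candidate, header)):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root, nothing found
            return None
        current = parent
```

With the library in `/usr/lib/x86_64-linux-gnu`, this checks `/usr/lib/include`, then `/usr/include`, then `/include`, so the deb layout resolves on the second step.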


soumith · 2017-09-27T14:58:27Z

the problem is that this is currently not allowed by gloo's cmake detection, which is why I enforced that NCCL_INCLUDE_DIR="$NCCL_LIB_DIR/../include" or $NCCL_ROOT_DIR/include.

Out of all the post-fixes you posted, the only one allowed in the NCCL detection for gloo to have a subdir is ${NCCL_ROOT_DIR}/lib/x86_64-linux-gnu
See: https://github.com/facebookincubator/gloo/blob/master/cmake/Modules/Findnccl.cmake#L22

I think this can be easily fixed by sending a PR to gloo first that allows it to specify NCCL_INCLUDE_DIR and NCCL_LIB_DIR separately as well, instead of just NCCL_ROOT_DIR

cc: @pietern

ducksoup · 2017-09-27T15:29:55Z

I guess there are three separate issues here:

1. nccl.py is not able to correctly detect nccl when the lib is in any of the /usr/lib/${ARCH}-linux-gnu folders and the header in /usr/local (this is the default when installing nccl 2 from the deb package in Ubuntu).
2. gloo's current cmake only handles x86_64-linux-gnu, but fails to locate nccl in powerpc64le-linux-gnu and aarch64-linux-gnu.
3. gloo's current cmake doesn't have the option for manually specified NCCL_INCLUDE_DIR and NCCL_LIB_DIR.

I'm not entirely sure how to solve (3) as I've zero experience with cmake, but I could modify this PR to only solve (1) and create a PR for gloo to solve (2).

pietern · 2017-09-27T16:00:27Z

@soumith @ducksoup Easy enough, let me change that. It only checks the x86_64-linux-gnu directory because that's the tree that's packaged in the tarball. Prefixing NCCL_INCLUDE_DIR and NCCL_LIB_DIR looks like a good solution. It won't matter anymore where you store it.

pietern · 2017-09-27T17:31:35Z

@ducksoup Can you verify that the linked PR solved the issue for you?

Summary: It is more flexible to have both XYZ_INCLUDE_DIR and XYZ_LIB_DIR for each dependency instead of just XYZ_ROOT_DIR. The gloo build should work with whatever directory structure is used for dependencies. This change also slightly changes the CMake variable naming to remove some cognitive overhead regarding upcase/downcase variables and multiple messages in the CMake output. Also see pytorch/pytorch#2877 Closes #91 Differential Revision: D5925849 Pulled By: pietern fbshipit-source-id: 66bf7699f736f51e197366c61a9d946bdf1ccec4

ngimel · 2017-09-27T23:44:45Z

Can we get this merged? In the current state there's no way to use deb-installed nccl, because after #2853 nccl.h is supposed to be in ../include/ relative to where the library is, and that's not the case for deb-installed nccl (nccl.h is in /usr/include, libnccl is in /usr/lib/(ARCH)-linux-gnu).

pytorch/tools/setup_helpers/nccl.py

Lines 41 to 44 in ce4c441

    if os.getenv('NCCL_INCLUDE_DIR') is not None:
        warnings.warn("Ignoring environment variable NCCL_INCLUDE_DIR because "
                      "NCCL_INCLUDE_DIR is implicitly assumed as "
                      "$NCCL_ROOT_DIR/include or $NCCL_LIB_DIR/../include")
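To make the mismatch concrete, here is a small sketch of the `$NCCL_LIB_DIR/../include` rule quoted in the warning. The helper name `implied_include_dir` and the `/opt/nccl` path are made up for illustration; only the rule itself comes from the snippet above.

```python
import posixpath


def implied_include_dir(lib_dir):
    # Hypothetical helper applying the rule from the warning text:
    # headers are assumed to live in $NCCL_LIB_DIR/../include.
    return posixpath.normpath(posixpath.join(lib_dir, "..", "include"))


# Tarball-style layout (illustrative path): lib and include are siblings.
print(implied_include_dir("/opt/nccl/lib"))              # /opt/nccl/include

# deb layout: the rule points one level too deep, not at /usr/include.
print(implied_include_dir("/usr/lib/x86_64-linux-gnu"))  # /usr/lib/include
```

The second call shows why a deb-installed nccl.h in /usr/include is never found under that assumption.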

ducksoup · 2017-09-28T07:27:41Z

@pietern Confirmed, issue is solved with the linked PR.

ducksoup · 2017-09-29T12:52:00Z

@soumith @ngimel PR updated to follow a similar logic to the new one in gloo. This builds successfully with nccl 2 and CUDA 9 on Ubuntu 16.04, potentially solving #2755
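As a rough picture of what independent NCCL_LIB_DIR and NCCL_INCLUDE_DIR handling could look like, consider the sketch below. It is not the code from this PR or from gloo: the function `resolve_nccl_dirs`, the precedence of explicit variables over NCCL_ROOT_DIR, and the `lib`/`include` fallbacks are all assumptions made for illustration.

```python
import posixpath


def resolve_nccl_dirs(env):
    """Resolve (lib_dir, include_dir) from a mapping of environment variables.

    Assumed precedence: an explicitly set NCCL_LIB_DIR or NCCL_INCLUDE_DIR
    wins; otherwise fall back to sub-folders of NCCL_ROOT_DIR, if given.
    """
    root = env.get("NCCL_ROOT_DIR")
    lib_dir = env.get("NCCL_LIB_DIR")
    include_dir = env.get("NCCL_INCLUDE_DIR")
    if lib_dir is None and root is not None:
        lib_dir = posixpath.join(root, "lib")
    if include_dir is None and root is not None:
        include_dir = posixpath.join(root, "include")
    return lib_dir, include_dir
```

For the deb layout one would pass `{"NCCL_LIB_DIR": "/usr/lib/x86_64-linux-gnu", "NCCL_INCLUDE_DIR": "/usr/include"}` (or `os.environ`), and neither directory has to be derived from the other.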

soumith · 2017-09-29T14:42:19Z

thanks @ducksoup !

…ipt.

---

How does the current code subsume all detections in the deleted `nccl.py`?

- The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`.
- The main NCCL detection happens in [FindNCCL.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/Modules/FindNCCL.cmake), which is called by [nccl.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/External/nccl.cmake). When `USE_SYSTEM_NCCL` is false, the previous Python code deferred the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this.
- `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`.
- Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list the directories that NCCL headers and libraries may reside in and search them in turn for the key header and library files. This is done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`.
  * The call of [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and the CUDA home directory. `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled.
  * Similarly, the call of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`.

But it also handles the edge cases that the Python code was intended to solve more properly:

- It only searches for `<prefix>/lib64` (and `<prefix>/lib32`) if it is appropriate on the system.
- It only searches for `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code, which searches for `lib/<arch>` in a generic way (e.g., the Python code searches for `/usr/lib/x86_64-linux-gnu` but in reality systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242 ).

---

Regarding relevant issues:

- pytorch#12063 and pytorch#2877: These are properly handled, as explained in the updated comment.
- pytorch#2941 does not change NCCL detection specifically for Windows (it changed CUDA detection).
- b7e258f: A versioned library detection is added, but the order is reversed: the unversioned library becomes preferred. This is because unversioned libraries are normally linked to versioned libraries and preferred by users, and local installations by users are often unversioned. As the documentation of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests:

> When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.
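The effect of listing the unversioned name first can be imitated in a few lines of Python. This is a simplified emulation, not CMake's implementation: it ignores `lib64`, `<arch>` sub-directories, caching and the `NAMES_PER_DIR` option, and the function and file names used here are illustrative assumptions. It does follow the default behaviour of considering one name at a time across all directories.

```python
import os


def find_library(names, prefixes, lib_subdir="lib"):
    """Return the first existing <prefix>/<lib_subdir>/<name>, or None.

    Names form the outer loop, so an earlier name found in any prefix
    wins over a later name found in an earlier prefix.
    """
    for name in names:
        for prefix in prefixes:
            candidate = os.path.join(prefix, lib_subdir, name)
            if os.path.isfile(candidate):
                return candidate
    return None
```

Called with `names=["libnccl.so", "libnccl.so.2"]`, a locally built `libnccl.so` in a later prefix is picked before a distribution-provided `libnccl.so.2` in an earlier one, which is the behaviour the quoted recommendation aims for.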



…A 3.5 (Strix Point/Halo) (pytorch#2877) Reverting preferred hipBLASLt backend for RDNA 3 and RDNA 3.5 (Strix Point/Halo) due to several internal tickets opened for perf regression compared to torch builds before merging ROCm#2741

Fix detection of nccl.h when libnccl.so is in /usr/lib/x86_64-linux-gnu and similar paths

2132ec3

pietern mentioned this pull request Sep 27, 2017

Add XYZ_INCLUDE_DIR / XYZ_LIB_DIR for CMake dependencies pytorch/gloo#91

Closed

ngimel mentioned this pull request Sep 27, 2017

Build from source does not support cuda 9.0 #2755

Closed

ducksoup added 2 commits September 28, 2017 10:22

full support for independent NCCL_LIB_DIR and NCCL_INCLUDE_DIR

7dea8e8

lint fix

d648b59

add back CUDA_HOME

4b47092

soumith merged commit e67c2bc into pytorch:master Sep 29, 2017

ezyang added the open source label Jun 24, 2019

ezyang mentioned this pull request Jul 3, 2019

Let CMake handle NCCL detection instead of our handcrafted Python script. #22480

Closed

xuhdev mentioned this pull request Jul 12, 2019

Let CMake handle NCCL detection instead of our handcrafted Python script. #22818

Closed

xuhdev mentioned this pull request Jul 16, 2019

Let CMake handle NCCL detection instead of our handcrafted Python script. #22930

Closed

soumith merged 4 commits into pytorch:master from ducksoup:nccl-detection
