Skip to content

Let CMake handle NCCL detection instead of our handcrafted Python script.#22930

Closed
xuhdev wants to merge 3 commits intopytorch:masterfrom
xuhdev:build/detect-nccl
Closed

Let CMake handle NCCL detection instead of our handcrafted Python script.#22930
xuhdev wants to merge 3 commits intopytorch:masterfrom
xuhdev:build/detect-nccl

Conversation

@xuhdev
Copy link
Collaborator

@xuhdev xuhdev commented Jul 16, 2019


How does the current code subsume all detections in the deleted nccl.py?

  • The dependency of USE_NCCL on the OS and USE_CUDA is handled as dependency options in CMakeLists.txt.

  • The main NCCL detection happens in FindNCCL.cmake, which is called by nccl.cmake. When USE_SYSTEM_NCCL is false, the previous Python code defer the detection to find_package(NCCL). The change in nccl.cmake retains this.

  • USE_STATIC_NCCL in the previous Python code simply changes the name of the detected library. This is done in IF (USE_STATIC_NCCL).

  • Now we only need to look at how the lines below line 20 in nccl.cmake are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by find_path for headers and find_library for the library files in FindNCCL.cmake.

    • The call of find_path (Search for NO_DEFAULT_PATH in the link) by default searches for headers in <prefix>/include for each <prefix> in CMAKE_PREFIX_PATH and CMAKE_SYSTEM_PREFIX_PATH. Like the Python code, this commit sets CMAKE_PREFIX_PATH to search for <prefix> in NCCL_ROOT_DIR and home to CUDA. CMAKE_SYSTEM_PREFIX_PATH includes the standard directories such as /usr/local and /usr. NCCL_INCLUDE_DIR is also specifically handled.

    • Similarly, the call of find_library (Search for NO_DEFAULT_PATH in the link) by default searches for libraries in directories including <prefix>/lib for each <prefix> in CMAKE_PREFIX_PATH and CMAKE_SYSTEM_PREFIX_PATH. But it also handles the edge cases intended to be solved in the Python code more properly:

      • It only searches for <prefix>/lib64 (and <prefix>/lib32) if it is appropriate on the system.
      • It only searches for <prefix>/lib/<arch> for the right <arch>, unlike the Python code searches for lib/<arch> in a generic way (e.g., the Python code searches for /usr/lib/x86_64-linux-gnu but in reality systems have /usr/lib/x86_64-some-customized-name-linux-gnu, see https://unix.stackexchange.com/a/226180/38242 ).

Regarding for relevant issues:

When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.

@pytorchbot pytorchbot added module: build Build system issues module: nccl Problems related to nccl support labels Jul 16, 2019
@xuhdev xuhdev changed the title Let CMake handle NCCL detection instead of our handcrafted Python script. [WIP] Let CMake handle NCCL detection instead of our handcrafted Python script. Jul 16, 2019
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 16, 2019

Waiting for CI... This is identical to #22818

@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 16, 2019

@ezyang This PR works quite well on PR CIs, but fails on those that are only run on master. Is there a way to see what's happening in those environments?

@xuhdev xuhdev changed the title [WIP] Let CMake handle NCCL detection instead of our handcrafted Python script. Let CMake handle NCCL detection instead of our handcrafted Python script. Jul 18, 2019
@xuhdev xuhdev requested a review from ezyang July 19, 2019 17:51
@cpuhrsch cpuhrsch added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 19, 2019
@ezyang
Copy link
Contributor

ezyang commented Jul 19, 2019

@xuhdev There is only a janky way, which is to edit .circleci so that it adds the other environments, and then push a temporary commit for the testing. Go to .circleci/cimodel/data/pytorch_build_data.py, edit the configs you want to be XImportant, and then run .circleci/regenerate.sh

@xuhdev xuhdev force-pushed the build/detect-nccl branch from 8c162a3 to 7ac63e7 Compare July 22, 2019 20:32
@pytorchbot pytorchbot added the module: ci Related to continuous integration label Jul 22, 2019
@xuhdev xuhdev force-pushed the build/detect-nccl branch 7 times, most recently from d998fda to f9ae21f Compare July 22, 2019 23:22
xuhdev added 2 commits July 22, 2019 20:33
…ipt.

---

How does the current code subsume all detections in the deleted `nccl.py`?

- The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`.

- The main NCCL detection happens in [FindNCCL.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/Modules/FindNCCL.cmake), which is called by [nccl.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/External/nccl.cmake). When `USE_SYSTEM_NCCL` is false, the previous Python code defer the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this.

- `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`.

- Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`.
  * The call of [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and home to CUDA.  `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled.

  * Similarly, the call of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. But it also handles the edge cases intended to be solved in the Python code more properly:
     - It only searches for `<prefix>/lib64` (and `<prefix>/lib32`) if it is appropriate on the system.
     - It only searches for `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code searches for `lib/<arch>` in a generic way (e.g., the Python code searches for `/usr/lib/x86_64-linux-gnu` but in reality systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242 ).

---

Regarding for relevant issues:

- pytorch#12063 and pytorch#2877: These are properly handled, as explained in the updated comment.
- pytorch#2941 does not changes NCCL detection specifically for Windows (it changed CUDA detection).
- b7e258f A versioned library detection is added, but the order is reversed: The unversioned library becomes preferred. This is because normally unversioned libraries are linked to versioned libraries and preferred by users, and local installation by users are often unversioned. Like the document of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests:

> When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.
@xuhdev xuhdev force-pushed the build/detect-nccl branch from f9ae21f to ceb772a Compare July 23, 2019 03:33
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 23, 2019

@ezyang The PR should be ready now. The error before was caused by the default value of USE_SYSTEM_NCCL -- it was 0 but I changed it to 1 (The Docker image contains no system NCCL). Feel free to review when you have time!

@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 23, 2019

There are currently two commits in this PR. The first one is the real content and the second one is CI tweak.

@ezyang
Copy link
Contributor

ezyang commented Jul 23, 2019

This LGTM. I'll go ahead and remove the CI twiddle and land.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ezyang merged this pull request in 60c46dd.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 23, 2019
…ipt. (#22930)

Summary:
 ---

How does the current code subsume all detections in the deleted `nccl.py`?

- The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`.

- The main NCCL detection happens in [FindNCCL.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/Modules/FindNCCL.cmake), which is called by [nccl.cmake](https://github.com/pytorch/pytorch/blob/8377d4b32c12206a0f9401e81a5e5796c8fc01a8/cmake/External/nccl.cmake). When `USE_SYSTEM_NCCL` is false, the previous Python code defer the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this.

- `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`.

- Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`.
  * The call of [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and home to CUDA.  `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled.

  * Similarly, the call of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. But it also handles the edge cases intended to be solved in the Python code more properly:
     - It only searches for `<prefix>/lib64` (and `<prefix>/lib32`) if it is appropriate on the system.
     - It only searches for `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code searches for `lib/<arch>` in a generic way (e.g., the Python code searches for `/usr/lib/x86_64-linux-gnu` but in reality systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242 ).

 ---

Regarding for relevant issues:

- pytorch/pytorch#12063 and pytorch/pytorch#2877: These are properly handled, as explained in the updated comment.
- pytorch/pytorch#2941 does not changes NCCL detection specifically for Windows (it changed CUDA detection).
- b7e258f81ef61d19b884194cdbcd6c7089636d46 A versioned library detection is added, but the order is reversed: The unversioned library becomes preferred. This is because normally unversioned libraries are linked to versioned libraries and preferred by users, and local installation by users are often unversioned. Like the document of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests:

> When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.
Pull Request resolved: pytorch/pytorch#22930

Differential Revision: D16440275

Pulled By: ezyang

fbshipit-source-id: 11fe80743d4fe89b1ed6f96d5d996496e8ec01aa
@xuhdev xuhdev deleted the build/detect-nccl branch July 23, 2019 17:09
@jithunnair-amd
Copy link
Collaborator

@xuhdev Do I interpret the below change correctly to mean that USE_SYSTEM_NCCL will always be set to OFF by cmake, unless the user explicitly sets it to ON during the build using the USE_SYSTEM_NCCL=1 env variable?

cmake_dependent_option(
    USE_SYSTEM_NCCL "Use system-wide NCCL" OFF
    "USE_NCCL" OFF)

@xuhdev
Copy link
Collaborator Author

xuhdev commented Aug 1, 2019

@jithunnair-amd What you said is right, but both for before and after this change. The changed lines you have quoted also make sure that the option USE_SYSTEM_NCCL does not show up if USE_NCCL is OFF (e.g., in cmake-gui, ccmake, etc.)

@jithunnair-amd
Copy link
Collaborator

@xuhdev Need some cmake expertise: in the file CMakeLists.txt, I need to ensure that USE_NCCL can be set for either CUDA or ROCM builds (for which the option will enable the RCCL library instead). How do I change the below to enable that?

cmake_dependent_option(
    USE_NCCL "Use NCCL" ON
"USE_CUDA;UNIX;NOT APPLE" OFF)

@xuhdev
Copy link
Collaborator Author

xuhdev commented Aug 2, 2019

@jithunnair-amd The cleaning way I think is to add a USE_GPU variable, because USE_CUDA OR USE_ROCM and its equivalent seem being used a lot:

if (USE_CUDA OR USE_ROCM)
  set(USE_GPU TRUE)
else()
  set(USE_GPU FALSE)
endif()

Then replace USE_CUDA in that statement with USE_GPU.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged module: build Build system issues module: ci Related to continuous integration module: nccl Problems related to nccl support open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants