Skip to content

Let CMake handle NCCL detection instead of our handcrafted Python script.#22818

Closed
xuhdev wants to merge 8 commits intogh/xuhdev/13/basefrom
gh/xuhdev/13/head
Closed

Let CMake handle NCCL detection instead of our handcrafted Python script.#22818
xuhdev wants to merge 8 commits intogh/xuhdev/13/basefrom
gh/xuhdev/13/head

Conversation

@xuhdev
Copy link
Collaborator

@xuhdev xuhdev commented Jul 12, 2019

Stack from ghstack:

  • USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
    options that depend on other options.
  • Detections using folders such as lib64, lib64/ are automatically
    handled in find_library in CMake.
  • NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
    standard CMake variable that specifies the root dir of a library (and
    therefore supports a wide range of automatic detection). E.g., see
    https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html

How does the current code subsume all detections in the deleted nccl.py?

  • The dependency of USE_NCCL on the OS and USE_CUDA is handled as dependency options in CMakeLists.txt.

  • The main NCCL detection happens in FindNCCL.cmake, which is called by nccl.cmake. When USE_SYSTEM_NCCL is false, the previous Python code defer the detection to find_package(NCCL). The change in nccl.cmake retains this.

  • USE_STATIC_NCCL in the previous Python code simply changes the name of the detected library. This is done in IF (USE_STATIC_NCCL).

  • Now we only need to look at how the lines below line 20 in nccl.cmake are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by find_path for headers and find_library for the library files in FindNCCL.cmake.

    • The call of find_path (Search for NO_DEFAULT_PATH in the link) by default searches for headers in <prefix>/include for each <prefix> in CMAKE_PREFIX_PATH and CMAKE_SYSTEM_PREFIX_PATH. Like the Python code, this commit sets CMAKE_PREFIX_PATH to search for <prefix> in NCCL_ROOT_DIR and home to CUDA. CMAKE_SYSTEM_PREFIX_PATH includes the standard directories such as /usr/local and /usr. NCCL_INCLUDE_DIR is also specifically handled.

    • Similarly, the call of find_library (Search for NO_DEFAULT_PATH in the link) by default searches for libraries in directories including <prefix>/lib for each <prefix> in CMAKE_PREFIX_PATH and CMAKE_SYSTEM_PREFIX_PATH. But it also handles the edge cases intended to be solved in the Python code more properly:

      • It only searches for <prefix>/lib64 (and <prefix>/lib32) if it is appropriate on the system.
      • It only searches for <prefix>/lib/<arch> for the right <arch>, unlike the Python code searches for lib/<arch> in a generic way (e.g., the Python code searches for /usr/lib/x86_64-linux-gnu but in reality systems have /usr/lib/x86_64-some-customized-name-linux-gnu, see https://unix.stackexchange.com/a/226180/38242 ).

Regarding for relevant issues:

When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.

Differential Revision: D16281714

…ipt.

- USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
  options that depend on other options.
- Detections using folders such as lib64, lib64/<arch> are automatically
  handled in find_library in CMake.
- NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
  standard CMake variable that specifies the root dir of a library (and
  therefore supports a wide range of automatic detection). E.g., see
  https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html
@lemairecarl
Copy link
Contributor

lemairecarl commented Jul 12, 2019

Hi! I'm wondering if this PR will fix my issue. I'm having problems compiling with libtorch on a Linux installation where CUDA is not in /usr/local/cuda.
I found a workaround: I remove the references to that folder in share/cmake/Caffe2/Caffe2Targets.cmake. I use this command:

sed -i -e 's/\/usr\/local\/cuda\/lib64\/libculibos.a;dl;\/usr\/local\/cuda\/lib64\/libculibos.a;//g' share/cmake/Caffe2/Caffe2Targets.cmake

Do you think this PR handles my case?

@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 12, 2019

Hi! I'm wondering if this PR will fix my issue. I'm having problems compiling with libtorch on a Linux installation where CUDA is not in /usr/local/cuda.
I found a workaround: I remove the references to that folder in share/cmake/Caffe2/Caffe2Targets.cmake. I use this command:

sed -i -e 's/\/usr\/local\/cuda\/lib64\/libculibos.a;dl;\/usr\/local\/cuda\/lib64\/libculibos.a;//g' share/cmake/Caffe2/Caffe2Targets.cmake

Do you think this PR handles my case?

This PR is about NCCL detection, not CUDA. You can try to set CUDA=/path/to/your/cuda/home. If it still doesn't work, feel free to start a discussion or open a new issue!

… Python script."

Let CMake handle NCCL detection instead of our handcrafted Python script.

- USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
  options that depend on other options.
- Detections using folders such as lib64, lib64/<arch> are automatically
  handled in find_library in CMake.
- NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
  standard CMake variable that specifies the root dir of a library (and
  therefore supports a wide range of automatic detection). E.g., see
  https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html

gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
@jerryzh168 jerryzh168 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 13, 2019
… Python script."

Let CMake handle NCCL detection instead of our handcrafted Python script.

- USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
  options that depend on other options.
- Detections using folders such as lib64, lib64/<arch> are automatically
  handled in find_library in CMake.
- NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
  standard CMake variable that specifies the root dir of a library (and
  therefore supports a wide range of automatic detection). E.g., see
  https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html

gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
@ezyang
Copy link
Contributor

ezyang commented Jul 15, 2019

After this PR the environment variable NCCL_ROOT_DIR is no longer respected. It should be, especially because gloo has its own copy of Findnccl and will respect the envvar. (Arguably, we should make sure that we feed gloo cmake our own detection instead of using its settings.)

@ezyang ezyang requested a review from kostmo July 15, 2019 13:45
SET(NCCL_LIBNAME "nccl")
ENDIF()
if (NCCL_VERSION) # Prefer the versioned library if a specific NCCL version is specified
set(CMAKE_FIND_LIBRARY_SUFFIXES ".so.${NCCL_VERSION}" ${CMAKE_FIND_LIBRARY_SUFFIXES})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about the placement of this line. It occurs inside the USE_STATIC_NCCL block. Doesn't that mean that it will never actually usefully trigger (because we'll be linking against static nccl?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is under the else-clause? This line will be triggered only if USE_STATIC_NCCL is false.

${NCCL_ROOT_DIR}/lib/x86_64-linux-gnu
${NCCL_ROOT_DIR}/lib64
${CUDA_TOOLKIT_ROOT_DIR}/lib64)
HINTS ${NCCL_LIB_DIR})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually a little confused by what the manual says here; I think in practice this means that we will stop looking in lib64 location because cmake will only look in that directory if FIND_LIBRARY_USE_LIB64_PATHS is set (and I don't think it is set on ordinary platforms). I don't think this matters though, because I checked some of my system nccl installs and they appear to be in non-lib64 directory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think FIND_LIBRARY_USE_LIB64_PATHS is set when it is proper (such as a multiarch system) to do so. From the document of find_library:

If the FIND_LIBRARY_USE_LIB64_PATHS global property is set all search paths will be tested as normal, with 64/ appended, and with all matches of lib/ replaced with lib64/. This property is automatically set for the platforms that are known to need it if at least one of the languages supported by the project() command is enabled.

Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good but I am concerned about the loss of NCCL_ROOT_DIR envvar support. (The dynamic library version pinning is also probably broken, but that error seems pretty benign.)

xuhdev added 5 commits July 15, 2019 11:01
… Python script."

Let CMake handle NCCL detection instead of our handcrafted Python script.

- USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
  options that depend on other options.
- Detections using folders such as lib64, lib64/<arch> are automatically
  handled in find_library in CMake.
- NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
  standard CMake variable that specifies the root dir of a library (and
  therefore supports a wide range of automatic detection). E.g., see
  https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html

gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script."

Let CMake handle NCCL detection instead of our handcrafted Python script.

- USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
  options that depend on other options.
- Detections using folders such as lib64, lib64/<arch> are automatically
  handled in find_library in CMake.
- NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
  standard CMake variable that specifies the root dir of a library (and
  therefore supports a wide range of automatic detection). E.g., see
  https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html

gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script."

Let CMake handle NCCL detection instead of our handcrafted Python script.

- USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
  options that depend on other options.
- Detections using folders such as lib64, lib64/<arch> are automatically
  handled in find_library in CMake.
- NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
  standard CMake variable that specifies the root dir of a library (and
  therefore supports a wide range of automatic detection). E.g., see
  https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html

gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script."

Let CMake handle NCCL detection instead of our handcrafted Python script.

- USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
  options that depend on other options.
- Detections using folders such as lib64, lib64/<arch> are automatically
  handled in find_library in CMake.
- NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
  standard CMake variable that specifies the root dir of a library (and
  therefore supports a wide range of automatic detection). E.g., see
  https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html

gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
… Python script."

Let CMake handle NCCL detection instead of our handcrafted Python script.

- USE_NCCL, USE_SYSTEM_NCCL, USE_STATIC_NCCL are made proper CMake build
  options that depend on other options.
- Detections using folders such as lib64, lib64/<arch> are automatically
  handled in find_library in CMake.
- NCCL_ROOT_DIR in findNCCL.cmake is changed to NCCL_ROOT, which is the
  standard CMake variable that specifies the root dir of a library (and
  therefore supports a wide range of automatic detection). E.g., see
  https://cmake.org/cmake/help/latest/envvar/PackageName_ROOT.html

gh-metadata: pytorch pytorch 22818 gh/xuhdev/13/head
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 15, 2019

After this PR the environment variable NCCL_ROOT_DIR is no longer respected. It should be, especially because gloo has its own copy of Findnccl and will respect the envvar. (Arguably, we should make sure that we feed gloo cmake our own detection instead of using its settings.)

Do you also wanna use this findNCCL once review is finished? This copy of findNCCL is actually pretty generic.

@xuhdev xuhdev requested a review from ezyang July 15, 2019 19:11
@ezyang
Copy link
Contributor

ezyang commented Jul 16, 2019

Do you also wanna use this findNCCL once review is finished? This copy of findNCCL is actually pretty generic.

I think the gloo bits live in another repository, so we can't easily replace it.

@zou3519
Copy link
Contributor

zou3519 commented Jul 16, 2019

@ezyang could you land this when you get a chance?

@zhangguanheng66
Copy link
Contributor

@ezyang Same request as Richard. Thanks.

@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 18, 2019

I reopened this is #22930 . For some reason this actually has already been landed but was reverted later. The stack here is actually meaningless now.

@ezyang
Copy link
Contributor

ezyang commented Jul 19, 2019

Thanks, I'll land that one.

@ezyang ezyang closed this Jul 19, 2019
@facebook-github-bot facebook-github-bot deleted the gh/xuhdev/13/head branch October 28, 2019 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module: build Build system issues module: nccl Problems related to nccl support open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants