🐛 Bug
Since 2020-Dec-11, the nightly CI builds of pytorch on ppc64le have been failing, starting right after the introduction of certain VSX functionality. With the gcc/g++ v8 compiler, the build progresses well until many of the shared libraries and executables are being linked, at which point a series of error messages is produced of the form:
(...object name...): undefined reference to (...symbol name...)
Because each symbol occurrence ends with ::VSX, I strongly suspect that it ties back to the functionality added in December at the very time the failures started. One example of the error:
/home/jenkins/pytorch/build/lib/libtorch_cpu.so: undefined reference to `at::native::DispatchStub<at::Tensor& (*)(at::Tensor&, at::Tensor const&, at::Tensor&, long), at::native::orgqr_stub>::VSX'
/home/jenkins/pytorch/build/lib/libtorch_cpu.so: undefined reference to `at::native::DispatchStub<std::tuple<at::Tensor, at::Tensor> (*)(at::Tensor const&, bool&), at::native::eig_stub>::VSX'
collect2: error: ld returned 1 exit status
The full command which produced the error above, within the build, was:
/usr/bin/c++ -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow -DHAVE_VSX_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic -L/usr/lib -pthread caffe2/CMakeFiles/mpi_test.dir/mpi/mpi_test.cc.o -o bin/mpi_test -Wl,-rpath,/usr/lib/powerpc64le-linux-gnu/openmpi/lib:/home/jenkins/pytorch/build/lib:/usr/local/cuda/lib64: lib/libgtest_main.a /usr/lib/powerpc64le-linux-gnu/openmpi/lib/libmpi_cxx.so /usr/lib/powerpc64le-linux-gnu/openmpi/lib/libmpi.so -Wl,--no-as-needed,"/home/jenkins/pytorch/build/lib/libtorch.so" -Wl,--as-needed -Wl,--no-as-needed,"/home/jenkins/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed lib/libprotobuf.a -Wl,--no-as-needed,"/home/jenkins/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed lib/libc10_cuda.so lib/libc10.so /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcufft.so /usr/local/cuda/lib64/libcurand.so /usr/local/cuda/lib64/libcublas.so /usr/lib/powerpc64le-linux-gnu/libcudnn.so lib/libgtest.a -pthread
To Reproduce
This is challenging, as you'd have to be on ppc64le (Power) hardware. However, you can view the full output of the nightly CI builds of the pytorch master branch, and (if desired) I have access to alter the build scripting for any changes that may be needed (e.g. I recently updated it to install gcc/g++ v8).
The GPU-based executions can be seen here:
https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le-gpu/
My suggestion is to open a recent run, select the console output to view the full run, and search for "undefined reference" in the output. For example, a recent run takes you to this output:
https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le-gpu/1045/consoleFull
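If searching the console output by hand gets tedious, the search can be scripted. The following is only a rough sketch, assuming the console text has been saved locally and that the linker messages have the same shape as the two lines quoted above. The function name `missing_stubs` and the regular expressions are my own illustration, not anything from the pytorch build tooling.

```python
import re

# Matches a GNU ld "undefined reference" line and captures the quoted symbol.
UNDEFINED_REF = re.compile(r"undefined reference to [`'](?P<symbol>.+)'\s*$")

# Pulls the stub name and the trailing capability suffix (e.g. VSX)
# out of a DispatchStub symbol such as "...at::native::eig_stub>::VSX".
STUB = re.compile(r"at::native::(?P<stub>\w+_stub)>::(?P<cap>\w+)$")


def missing_stubs(log_text):
    """Group undefined DispatchStub symbols by capability suffix.

    Hypothetical helper: returns e.g. {"VSX": ["eig_stub", "orgqr_stub"]}.
    """
    found = {}
    for line in log_text.splitlines():
        ref = UNDEFINED_REF.search(line)
        if ref is None:
            continue
        stub = STUB.search(ref.group("symbol"))
        if stub is not None:
            found.setdefault(stub.group("cap"), set()).add(stub.group("stub"))
    return {cap: sorted(stubs) for cap, stubs in found.items()}


# The two error lines quoted in this report, used as sample input.
SAMPLE_LOG = """\
/home/jenkins/pytorch/build/lib/libtorch_cpu.so: undefined reference to `at::native::DispatchStub<at::Tensor& (*)(at::Tensor&, at::Tensor const&, at::Tensor&, long), at::native::orgqr_stub>::VSX'
/home/jenkins/pytorch/build/lib/libtorch_cpu.so: undefined reference to `at::native::DispatchStub<std::tuple<at::Tensor, at::Tensor> (*)(at::Tensor const&, bool&), at::native::eig_stub>::VSX'
collect2: error: ld returned 1 exit status
"""

if __name__ == "__main__":
    for cap, stubs in missing_stubs(SAMPLE_LOG).items():
        print(cap, ", ".join(stubs))
```

Run against a full saved console log, this should give a short list of the affected stubs, which may help narrow down which kernels lack a VSX definition.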
There's an equivalent CPU nightly CI run at the same OSU lab, but it fails in the same way, so it is unlikely to offer anything new to look at. For reference, if desired: https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/
Expected behavior
The build should complete successfully (as was the case before 2020-Dec-11).
Environment
PPC64LE in an Ubuntu 18.04 Docker container, building the pytorch master branch.
I am doubtful that other env data (like python version, etc) is a factor here.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @malfet @seemethere @walterddr