Undefined reference errors to VSX symbols during build (link phase) on PPC64LE #51132

@dncliss

🐛 Bug

Since 2020-Dec-11, the nightly CI builds of pytorch on ppc64le have been failing, starting right after certain
VSX functionality was introduced. With the gcc/g++ v8 compiler, the build progresses well until
many of the shared libraries and executables are being linked, at which point a series of error messages is produced
of the form (...object name...): undefined reference to (...symbol name...). Because every symbol named in these errors ends with ::VSX, I strongly suspect that the failure ties back to the functionality added in December, at the very time the failures started. One example of the error:

/home/jenkins/pytorch/build/lib/libtorch_cpu.so: undefined reference to `at::native::DispatchStub<at::Tensor& (*)(at::Tensor&, at::Tensor const&, at::Tensor&, long), at::native::orgqr_stub>::VSX'
/home/jenkins/pytorch/build/lib/libtorch_cpu.so: undefined reference to `at::native::DispatchStub<std::tuple<at::Tensor, at::Tensor> (*)(at::Tensor const&, bool&), at::native::eig_stub>::VSX'
collect2: error: ld returned 1 exit status
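
For readers unfamiliar with this class of error, here is a minimal sketch of how a static member of a class template can be declared for every CPU capability yet defined for only some of them. It is a simplified model, not PyTorch's actual DispatchStub code, and the names Stub, add_tag, add_default, and call_add are hypothetical.

```cpp
// Hypothetical, simplified model of a per-CPU-capability dispatch stub.
// Each static member is only DECLARED inside the template; every
// (FnPtr, Tag) pair needs its own out-of-line definition.
template <typename FnPtr, typename Tag>
struct Stub {
  static FnPtr DEFAULT;
  static FnPtr VSX;

  // Referencing VSX here means the linker must find a definition for it.
  FnPtr choose(bool has_vsx) const {
    if (has_vsx && VSX != nullptr) {
      return VSX;
    }
    return DEFAULT;
  }
};

using add_fn = int (*)(int, int);
struct add_tag {};

int add_default(int a, int b) { return a + b; }

// An explicit specialization of a static data member that has an
// initializer is a definition.
template <> add_fn Stub<add_fn, add_tag>::DEFAULT = add_default;

// Registering a null pointer still provides the symbol. Deleting this
// line would be expected to fail at link time with an
// "undefined reference to Stub<...>::VSX" style error.
template <> add_fn Stub<add_fn, add_tag>::VSX = nullptr;

int call_add(bool has_vsx, int a, int b) {
  Stub<add_fn, add_tag> stub;
  return stub.choose(has_vsx)(a, b);
}
```

In this sketch, call_add(true, 2, 3) falls back to add_default because the VSX slot holds a null pointer. If the errors above have the same cause, some stubs (such as orgqr_stub and eig_stub) would be missing the equivalent VSX definition, but that is an assumption and not something the log confirms.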

The full command which produced the error above, within the build, was:

/usr/bin/c++ -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow -DHAVE_VSX_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic -L/usr/lib -pthread caffe2/CMakeFiles/mpi_test.dir/mpi/mpi_test.cc.o -o bin/mpi_test  -Wl,-rpath,/usr/lib/powerpc64le-linux-gnu/openmpi/lib:/home/jenkins/pytorch/build/lib:/usr/local/cuda/lib64:  lib/libgtest_main.a  /usr/lib/powerpc64le-linux-gnu/openmpi/lib/libmpi_cxx.so  /usr/lib/powerpc64le-linux-gnu/openmpi/lib/libmpi.so  -Wl,--no-as-needed,"/home/jenkins/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/home/jenkins/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  -Wl,--no-as-needed,"/home/jenkins/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /usr/local/cuda/lib64/libcudart.so  /usr/local/cuda/lib64/libnvToolsExt.so  /usr/local/cuda/lib64/libcufft.so  /usr/local/cuda/lib64/libcurand.so  /usr/local/cuda/lib64/libcublas.so  /usr/lib/powerpc64le-linux-gnu/libcudnn.so  lib/libgtest.a  -pthread

To Reproduce

This is challenging, as you'd have to be on ppc64le (Power) hardware. However, you can view the full output of the nightly CI builds of the pytorch master branch, and (if desired) I have access to alter the build scripting for any changes that may be needed (for example, I recently updated it to install gcc/g++ v8).

The GPU-based executions can be seen here:
https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le-gpu/
So my suggestion is to open a recent run, select the console output to view the full run, and search for "undefined reference" in the output. For example, a recent run takes you to this output:
https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le-gpu/1045/consoleFull
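
If the console output is saved locally, the search can be automated. The following is a small sketch that pulls out the symbol names from the undefined-reference lines and keeps only those ending in ::VSX. It assumes the GNU ld message format shown earlier in this report, and the function name missing_vsx_symbols is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Collect symbol names from linker lines of the form
//   <object>: undefined reference to `<symbol>'
// keeping only the symbols that end in "::VSX".
std::vector<std::string> missing_vsx_symbols(const std::string& log) {
  const std::string marker = "undefined reference to `";
  const std::string suffix = "::VSX";
  std::vector<std::string> out;

  std::istringstream in(log);
  std::string line;
  while (std::getline(in, line)) {
    const auto start = line.find(marker);
    if (start == std::string::npos) {
      continue;  // not an undefined-reference line
    }
    const auto begin = start + marker.size();
    const auto end = line.rfind('\'');  // closing quote after the symbol
    if (end == std::string::npos || end <= begin) {
      continue;  // malformed or empty symbol
    }
    const std::string symbol = line.substr(begin, end - begin);
    if (symbol.size() >= suffix.size() &&
        symbol.compare(symbol.size() - suffix.size(), suffix.size(),
                       suffix) == 0) {
      out.push_back(symbol);
    }
  }
  return out;
}
```

Passing the saved console text to missing_vsx_symbols gives one entry per matching line, so a symbol reported by several link steps appears more than once.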

There's an equivalent CPU nightly CI run at the same OSU lab, but it fails in the same way, so it likely doesn't offer anything additional to look at. For reference, if desired: https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/

Expected behavior

The build should complete successfully (as was the case before Dec 11).

Environment

PPC64LE in an Ubuntu 18.04 docker container, building the pytorch master branch.
I doubt that other environment data (such as the python version) is a factor here.

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @malfet @seemethere @walterddr

Labels

high priority
module: build (build system issues)
module: static linking (related to statically linked libtorch; we dynamically link by default)
triage review
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
