[BUG] torch-nightly: linker issue with cpu_adam.so #1625

@stas00

When HF CI runs the DeepSpeed tests against torch-nightly, I get multiple failures involving cpu_adam.so.

Most tests fail with either:

Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf967353a0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 97, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

or:

           Traceback (most recent call last):
E             File "/__w/transformers/transformers/examples/pytorch/summarization/run_summarization.py", line 648, in <module>
E               main()
E             File "/__w/transformers/transformers/examples/pytorch/summarization/run_summarization.py", line 570, in main
E               train_result = trainer.train(resume_from_checkpoint=checkpoint)
E             File "/__w/transformers/transformers/src/transformers/trainer.py", line 1163, in train
E               deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
E             File "/__w/transformers/transformers/src/transformers/deepspeed.py", line 406, in deepspeed_init
E               model, optimizer, _, lr_scheduler = deepspeed.initialize(
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
E               engine = DeepSpeedEngine(args=args,
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
E               self._configure_optimizer(optimizer, model_parameters)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1106, in _configure_optimizer
E               basic_optimizer = self._configure_basic_optimizer(model_parameters)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1191, in _configure_basic_optimizer
E               optimizer = DeepSpeedCPUAdam(model_parameters,
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 83, in __init__
E               self.ds_opt_adam = CPUAdamBuilder().load()
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 370, in load
E               return self.jit_load(verbose)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 402, in jit_load
E               op_module = load(
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1130, in load
E               return _jit_compile(
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1368, in _jit_compile
E               return _import_module_from_library(name, build_directory, is_python_module)
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1758, in _import_module_from_library
E               module = importlib.util.module_from_spec(spec)
E             File "<frozen importlib._bootstrap>", line 556, in module_from_spec
E             File "<frozen importlib._bootstrap_external>", line 1101, in create_module
E             File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
E           ImportError: /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/cpu_adam.so: undefined symbol: curandCreateGenerator

(e.g. test: test_can_resume_training_normal_0_zero2, but almost all tests fail)
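
The two failures above are probably one problem seen twice. The second traceback shows `__init__` raising at `self.ds_opt_adam = CPUAdamBuilder().load()`, so the attribute is never assigned; Python still calls `__del__` on the half-constructed object, which yields the first message. Below is a minimal sketch of that mechanism. `load_extension`, `FakeOptimizer`, `GuardedOptimizer` and `try_build` are hypothetical stand-ins for illustration, not DeepSpeed code.

```python
def load_extension():
    # Stand-in for the real extension load; fails the way the CI box does.
    raise ImportError("cpu_adam.so: undefined symbol: curandCreateGenerator")


class FakeOptimizer:
    def __init__(self):
        # Raises before the attribute is ever assigned.
        self.ext = load_extension()

    def __del__(self):
        # Still runs after a failed __init__, so self.ext does not exist here.
        self.ext.destroy()


class GuardedOptimizer(FakeOptimizer):
    def __del__(self):
        # Only clean up if the extension was actually loaded.
        ext = getattr(self, "ext", None)
        if ext is not None:
            ext.destroy()


def try_build(cls):
    """Return the ImportError message raised while constructing cls, if any."""
    try:
        cls()
    except ImportError as exc:
        return str(exc)
    return None


print(try_build(FakeOptimizer))
print(try_build(GuardedOptimizer))
```

On CPython, building `FakeOptimizer` should also print an "Exception ignored in ... `__del__`" message with an `AttributeError` on stderr, while `GuardedOptimizer` surfaces only the `ImportError`. Either way, the `ImportError` is the failure to chase.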

The compilation went through just fine:

Installed CUDA version 11.2 does not match the version torch was compiled with 11.1 but since the APIs are compatible, accepting this combination
Using /github/home/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Creating extension directory /github/home/.cache/torch_extensions/py38_cu111/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
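
One detail in the log may be worth checking: the link step [3/3] passes `-lcudart` but no `-lcurand`, although the failing symbol comes from cuRAND. Whether that missing flag is the actual cause is not confirmed here. The sketch below just lists the `-l` flags from that command; `linked_libraries` is a hypothetical helper written for this illustration.

```python
import shlex

# Link step [3/3], copied from the build log above.
LINK_CMD = (
    "c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared "
    "-L/opt/conda/lib/python3.8/site-packages/torch/lib "
    "-lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp "
    "-ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so"
)


def linked_libraries(command):
    """Return the library names passed as -l<name> flags, in order."""
    return [
        token[2:]
        for token in shlex.split(command)
        if token.startswith("-l") and len(token) > 2
    ]


libs = linked_libraries(LINK_CMD)
print(libs)
print("curand linked explicitly:", "curand" in libs)
```

If the flag does turn out to matter, `torch.utils.cpp_extension.load` accepts an `extra_ldflags` list, which is one place such a flag could be supplied.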

It must be something specific to that box, since I can't reproduce these problems on my own machine with the same torch-nightly version and py38.

But if I check on my home box (where things work):

nm ~/.cache/torch_extensions/py38_cu113/cpu_adam/cpu_adam.so | grep curandCreateGenerator
                 U curandCreateGenerator

So curandCreateGenerator is undefined in cpu_adam.so on the working box as well, and it's used here:

https://github.com/microsoft/DeepSpeed/blob/91e15593ea4487014114a03c7b4a2a05567fd3f8/csrc/includes/context.h#L46

but for some reason it doesn't cause a problem on my setup. Perhaps it's a linker issue, where some library (libcurand?) doesn't get linked properly?
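
An undefined (`U`) entry in `nm` output is normal for a shared object: the symbol only has to be resolved at load time, either by one of the object's own `NEEDED` dependencies or by a library already loaded in the process. One possible explanation, which is an assumption rather than something verified on either box, is that the working setup resolves it through a library that is already loaded. A minimal sketch for comparing the two machines follows. The helper names are hypothetical, and `diagnose` assumes `nm` and `readelf` from binutils are on `PATH`.

```python
import re
import subprocess


def undefined_symbols(nm_output):
    """Parse `nm` output and return the names marked 'U' (undefined)."""
    names = set()
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "U":
            names.add(parts[1])
    return names


def needed_libraries(readelf_output):
    """Parse `readelf -d` output and return the NEEDED library names."""
    return re.findall(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]", readelf_output)


def diagnose(so_path, symbol="curandCreateGenerator"):
    """Report whether symbol is undefined and whether libcurand is a dependency."""
    nm_out = subprocess.run(
        ["nm", "-D", so_path], capture_output=True, text=True, check=True
    ).stdout
    dyn_out = subprocess.run(
        ["readelf", "-d", so_path], capture_output=True, text=True, check=True
    ).stdout
    needed = needed_libraries(dyn_out)
    return {
        "symbol_undefined": symbol in undefined_symbols(nm_out),
        "needs_libcurand": any(lib.startswith("libcurand") for lib in needed),
        "needed": needed,
    }
```

Running `diagnose` on the cached `cpu_adam.so` from each machine would show whether the two builds differ in their `NEEDED` entries or only in what is already loaded at import time.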

Thank you!

@RezaYazdaniAminabadi, @jeffra, @tjruwase
