Describe the bug
The goal is to cross-compile on a VM that has no GPUs for other VMs that have GPUs (preparing docker images for CIs). Currently the prebuilding fails.
To Reproduce
Steps to reproduce the behavior:
Normally, when GPUs are available, everything works. Now let's emulate a machine with no GPUs installed by setting CUDA_VISIBLE_DEVICES= and repeating the same prebuild command, which then fails:
$ CUDA_VISIBLE_DEVICES= TORCH_CUDA_ARCH_LIST="6.1;8.0;8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
WARNING: Disabling all use of wheels due to the use of --build-option / --global-option / --install-option.
Using pip 22.0.4 from /home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/pip (python 3.8)
Obtaining file:///mnt/nvme0/code/github/00optimize/deepspeed
Preparing metadata (setup.py): started
Running command python setup.py egg_info
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/mnt/nvme0/code/github/00optimize/deepspeed/setup.py", line 238, in <module>
bf16_support = torch.cuda.is_bf16_supported()
File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch/cuda/__init__.py", line 92, in is_bf16_supported
return torch.cuda.get_device_properties(torch.cuda.current_device()).major >= 8 and cuda_maj_decide
File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch/cuda/__init__.py", line 481, in current_device
_lazy_init()
File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.6'
[WARNING] Torch did not find cuda available, if cross-compiling or running with cpu only you can ignore this message. Adding compute capability for Pascal, Volta, and Turing (compute capabilities 6.0, 6.1, 6.2)
DS_BUILD_OPS=0
Installed CUDA version 11.6 does not match the version torch was compiled with 11.5 but since the APIs are compatible, accepting this combination
Install Ops={'cpu_adam': 1, 'cpu_adagrad': False, 'fused_adam': False, 'fused_lamb': False, 'sparse_attn': False, 'transformer': False, 'stochastic_transformer': False, 'async_io': 1, 'utils': 1, 'quantizer': False, 'transformer_inference': False}
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
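Note that the log above already shows the build knows how to pick targets without a device: when torch can't see a GPU it falls back to default compute capabilities (6.0, 6.1, 6.2). A minimal sketch of deriving the nvcc target codes purely from TORCH_CUDA_ARCH_LIST, never touching torch.cuda (this is a hypothetical helper for illustration, not DeepSpeed's actual code):

```python
import os

def cross_compile_capabilities(default="6.0;6.1;6.2"):
    """Derive nvcc-style compute-capability codes from TORCH_CUDA_ARCH_LIST.

    Hypothetical helper: reads only the environment, so it works on a
    build host with no GPUs. Falls back to `default` when the variable
    is unset, mirroring the warning in the build log above.
    """
    arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST", default)
    # e.g. "6.1;8.0;8.6" -> ["61", "80", "86"]
    return [cap.strip().replace(".", "")
            for cap in arch_list.split(";") if cap.strip()]
```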
Expected behavior
I wonder if we need a new flag that tells the prebuild not to check whether an actual GPU is installed.
We are already using TORCH_CUDA_ARCH_LIST to cross-compile for the target GPUs.
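The traceback shows the failure comes from calling torch.cuda.is_bf16_supported() unconditionally in setup.py, which initializes CUDA and raises on a GPU-less host. One possible shape of a fix is a guarded probe that returns False instead of raising; a minimal sketch (the helper names are my own, not DeepSpeed's):

```python
import os

def cuda_devices_visible() -> bool:
    """Return False when CUDA_VISIBLE_DEVICES explicitly hides all GPUs.

    Unset means "no restriction"; an empty string or "-1" hides all devices.
    """
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    return env is None or env.strip() not in ("", "-1")

def detect_bf16_support() -> bool:
    """Best-effort bf16 probe that never raises on a GPU-less build host."""
    if not cuda_devices_visible():
        return False
    try:
        import torch
        return torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    except Exception:
        # e.g. RuntimeError: No CUDA GPUs are available
        return False
```

With such a guard, the prebuild could still honor TORCH_CUDA_ARCH_LIST for cross-compilation while skipping any device query that would fail in a CI image-building VM.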
ds_report output
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu115
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed']
deepspeed info ................... 0.6.6+828ab718, 828ab718, master
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
System info (please complete the following information):
- OS: [e.g. Ubuntu 21.10]
- Python version 3.8
Thank you.
cc: @ydshieh, who originally reported this