
[BUG] Impossible to prebuild w/o having at least one gpu #2010

@stas00

Description

Describe the bug

The goal is to cross-compile on a VM that has no GPUs for other VMs that do have GPUs (preparing Docker images for CI). Currently the prebuild fails.

To Reproduce
Steps to reproduce the behavior:

Normally, when GPUs are available, everything works. Now let's emulate a machine with no GPUs by setting CUDA_VISIBLE_DEVICES= and repeating the same prebuild command - this time it fails:

$ CUDA_VISIBLE_DEVICES= TORCH_CUDA_ARCH_LIST="6.1;8.0;8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1  DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 
WARNING: Disabling all use of wheels due to the use of --build-option / --global-option / --install-option.
Using pip 22.0.4 from /home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/pip (python 3.8)
Obtaining file:///mnt/nvme0/code/github/00optimize/deepspeed
  Preparing metadata (setup.py): started
  Running command python setup.py egg_info
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/mnt/nvme0/code/github/00optimize/deepspeed/setup.py", line 238, in <module>
      bf16_support = torch.cuda.is_bf16_supported()
    File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch/cuda/__init__.py", line 92, in is_bf16_supported
      return torch.cuda.get_device_properties(torch.cuda.current_device()).major >= 8 and cuda_maj_decide
    File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch/cuda/__init__.py", line 481, in current_device
      _lazy_init()
    File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
      torch._C._cuda_init()
  RuntimeError: No CUDA GPUs are available
  No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.6'
  [WARNING] Torch did not find cuda available, if cross-compiling or running with cpu only you can ignore this message. Adding compute capability for Pascal, Volta, and Turing (compute capabilities 6.0, 6.1, 6.2)
  DS_BUILD_OPS=0
  Installed CUDA version 11.6 does not match the version torch was compiled with 11.5 but since the APIs are compatible, accepting this combination
  Install Ops={'cpu_adam': 1, 'cpu_adagrad': False, 'fused_adam': False, 'fused_lamb': False, 'sparse_attn': False, 'transformer': False, 'stochastic_transformer': False, 'async_io': 1, 'utils': 1, 'quantizer': False, 'transformer_inference': False}
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
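
For reference, the failure comes from setup.py probing the local device: torch.cuda.is_bf16_supported() calls torch.cuda.current_device(), which lazily initializes CUDA and raises when no GPU is visible. A minimal sketch of a guarded probe (an illustration of the idea only, not the actual DeepSpeed fix):

    import torch

    def bf16_supported_safe() -> bool:
        # torch.cuda.is_bf16_supported() ends up calling torch.cuda.current_device(),
        # which raises "RuntimeError: No CUDA GPUs are available" on a GPU-less machine.
        # Checking availability first avoids touching the CUDA runtime at all.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.is_bf16_supported()

    bf16_support = bf16_supported_safe()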

Expected behavior

Perhaps we need a new flag that tells the prebuild not to check whether an actual GPU is installed?

We are already using TORCH_CUDA_ARCH_LIST to cross-compile for the target GPUs, so no live device should be needed; a sketch of what such a flag could look like follows.
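
For illustration only, such a flag could be wired into the capability probe roughly like this; the name DS_CROSS_COMPILE and the arch-list parsing are assumptions of mine, not an existing DeepSpeed option:

    import os
    import torch

    # Hypothetical flag name - DeepSpeed does not currently define DS_CROSS_COMPILE.
    cross_compile = os.environ.get("DS_CROSS_COMPILE", "0") == "1"

    if cross_compile or not torch.cuda.is_available():
        # No live GPU to query: derive bf16 support from the requested target
        # architectures in TORCH_CUDA_ARCH_LIST (e.g. "6.1;8.0;8.6"), where
        # compute capability >= 8.0 (Ampere) implies bf16 support.
        archs = os.environ.get("TORCH_CUDA_ARCH_LIST", "")
        bf16_support = any(
            float(cc.split("+")[0]) >= 8.0 for cc in archs.split(";") if cc.strip()
        )
    else:
        bf16_support = torch.cuda.is_bf16_supported()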

ds_report output

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu115
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed']
deepspeed info ................... 0.6.6+828ab718, 828ab718, master
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5

System info (please complete the following information):

  • OS: [e.g. Ubuntu 21.10]
  • Python version: 3.8

Thank you.

@jeffra

cc: @ydshieh, who originally reported this
