[BUG] NCCL rank 1 and rank 0 on same GPU during pytest for test_lr_scheduler.py - single CUDA GPU computer #2482

@Thomas-MMJ

Description

Describe the bug
When running

pytest ./tests/unit/checkpoint/test_lr_scheduler.py

it fails with:

_____________________________________________ TestLRSchedulerCheckpoint.test_checkpoint_lr_scheduler[0-False] ______________________________________________
Worker 0 exited with code 1
------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------
[2022-11-07 09:30:39,330] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------
Process Process-1:
Process Process-2:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/c/Users/username/DeepSpeed/tests/unit/common.py", line 166, in _dist_init
    dist.barrier()
  File "/mnt/c/Users/username/DeepSpeed/tests/unit/common.py", line 166, in _dist_init
    dist.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper
    return func(*args, **kwargs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper
    return func(*args, **kwargs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 458, in barrier
    return cdb.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 458, in barrier
    return cdb.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 153, in barrier
    return torch.distributed.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 153, in barrier
    return torch.distributed.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
    work = default_pg.barrier(opts=opts)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1667808731763/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1272, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1667808731763/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1272, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000

Note that NCCL can apparently be used on a single GPU, but only with a single rank:

It is no longer possible to have multiple ranks use the same GPU since NCCL 2.5. It will return an error.

NVIDIA/nccl#103
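For reference, the restriction can be reproduced with plain PyTorch, without DeepSpeed. Below is a minimal sketch assuming two ranks and a single visible GPU; the rendezvous address/port values are illustrative, not taken from the DeepSpeed test harness.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Illustrative rendezvous settings for init_process_group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(0)  # both ranks land on the only GPU
    # With NCCL >= 2.5 this barrier fails with
    # "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ..."
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)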

Note that the same failure occurs in many other tests (see the sketch after this list), such as:

tests/unit/checkpoint/test_moe_checkpoint.py
tests/unit/checkpoint/test_other_optimizer.py
tests/unit/checkpoint/test_pipeline.py
tests/unit/checkpoint/test_sparse.py
tests/unit/checkpoint/test_tag_validation.py
tests/unit/checkpoint/test_zero_optimizer.py
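All of these tests share the same pattern: the harness spawns world_size worker processes, and each worker initializes NCCL on its local CUDA device. On a single-GPU machine, any rank-to-device mapping collapses onto device 0, as in the hypothetical sketch below (illustrative only, not the actual tests/unit/common.py code):

import torch

def device_for_rank(local_rank: int) -> int:
    # Hypothetical mapping: wrap ranks onto the visible devices.
    # With torch.cuda.device_count() == 1, ranks 0 and 1 both map to
    # device 0, which triggers the "Duplicate GPU detected" error.
    return local_rank % torch.cuda.device_count()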

To Reproduce
Steps to reproduce the behavior:

Run

pytest ./tests/unit/checkpoint/test_lr_scheduler.py

Expected behavior

The test should pass.
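One way to get that behavior on single-GPU machines would be to skip the multi-GPU cases instead of failing. A sketch of such a guard follows; requires_multi_gpu is a hypothetical marker, not an existing DeepSpeed test fixture.

import pytest
import torch

# Hypothetical guard; the actual DeepSpeed test suite may handle this
# differently.
requires_multi_gpu = pytest.mark.skipif(
    torch.cuda.device_count() < 2,
    reason="NCCL >= 2.5 does not allow two ranks on one GPU",
)

@requires_multi_gpu
def test_checkpoint_lr_scheduler():
    ...  # body of the two-rank checkpoint test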

ds_report output

Note: this is running under WSL2 Ubuntu 20.04; the error occurs with both the pip install and an install from source, with or without the extensions precompiled.

 ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch']
torch version .................... 1.14.0.dev20221107
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed install path ........... ['/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.5+10e9d04c, 10e9d04c, master
deepspeed wheel compiled w. ...... torch 1.14, cuda 11.7

System info (please complete the following information):

  • OS: Ubuntu 20.04 under WSL2
  • GPU count and types: 1 GPU on a single machine, a 3060 (laptop variant). (I also have an integrated AMD GPU, but I don't think it is detected or relevant.)
  • (if applicable) Hugging Face Transformers/Accelerate/etc. versions: transformers and related Hugging Face libraries are at current HEAD
  • Python version: 3.9
