[BUG] NCCL rank 1 and rank 0 on same GPU during pytest for test_lr_scheduler.py - single CUDA GPU computer #2482

@Thomas-MMJ

Description

Describe the bug
When running

pytest ./tests/unit/checkpoint/test_lr_scheduler.py

it fails with:

_____________________________________________ TestLRSchedulerCheckpoint.test_checkpoint_lr_scheduler[0-False] ______________________________________________
Worker 0 exited with code 1
------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------
[2022-11-07 09:30:39,330] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------
Process Process-1:
Process Process-2:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/c/Users/username/DeepSpeed/tests/unit/common.py", line 166, in _dist_init
    dist.barrier()
  File "/mnt/c/Users/username/DeepSpeed/tests/unit/common.py", line 166, in _dist_init
    dist.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper
    return func(*args, **kwargs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper
    return func(*args, **kwargs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 458, in barrier
    return cdb.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 458, in barrier
    return cdb.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 153, in barrier
    return torch.distributed.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 153, in barrier
    return torch.distributed.barrier()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
    work = default_pg.barrier(opts=opts)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1667808731763/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1272, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1667808731763/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1272, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000

Note that NCCL can apparently be used on a single GPU, but only with a single rank:

It is no longer possible to have multiple ranks use the same GPU since NCCL 2.5. It will return an error.

NVIDIA/nccl#103
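For reference, the restriction can be reproduced with plain PyTorch, without DeepSpeed. Below is a minimal sketch assuming two ranks and a single visible GPU; the rendezvous address/port values are illustrative, not taken from the DeepSpeed test harness.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Illustrative rendezvous settings for init_process_group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(0)  # both ranks land on the only GPU
    # With NCCL >= 2.5 this barrier fails with
    # "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ..."
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)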

Note that the same failure occurs in many other tests (see the sketch after this list), such as:

tests/unit/checkpoint/test_moe_checkpoint.py
tests/unit/checkpoint/test_other_optimizer.py
tests/unit/checkpoint/test_pipeline.py
tests/unit/checkpoint/test_sparse.py
tests/unit/checkpoint/test_tag_validation.py
tests/unit/checkpoint/test_zero_optimizer.py
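All of these tests share the same pattern: the harness spawns world_size worker processes, and each worker initializes NCCL on its local CUDA device. On a single-GPU machine, any rank-to-device mapping collapses onto device 0, as in the hypothetical sketch below (illustrative only, not the actual tests/unit/common.py code):

import torch

def device_for_rank(local_rank: int) -> int:
    # Hypothetical mapping: wrap ranks onto the visible devices.
    # With torch.cuda.device_count() == 1, ranks 0 and 1 both map to
    # device 0, which triggers the "Duplicate GPU detected" error.
    return local_rank % torch.cuda.device_count()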

To Reproduce
Steps to reproduce the behavior:

Run

pytest ./tests/unit/checkpoint/test_lr_scheduler.py

Expected behavior

The test should pass.
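One way to get that behavior on single-GPU machines would be to skip the multi-GPU cases instead of failing. A sketch of such a guard follows; requires_multi_gpu is a hypothetical marker, not an existing DeepSpeed test fixture.

import pytest
import torch

# Hypothetical guard; the actual DeepSpeed test suite may handle this
# differently.
requires_multi_gpu = pytest.mark.skipif(
    torch.cuda.device_count() < 2,
    reason="NCCL >= 2.5 does not allow two ranks on one GPU",
)

@requires_multi_gpu
def test_checkpoint_lr_scheduler():
    ...  # body of the two-rank checkpoint test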

ds_report output

Note: this is running under WSL2 Ubuntu 20.04; the error occurs with both the pip install and an install from source, with or without the extensions precompiled.

 ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch']
torch version .................... 1.14.0.dev20221107
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed install path ........... ['/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.5+10e9d04c, 10e9d04c, master
deepspeed wheel compiled w. ...... torch 1.14, cuda 11.7

System info (please complete the following information):

  • OS: Ubuntu 20.04 under WSL2
  • GPU count and types: 1 GPU on a single machine, a 3060 (laptop variant). (I also have an integrated AMD GPU, but I don't think it is detected or relevant.)
  • (if applicable) Hugging Face Transformers/Accelerate/etc. versions: transformers and related Hugging Face libraries are at current HEAD
  • Python version: 3.9
