Describe the bug
When running
pytest ./tests/unit/checkpoint/test_lr_scheduler.py
it fails with:
_____________________________________________ TestLRSchedulerCheckpoint.test_checkpoint_lr_scheduler[0-False] ______________________________________________
Worker 0 exited with code 1
------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------
[2022-11-07 09:30:39,330] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------
Process Process-1:
Process Process-2:
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mnt/c/Users/username/DeepSpeed/tests/unit/common.py", line 166, in _dist_init
dist.barrier()
File "/mnt/c/Users/username/DeepSpeed/tests/unit/common.py", line 166, in _dist_init
dist.barrier()
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper
return func(*args, **kwargs)
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper
return func(*args, **kwargs)
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 458, in barrier
return cdb.barrier()
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 458, in barrier
return cdb.barrier()
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 153, in barrier
return torch.distributed.barrier()
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 153, in barrier
return torch.distributed.barrier()
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
work = default_pg.barrier(opts=opts)
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1667808731763/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1272, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1667808731763/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1272, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
Note that NCCL can apparently be used on a single GPU, but only with a single rank: since NCCL 2.5 it is no longer possible for multiple ranks to share the same GPU, and attempting to do so returns an error.
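For reference, the failure can be reproduced outside the DeepSpeed test suite with a minimal sketch (the address/port values are arbitrary): two ranks pinned to the same CUDA device trip the NCCL duplicate-GPU check at the first collective, just as the test's dist.barrier() does.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # On a 1-GPU machine both ranks map to the same physical device.
    torch.cuda.set_device(0)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    dist.barrier()  # NCCL >= 2.5 fails here: "Duplicate GPU detected"
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```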
Note that the same failure occurs in many other pytest files (a possible local workaround is sketched after this list), such as:
tests/unit/checkpoint/test_moe_checkpoint.py
tests/unit/checkpoint/test_other_optimizer.py
tests/unit/checkpoint/test_pipeline.py
tests/unit/checkpoint/test_sparse.py
tests/unit/checkpoint/test_tag_validation.py
tests/unit/checkpoint/test_zero_optimizer.py
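As a stopgap on a single-GPU machine, one option is to skip distributed tests whose requested world size exceeds the available GPU count. The sketch below is a hypothetical local conftest.py hook using a made-up world_size marker; DeepSpeed's own tests configure world size differently, so this would need adapting rather than being dropped in as-is:

```python
# conftest.py -- hypothetical local workaround, not part of DeepSpeed.
# Skips any test whose (made-up) "world_size" marker asks for more
# ranks than there are visible CUDA devices.
import pytest
import torch

def pytest_collection_modifyitems(config, items):
    available = torch.cuda.device_count()
    for item in items:
        marker = item.get_closest_marker("world_size")
        if marker and marker.args and marker.args[0] > available:
            item.add_marker(
                pytest.mark.skip(
                    reason=f"requires {marker.args[0]} GPUs, only {available} visible"
                )
            )
```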
To Reproduce
Steps to reproduce the behavior: run
pytest ./tests/unit/checkpoint/test_lr_scheduler.py
Expected behavior
The test should pass.
ds_report output
Note: this is running under WSL2 Ubuntu 20.04; the failure occurs with both the pip install and the install from source, with or without the extensions precompiled.
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch']
torch version .................... 1.14.0.dev20221107
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed install path ........... ['/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.5+10e9d04c, 10e9d04c, master
deepspeed wheel compiled w. ...... torch 1.14, cuda 11.7
System info:
- OS: Ubuntu 20.04 under WSL2
- GPU count and types: 1 GPU on a single machine, an RTX 3060 (Laptop variant). (Note: there is also an integrated AMD GPU, but I don't think it is detected or relevant.)
- Hugging Face Transformers/Accelerate/etc. versions: huggingface and transformers are at current HEAD
- Python version: 3.9