Tests should fail indicating actual number of GPUs is below desired world_size #2733

@clumsy

Description

I faced this cryptic error while running tests on a device with a single GPU.

DeepSpeed: master
PyTorch: 1.12.1
NCCL: 2.10.3

Current Behavior

Steps to reproduce:

  1. pytest tests/unit/checkpoint/test_moe_checkpoint.py -k 'test_checkpoint_moe_and_zero'
  2. Observe:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/tests/unit/common.py", line 165, in _dist_init
    dist.barrier()
  File "/home/azzhipa/workspace/m5/DeepSpeed/tests/unit/common.py", line 165, in _dist_init
    dist.barrier()
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 469, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 469, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/torch.py", line 160, in barrier
    return torch.distributed.barrier(group=group,
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/torch.py", line 160, in barrier
    return torch.distributed.barrier(group=group,
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2786, in barrier
    work = group.barrier(opts=opts)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2786, in barrier
    work = group.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

It's hard to figure out from this output why the tests failed; I was only able to fix it after seeing #2482.

In this particular case there is no indication of what's wrong, even though at runtime we know both the desired world_size and the actual number of available devices.

Expected Behavior

The test should fail with an explicit message, something like num_gpus < world_size.
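
For illustration, here is a minimal sketch of the kind of guard that could run before the worker processes are spawned. The helper name check_enough_gpus and the exact hook point (e.g. in tests/unit/common.py) are assumptions, not the actual DeepSpeed test API:

import pytest
import torch

def check_enough_gpus(world_size: int) -> None:
    # Fail early with an explicit message instead of letting NCCL raise
    # a cryptic ncclInvalidUsage error inside dist.barrier().
    num_gpus = torch.cuda.device_count()
    if num_gpus < world_size:
        pytest.fail(
            f"Requested world_size={world_size} but only {num_gpus} GPU(s) "
            "are visible; reduce world_size or run on a machine with more GPUs.")

pytest.skip could be used instead of pytest.fail if skipping oversized configurations on small machines is preferred over failing them.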

Metadata

Labels

bug (Something isn't working), training

