Tests should fail indicating actual number of GPUs is below desired world_size #2733

@clumsy

Description

I faced this cryptic error while running tests on a device with a single GPU.

DeepSpeed: master
PyTorch: 1.12.1
NCCL: 2.10.3

Current Behavior

Steps to reproduce:

  1. pytest tests/unit/checkpoint/test_moe_checkpoint.py -k 'test_checkpoint_moe_and_zero'
  2. Observe:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/tests/unit/common.py", line 165, in _dist_init
    dist.barrier()
  File "/home/azzhipa/workspace/m5/DeepSpeed/tests/unit/common.py", line 165, in _dist_init
    dist.barrier()
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 469, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 469, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/torch.py", line 160, in barrier
    return torch.distributed.barrier(group=group,
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/torch.py", line 160, in barrier
    return torch.distributed.barrier(group=group,
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2786, in barrier
    work = group.barrier(opts=opts)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2786, in barrier
    work = group.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

It's hard to figure out from this output why the tests failed; I was only able to fix it after seeing #2482.

In this particular case there is no indication of what's wrong, even though at runtime we know both the desired world_size and the actual number of available devices.

Expected Behavior

The test should fail with an explicit message, something like num_gpus < world_size.
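
For illustration, here is a minimal sketch of the kind of guard that could run before the worker processes are spawned. The helper name check_enough_gpus and the exact hook point (e.g. in tests/unit/common.py) are assumptions, not the actual DeepSpeed test API:

import pytest
import torch

def check_enough_gpus(world_size: int) -> None:
    # Fail early with an explicit message instead of letting NCCL raise
    # a cryptic ncclInvalidUsage error inside dist.barrier().
    num_gpus = torch.cuda.device_count()
    if num_gpus < world_size:
        pytest.fail(
            f"Requested world_size={world_size} but only {num_gpus} GPU(s) "
            "are visible; reduce world_size or run on a machine with more GPUs.")

pytest.skip could be used instead of pytest.fail if skipping oversized configurations on small machines is preferred over failing them.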

Metadata

Labels

bug (Something isn't working), training

