
RuntimeError: NCCL Error 1: unhandled cuda error #11756

@chenwc07

Issue description

I get "RuntimeError: NCCL Error 1: unhandled cuda error" when using DataParallel.
I wonder what is wrong, because the same model works when using only one GPU, and CUDA 9 and CUDA 8 show the same problem.

Code example

I ran:

testdata = torch.rand(12, 3, 112, 112)
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()
out = model(testdata)

Then I got:
RuntimeError                              Traceback (most recent call last)
 in ()
----> 1 out = model(testdata)

/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475                 result = self._slow_forward(*input, **kwargs)
    476             else:
--> 477                 result = self.forward(*input, **kwargs)
    478             for hook in self._forward_hooks.values():
    479                 hook_result = hook(self, input, result)

/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    120         if len(self.device_ids) == 1:
    121             return self.module(*inputs[0], **kwargs[0])
--> 122         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
    123         outputs = self.parallel_apply(replicas, inputs, kwargs)
    124         return self.gather(outputs, self.output_device)

/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py in replicate(self, module, device_ids)
    125
    126     def replicate(self, module, device_ids):
--> 127         return replicate(module, device_ids)
    128
    129     def scatter(self, inputs, kwargs, device_ids):

/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/parallel/replicate.py in replicate(network, devices, detach)
     10     params = list(network.parameters())
     11     param_indices = {param: idx for idx, param in enumerate(params)}
---> 12     param_copies = Broadcast.apply(devices, *params)
     13     if len(params) > 0:
     14         param_copies = [param_copies[i:i + len(params)]

/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/parallel/_functions.py in forward(ctx, target_gpus, *inputs)
     17         ctx.num_inputs = len(inputs)
     18         ctx.input_device = inputs[0].get_device()
---> 19         outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
     20         non_differentiables = []
     21         for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):

/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/cuda/comm.py in broadcast_coalesced(tensors, devices, buffer_size)
     38         corresponding to indices from devices.
     39     """
---> 40     return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
     41
     42

RuntimeError: NCCL Error 1: unhandled cuda error
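The traceback bottoms out in torch.cuda.comm.broadcast_coalesced, i.e. the NCCL broadcast of the model parameters to the other GPUs is what fails, before any forward computation runs. A minimal sketch (not from the original report; device ids 0 and 1 are an assumption) that exercises the same broadcast primitive directly, without any model involved:

```python
import torch
import torch.cuda.comm

# Debugging sketch: call the broadcast primitive that DataParallel's
# replicate() relies on, outside of any model code, to check whether
# plain GPU-to-GPU communication works at all on this machine.
# Guarded so it is a no-op on machines with fewer than 2 GPUs.
if torch.cuda.device_count() >= 2:
    t = torch.rand(4, 4, device="cuda:0")
    copies = torch.cuda.comm.broadcast(t, devices=[0, 1])
    # If NCCL is healthy this returns one copy per device;
    # if not, it should reproduce the "unhandled cuda error" here.
    print([c.device for c in copies])
else:
    print("fewer than 2 GPUs visible; broadcast check skipped")
```

If this small broadcast already raises the same error, the problem is in the NCCL/driver setup rather than in the model or DataParallel usage.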

System Info

  • PyTorch
  • How you installed PyTorch: conda
  • OS: Ubuntu Server 16.04
  • PyTorch version: 0.4.1
  • Python version: 3.5
  • CUDA/cuDNN version: cuda8/cudnn6 and cuda9/cudnn7 both show the same problem
  • GPU models and configuration: 4 × Nvidia Titan Xp
  • GCC version (if compiling from source): 5.4
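When reproducing this, NCCL's own logging usually pinpoints which underlying CUDA call fails. A sketch using NCCL's standard debug environment variable (this is a general NCCL debugging aid, not from the original report; repro.py is a hypothetical script containing the DataParallel snippet above):

```shell
# Enable NCCL debug logging before running the repro script.
# NCCL_DEBUG=INFO is a standard NCCL environment variable that makes
# NCCL print which CUDA call returned the error.
export NCCL_DEBUG=INFO
# python repro.py   <- hypothetical script with the snippet above
echo "$NCCL_DEBUG"
```

The resulting log lines (prefixed with "NCCL INFO" / "NCCL WARN") typically name the failing CUDA call, which narrows the problem down to driver, peer-to-peer access, or memory issues.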
