RuntimeError: NCCL Error 1: unhandled cuda error #11756
Issue description
I get RuntimeError: NCCL Error 1: unhandled cuda error when using DataParallel.
I wonder what is wrong here, because the same model runs fine on a single GPU, and both CUDA 8 and CUDA 9 show the same problem.
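Since the model runs on a single GPU, a quick way to narrow this down is to confirm that every device works in isolation before NCCL gets involved. This is a minimal sanity-check sketch of my own (not from the original report), assuming the four visible GPUs listed under System Info:

import torch

# Touch each visible GPU individually. If any single device fails
# here, the problem is in the CUDA/driver setup rather than in
# DataParallel or NCCL.
for i in range(torch.cuda.device_count()):
    x = torch.rand(8, 8, device='cuda:{}'.format(i))
    print('cuda:{} ok, sum = {:.4f}'.format(i, x.sum().item()))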
Code example
I ran:
import torch

# model is an existing nn.Module instance
testdata = torch.rand(12, 3, 112, 112)
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()
out = model(testdata)
Then I got:
RuntimeError Traceback (most recent call last)
<ipython-input-…> in <module>()
----> 1 out = model(testdata)
/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)
/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
120 if len(self.device_ids) == 1:
121 return self.module(*inputs[0], **kwargs[0])
--> 122 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
123 outputs = self.parallel_apply(replicas, inputs, kwargs)
124 return self.gather(outputs, self.output_device)
/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py in replicate(self, module, device_ids)
125
126 def replicate(self, module, device_ids):
--> 127 return replicate(module, device_ids)
128
129 def scatter(self, inputs, kwargs, device_ids):
/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/parallel/replicate.py in replicate(network, devices, detach)
10 params = list(network.parameters())
11 param_indices = {param: idx for idx, param in enumerate(params)}
---> 12 param_copies = Broadcast.apply(devices, *params)
13 if len(params) > 0:
14 param_copies = [param_copies[i:i + len(params)]
/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/parallel/_functions.py in forward(ctx, target_gpus, *inputs)
17 ctx.num_inputs = len(inputs)
18 ctx.input_device = inputs[0].get_device()
---> 19 outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
20 non_differentiables = []
21 for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):
/home/zhangd/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/cuda/comm.py in broadcast_coalesced(tensors, devices, buffer_size)
38 corresponding to indices from devices.
39 """
---> 40 return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
41
42
RuntimeError: NCCL Error 1: unhandled cuda error
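The traceback bottoms out in torch._C._broadcast_coalesced, i.e. an NCCL broadcast of the model parameters from GPU 0 to the other devices. That path can be exercised without any model at all; the following is a minimal sketch (my own, assuming the same four visible GPUs) using the public torch.cuda.comm API:

import torch
import torch.cuda.comm as comm

# Broadcast one parameter-sized tensor from GPU 0 to all four GPUs.
# This mirrors what DataParallel.replicate() does internally, so the
# NCCL error should be reproducible here if it is reproducible at all.
t = torch.rand(1024, device='cuda:0')
copies = comm.broadcast(t, [0, 1, 2, 3])
print([c.device for c in copies])

If this already raises the same RuntimeError, the issue is in the NCCL/CUDA installation rather than in the model code.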
System Info
- PyTorch or Caffe2: PyTorch
- How you installed PyTorch: conda
- OS: Ubuntu Server 16.04
- PyTorch version: 0.4.1
- Python version: 3.5
- CUDA/cuDNN version: CUDA 8 + cuDNN 6 and CUDA 9 + cuDNN 7 (both show the same problem)
- GPU models and configuration: 4 × NVIDIA Titan Xp
- GCC version (if compiling from source): 5.4
- CMake version:
- Versions of any other relevant libraries:
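For completeness, the version details above can be collected directly from Python; a short sketch using only stock PyTorch calls:

import torch

# Print the environment details asked for by the issue template.
print('PyTorch:', torch.__version__)
print('CUDA:', torch.version.cuda)
print('cuDNN:', torch.backends.cudnn.version())
print('GPU count:', torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print('  cuda:{}: {}'.format(i, torch.cuda.get_device_name(i)))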