
Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch) #2119

@asaparov

Description


I am trying to get multi-node inference working with 4 nodes, each with 4x RTX 8000 GPUs (48 GB per GPU). I launch the script with:

```
deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
```

(source for the script is here)
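The hostfile lists one node per line in the standard DeepSpeed format, with `slots` set to the number of GPUs per node; for this setup it looks like:

```
gr061 slots=4
gr062 slots=4
gr063 slots=4
gr064 slots=4
```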

The script finishes loading all the checkpoints and begins inference, but then quickly runs into the following error. All four ranks on gr064 print the same traceback interleaved; I've deinterleaved it and kept a single copy of the subsequent c10::CUDAError stack for readability:

...
gr061: loading checkpoint (68)
gr061: loading checkpoint (69)
gr061: loading checkpoint (70)
gr063: [2022-07-20 19:03:10,723] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: loading checkpoint (71)
gr061: [2022-07-20 19:03:11,443] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: *** Starting to generate 100 tokens with bs=1
gr061: Generate args {'max_new_tokens': 100, 'do_sample': False}
gr064: [2022-07-20 19:03:12,551] [INFO] [engine.py:144:__init__] Place model to device: 3
gr061: [2022-07-20 19:03:13,294] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:14,244] [INFO] [engine.py:144:__init__] Place model to device: 2
gr062: [2022-07-20 19:03:14,406] [INFO] [engine.py:144:__init__] Place model to device: 0
gr063: [2022-07-20 19:03:14,791] [INFO] [engine.py:144:__init__] Place model to device: 2
gr064: [2022-07-20 19:03:15,444] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,542] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,618] [INFO] [engine.py:144:__init__] Place model to device: 1
gr062: [2022-07-20 19:03:16,179] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:16,513] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: [2022-07-20 19:03:16,777] [INFO] [engine.py:144:__init__] Place model to device: 0
gr064: [2022-07-20 19:03:17,541] [INFO] [engine.py:144:__init__] Place model to device: 1
gr063: [2022-07-20 19:03:18,336] [INFO] [engine.py:144:__init__] Place model to device: 3
gr063: [2022-07-20 19:03:18,547] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: Traceback (most recent call last):
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064:     _ = generate()
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064:     outputs = model.generate(**input_tokens, **generate_kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064:     return func(*args, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:     outputs = self(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:     outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:     transformer_outputs = self.transformer(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064:     outputs = block(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:     self.attention(input,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     return cdb.all_reduce(tensor, op, group, async_op)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:     work = group.allreduce([tensor], opts)
gr064: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fae5f70b477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7fae8ccfc4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7fae8cd02417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7fae9f4f0c68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fae5f6eed95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7fae9f3e5b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7fae9f719fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7fae9f71a2c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x55ccd72e1e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x55ccd72eead8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x55ccd73027ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x55ccd72d6661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x55ccd72dc81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x55ccd73ceaec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x55ccd73cdf56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x55ccd73c12b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x55ccd7393b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7faee4a9a0b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x55ccd7393a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: [2022-07-20 19:03:32,219] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678791
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678792
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678793
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678794
gr064: [2022-07-20 19:03:32,220] [ERROR] [launch.py:184:sigkill_handler] ['/ext3/miniconda3/bin/python3.9', '-u', 'Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py', '--local_rank=3', '--name', 'bigscience/bloom'] exits with return code = -6
pdsh@gr061: gr064: ssh exited with exit code 250
pdsh@gr061: gr062: ssh exited with exit code 250
pdsh@gr061: gr061: ssh exited with exit code 250

I've tried with both CUDA 10.2 and CUDA 11.6, and there's no difference.
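To help isolate whether the problem is in the model code or in the cross-node NCCL/CUDA stack, a minimal all_reduce smoke test along these lines (just a sketch; the file name is mine) can be launched with the same hostfile:

```python
# nccl_smoke_test.py -- one all_reduce across all ranks, no model involved.
# Launch: deepspeed --hostfile=$hostfile nccl_smoke_test.py
import os

import deepspeed
import torch

deepspeed.init_distributed()                # NCCL backend by default
local_rank = int(os.environ["LOCAL_RANK"])  # set by the deepspeed launcher
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
torch.distributed.all_reduce(x)             # defaults to SUM over the world group
# Each rank should print the world size (16 here: 4 nodes x 4 GPUs).
print(f"rank {torch.distributed.get_rank()}: {x.item()}")
```

If this also dies with ncclUnhandledCudaError, the failure is in the NCCL/CUDA setup across nodes rather than in DeepSpeed-Inference itself.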
