I am trying to get multi-node BLOOM inference working on 4 nodes, each with 4x RTX 8000 GPUs (48 GB per GPU). I launch it with:

deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom

(the source for the script is here)
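For reference, the hostfile passed via --hostfile is a standard DeepSpeed hostfile listing the nodes and their GPU slots; with the four hostnames that appear in the log below and 4 GPUs per node, it looks roughly like this:

gr061 slots=4
gr062 slots=4
gr063 slots=4
gr064 slots=4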
The script finishes loading all the checkpoints and begins inference but then quickly runs into the following error:
...
gr061: loading checkpoint (68)
gr061: loading checkpoint (69)
gr061: loading checkpoint (70)
gr063: [2022-07-20 19:03:10,723] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: loading checkpoint (71)
gr061: [2022-07-20 19:03:11,443] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: *** Starting to generate 100 tokens with bs=1
gr061: Generate args {'max_new_tokens': 100, 'do_sample': False}
gr064: [2022-07-20 19:03:12,551] [INFO] [engine.py:144:__init__] Place model to device: 3
gr061: [2022-07-20 19:03:13,294] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:14,244] [INFO] [engine.py:144:__init__] Place model to device: 2
gr062: [2022-07-20 19:03:14,406] [INFO] [engine.py:144:__init__] Place model to device: 0
gr063: [2022-07-20 19:03:14,791] [INFO] [engine.py:144:__init__] Place model to device: 2
gr064: [2022-07-20 19:03:15,444] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,542] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,618] [INFO] [engine.py:144:__init__] Place model to device: 1
gr062: [2022-07-20 19:03:16,179] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:16,513] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: [2022-07-20 19:03:16,777] [INFO] [engine.py:144:__init__] Place model to device: 0
gr064: [2022-07-20 19:03:17,541] [INFO] [engine.py:144:__init__] Place model to device: 1
gr063: [2022-07-20 19:03:18,336] [INFO] [engine.py:144:__init__] Place model to device: 3
gr063: [2022-07-20 19:03:18,547] [INFO] [engine.py:144:__init__] Place model to device: 1
All four ranks on gr064 raise the same exception (their tracebacks are interleaved in the raw output; a single copy follows):
gr064: Traceback (most recent call last):
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064:     _ = generate()
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064:     outputs = model.generate(**input_tokens, **generate_kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064:     return func(*args, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:     outputs = self(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:     outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:     transformer_outputs = self.transformer(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064:     outputs = block(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:     self.attention(input,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     return cdb.all_reduce(tensor, op, group, async_op)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:     work = group.allreduce([tensor], opts)
gr064: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064: what(): CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fae5f70b477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7fae8ccfc4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7fae8cd02417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7fae9f4f0c68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fae5f6eed95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7fae9f3e5b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7fae9f719fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7fae9f71a2c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x55ccd72e1e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x55ccd72eead8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x55ccd73027ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x55ccd72d6661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x55ccd72dc81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x55ccd73ceaec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x55ccd73cdf56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x55ccd73c12b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x55ccd7393b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7faee4a9a0b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x55ccd7393a81 in /ext3/miniconda3/bin/python3.9)
(The same c10::CUDAError abort, with an identical stack trace, is printed by each of the other three ranks on gr064.)
gr064: [2022-07-20 19:03:32,219] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678791
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678792
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678793
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678794
gr064: [2022-07-20 19:03:32,220] [ERROR] [launch.py:184:sigkill_handler] ['/ext3/miniconda3/bin/python3.9', '-u', 'Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py', '--local_rank=3', '--name', 'bigscience/bloom'] exits with return code = -6
pdsh@gr061: gr064: ssh exited with exit code 250
pdsh@gr061: gr062: ssh exited with exit code 250
pdsh@gr061: gr061: ssh exited with exit code 250
I've tried with both CUDA 10.2 and CUDA 11.6, and there's no difference.
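As a next step, the error message itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1, and NCCL_DEBUG=INFO should surface more NCCL detail. If I understand the DeepSpeed multi-node launcher correctly, these environment variables can be propagated to the remote ranks via a .deepspeed_env file, roughly like this (sketch, not yet verified on my setup):

# contents of ~/.deepspeed_env (one VAR=value per line, picked up by the multi-node launcher)
NCCL_DEBUG=INFO
CUDA_LAUNCH_BLOCKING=1

and then rerunning the same command:

deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom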