DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 0, nvl: 4, src RDMA lane: 1, dst NVL: 2, meta: 0, 0, 0, 0
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f718176c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f71817166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7181b73a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f7181b3a92e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f7181b3ba57 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f7181b3bc5f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f718059af70 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f718174d69f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f718174637b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7181746529 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f7180861a98 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f7180861de6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181758 (0x5570898a4758 in /usr/bin/python)
frame #13: <unknown function> + 0x1949e8 (0x5570898b79e8 in /usr/bin/python)
frame #14: <unknown function> + 0x1949fc (0x5570898b79fc in /usr/bin/python)
frame #15: <unknown function> + 0x1949fc (0x5570898b79fc in /usr/bin/python)
frame #16: <unknown function> + 0x1a08bf (0x5570898c38bf in /usr/bin/python)
frame #17: <unknown function> + 0x15f9d6 (0x5570898829d6 in /usr/bin/python)
frame #18: <unknown function> + 0x2941a7 (0x5570899b71a7 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x5757 (0x55708989da27 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x818 (0x557089898ae8 in /usr/bin/python)
frame #22: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6d2 (0x5570898989a2 in /usr/bin/python)
frame #24: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1a22 (0x557089899cf2 in /usr/bin/python)
frame #26: <unknown function> + 0x25ae56 (0x55708997de56 in /usr/bin/python)
frame #27: PyEval_EvalCode + 0x86 (0x55708997dd26 in /usr/bin/python)
frame #28: <unknown function> + 0x281ae8 (0x5570899a4ae8 in /usr/bin/python)
frame #29: <unknown function> + 0x27c2ef (0x55708999f2ef in /usr/bin/python)
frame #30: PyRun_StringFlags + 0x81 (0x557089998f61 in /usr/bin/python)
frame #31: PyRun_SimpleStringFlags + 0x41 (0x557089998e11 in /usr/bin/python)
frame #32: Py_RunMain + 0x3d0 (0x557089998140 in /usr/bin/python)
frame #33: Py_BytesMain + 0x2d (0x557089971d6d in /usr/bin/python)
frame #34: <unknown function> + 0x29d90 (0x7f7182671d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: __libc_start_main + 0x80 (0x7f7182671e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x25 (0x557089971c65 in /usr/bin/python)
When I run the cross-node test on two H20 nodes with
MASTER_ADDR=<ip> MASTER_PORT=30001 WORLD_SIZE=2 RANK=0 python test_internode.py
I get the timeout log above.
This issue only happens after the Multi-QP patch (5ab80c2) was merged.
It is probably related to multi-QP.