[test_internode.py] fails with multi-QP: dispatch timeout on RoCE network when testing on 2*H20 nodes #137

@jeffye-dev

When I run the cross-node test with MASTER_ADDR=<ip> MASTER_PORT=30001 WORLD_SIZE=2 RANK=0 python test_internode.py on 2 H20 nodes, I get the following timeout log:

DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 0, nvl: 4, src RDMA lane: 1, dst NVL: 2, meta: 0, 0, 0, 0

terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f718176c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f71817166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7181b73a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f7181b3a92e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f7181b3ba57 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f7181b3bc5f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f718059af70 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f718174d69f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f718174637b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7181746529 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f7180861a98 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f7180861de6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181758 (0x5570898a4758 in /usr/bin/python)
frame #13: <unknown function> + 0x1949e8 (0x5570898b79e8 in /usr/bin/python)
frame #14: <unknown function> + 0x1949fc (0x5570898b79fc in /usr/bin/python)
frame #15: <unknown function> + 0x1949fc (0x5570898b79fc in /usr/bin/python)
frame #16: <unknown function> + 0x1a08bf (0x5570898c38bf in /usr/bin/python)
frame #17: <unknown function> + 0x15f9d6 (0x5570898829d6 in /usr/bin/python)
frame #18: <unknown function> + 0x2941a7 (0x5570899b71a7 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x5757 (0x55708989da27 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x818 (0x557089898ae8 in /usr/bin/python)
frame #22: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6d2 (0x5570898989a2 in /usr/bin/python)
frame #24: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1a22 (0x557089899cf2 in /usr/bin/python)
frame #26: <unknown function> + 0x25ae56 (0x55708997de56 in /usr/bin/python)
frame #27: PyEval_EvalCode + 0x86 (0x55708997dd26 in /usr/bin/python)
frame #28: <unknown function> + 0x281ae8 (0x5570899a4ae8 in /usr/bin/python)
frame #29: <unknown function> + 0x27c2ef (0x55708999f2ef in /usr/bin/python)
frame #30: PyRun_StringFlags + 0x81 (0x557089998f61 in /usr/bin/python)
frame #31: PyRun_SimpleStringFlags + 0x41 (0x557089998e11 in /usr/bin/python)
frame #32: Py_RunMain + 0x3d0 (0x557089998140 in /usr/bin/python)
frame #33: Py_BytesMain + 0x2d (0x557089971d6d in /usr/bin/python)
frame #34: <unknown function> + 0x29d90 (0x7f7182671d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: __libc_start_main + 0x80 (0x7f7182671e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x25 (0x557089971c65 in /usr/bin/python)

This issue only occurs after the multi-QP patch (5ab80c2) was merged, so it is probably related to multi-QP.
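
To compare the timeout lines above across ranks and channels, a small parsing sketch may help. It is not part of DeepEP: the helper name `parse_timeout_line` is hypothetical, and it assumes the lines keep the "kind, key: value, ..." layout shown in the log.

```python
import re

# Hypothetical helper, not part of DeepEP. Assumes each timeout line has the
# form "<kind>, key: value, key: value, ..." as in the log above, where a
# value is one integer or a comma-separated list of integers.
_FIELD = re.compile(
    r"([A-Za-z][A-Za-z ]*?):\s*(-?\d+(?:,\s*-?\d+)*)(?=,\s*[A-Za-z]|$)"
)

def parse_timeout_line(line: str) -> dict:
    """Split one timeout line into its kind and its integer fields."""
    # Everything before the first ", " is the message kind.
    head, _, rest = line.strip().partition(", ")
    fields = {"kind": head}
    for key, value in _FIELD.findall(rest):
        nums = [int(v) for v in value.split(",")]
        # Single values stay scalar; multi-value fields (e.g. meta) become lists.
        fields[key.strip()] = nums[0] if len(nums) == 1 else nums
    return fields

if __name__ == "__main__":
    sample = (
        "DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 0, "
        "nvl: 4, src RDMA lane: 1, dst NVL: 2, meta: 0, 0, 0, 0"
    )
    print(parse_timeout_line(sample))
```

Grouping the parsed fields by channel and source rank should make it easier to see which RDMA lane never delivered its metadata.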
