
[Bug] test_internode.py timeout on 4 * 8 H20 #158

@cscyuge

Description


test_internode.py passes on 2 * 8 H20, but times out on 4 * 8 H20.
I have already modified kNumWarpsPerGroup and kNumWarpGroups in DeepEP/csrc/kernels/internode_ll.cu to 8 and 4, respectively (referring to #15 (comment)).

script:

# node 0
NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=my_master_node WORLD_SIZE=4 RANK=0 python test_internode.py
# node 1
NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=my_master_node WORLD_SIZE=4 RANK=1 python test_internode.py
# node 2
NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=my_master_node WORLD_SIZE=4 RANK=2 python test_internode.py
# node 3
NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=my_master_node WORLD_SIZE=4 RANK=3 python test_internode.py
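For clarity, the launch above uses WORLD_SIZE/RANK at node granularity while the test spawns one process per GPU. A minimal sketch of how node rank and local GPU index combine into a global rank, assuming 8 GPUs per node (the helper name here is illustrative, not part of DeepEP's API):

```python
GPUS_PER_NODE = 8   # H20 nodes in this report have 8 GPUs each
NUM_NODES = 4       # WORLD_SIZE in the launch scripts above

def global_rank(node_rank: int, local_rank: int) -> int:
    """Global rank of a GPU process given its node rank (RANK env var)
    and its local GPU index within the node."""
    return node_rank * GPUS_PER_NODE + local_rank

# Node 1, local GPU 7 -> global rank 15 (the failing "Process 7" below
# is the spawn index on its node, not the global rank).
assert global_rank(1, 7) == 15
# Full job spans 4 * 8 = 32 global ranks, 0..31.
assert global_rank(NUM_NODES - 1, GPUS_PER_NODE - 1) == 31
```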

logs on node 1:

[config] num_tokens=4096, hidden=7168, num_topk_groups=4, num_topk=8
[layout] Kernel performance: 0.055 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 29.82 GB/s (RDMA), 59.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 30.19 GB/s (RDMA), 60.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 29.22 GB/s (RDMA), 58.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 29.30 GB/s (RDMA), 58.45 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 27.59 GB/s (RDMA), 55.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 29.90 GB/s (RDMA), 59.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 30.81 GB/s (RDMA), 61.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 30.61 GB/s (RDMA), 61.06 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 30.88 GB/s (RDMA), 61.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 21.77 GB/s (RDMA), 43.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 28.78 GB/s (RDMA), 57.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 30.03 GB/s (RDMA), 59.91 GB/s (NVL)
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 1, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 2, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 2, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 1, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 2, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 7, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 7, src RDMA: 1, src nvl: 5, start: 0, end: 0
...(similar logs)
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 3, meta: 0, 0, 0, 0
...(similar logs)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5b5696c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5b56915a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5b56dcc918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x20d8e (0x7f5b56d92d8e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22507 (0x7f5b56d94507 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2270f (0x7f5b56d9470f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x6417b2 (0x7f5b4e2d07b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f30f (0x7f5b5694d30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f5b5694633b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f5b569464e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8fefb8 (0x7f5b4e58dfb8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f5b4e58e306 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181370 (0x560129f7a370 in /usr/bin/python)
frame #13: <unknown function> + 0x194588 (0x560129f8d588 in /usr/bin/python)
frame #14: <unknown function> + 0x19459c (0x560129f8d59c in /usr/bin/python)
frame #15: <unknown function> + 0x19459c (0x560129f8d59c in /usr/bin/python)
frame #16: <unknown function> + 0x1a04af (0x560129f994af in /usr/bin/python)
frame #17: <unknown function> + 0x15f986 (0x560129f58986 in /usr/bin/python)
frame #18: <unknown function> + 0x292ec7 (0x56012a08bec7 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x54c7 (0x560129f73737 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x560129f8466c in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x804 (0x560129f6ea74 in /usr/bin/python)
frame #22: _PyFunction_Vectorcall + 0x7c (0x560129f8466c in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6bf (0x560129f6e92f in /usr/bin/python)
frame #24: _PyFunction_Vectorcall + 0x7c (0x560129f8466c in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x18d3 (0x560129f6fb43 in /usr/bin/python)
frame #26: <unknown function> + 0x259f56 (0x56012a052f56 in /usr/bin/python)
frame #27: PyEval_EvalCode + 0x86 (0x56012a052e26 in /usr/bin/python)
frame #28: <unknown function> + 0x280808 (0x56012a079808 in /usr/bin/python)
frame #29: <unknown function> + 0x27b00f (0x56012a07400f in /usr/bin/python)
frame #30: PyRun_StringFlags + 0x81 (0x56012a06dd91 in /usr/bin/python)
frame #31: PyRun_SimpleStringFlags + 0x41 (0x56012a06dc41 in /usr/bin/python)
frame #32: Py_RunMain + 0x3d0 (0x56012a06cf70 in /usr/bin/python)
frame #33: Py_BytesMain + 0x2d (0x56012a046e6d in /usr/bin/python)
frame #34: <unknown function> + 0x29d90 (0x7f5b5774dd90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #35: __libc_start_main + 0x80 (0x7f5b5774de40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x25 (0x56012a046d65 in /usr/bin/python)

W0513 06:52:49.383000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188788 via signal SIGTERM
W0513 06:52:49.384000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188789 via signal SIGTERM
W0513 06:52:49.385000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188790 via signal SIGTERM
W0513 06:52:49.387000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188791 via signal SIGTERM
W0513 06:52:49.388000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188792 via signal SIGTERM
W0513 06:52:49.391000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188793 via signal SIGTERM
W0513 06:52:49.393000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188794 via signal SIGTERM
Traceback (most recent call last):
  File "/mnt/yscfs/linjunxian/DeepEP/tests/test_internode.py", line 247, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 215, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 7 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/mnt/yscfs/linjunxian/DeepEP/tests/test_internode.py", line 235, in test_loop
    test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
  File "/mnt/yscfs/linjunxian/DeepEP/tests/test_internode.py", line 179, in test_main
    t = bench(lambda: buffer.dispatch(**tune_args))[0]
  File "/mnt/yscfs/linjunxian/DeepEP/tests/utils.py", line 96, in bench
    torch.cuda.synchronize()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 985, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
