[config] num_tokens=4096, hidden=7168, num_topk_groups=4, num_topk=8
[layout] Kernel performance: 0.055 ms
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 29.82 GB/s (RDMA), 59.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 30.19 GB/s (RDMA), 60.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 29.22 GB/s (RDMA), 58.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 29.30 GB/s (RDMA), 58.45 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 27.59 GB/s (RDMA), 55.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 29.90 GB/s (RDMA), 59.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 30.81 GB/s (RDMA), 61.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 30.61 GB/s (RDMA), 61.06 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 30.88 GB/s (RDMA), 61.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 21.77 GB/s (RDMA), 43.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 28.78 GB/s (RDMA), 57.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 30.03 GB/s (RDMA), 59.91 GB/s (NVL)
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 1, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 2, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 2, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 1, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 2, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 7, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 7, src RDMA: 1, src nvl: 5, start: 0, end: 0
...(similar logs)
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 2, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 3, meta: 0, 0, 0, 0
...(similar logs)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5b5696c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5b56915a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5b56dcc918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x20d8e (0x7f5b56d92d8e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22507 (0x7f5b56d94507 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2270f (0x7f5b56d9470f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x6417b2 (0x7f5b4e2d07b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f30f (0x7f5b5694d30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f5b5694633b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f5b569464e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8fefb8 (0x7f5b4e58dfb8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f5b4e58e306 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181370 (0x560129f7a370 in /usr/bin/python)
frame #13: <unknown function> + 0x194588 (0x560129f8d588 in /usr/bin/python)
frame #14: <unknown function> + 0x19459c (0x560129f8d59c in /usr/bin/python)
frame #15: <unknown function> + 0x19459c (0x560129f8d59c in /usr/bin/python)
frame #16: <unknown function> + 0x1a04af (0x560129f994af in /usr/bin/python)
frame #17: <unknown function> + 0x15f986 (0x560129f58986 in /usr/bin/python)
frame #18: <unknown function> + 0x292ec7 (0x56012a08bec7 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x54c7 (0x560129f73737 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x560129f8466c in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x804 (0x560129f6ea74 in /usr/bin/python)
frame #22: _PyFunction_Vectorcall + 0x7c (0x560129f8466c in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6bf (0x560129f6e92f in /usr/bin/python)
frame #24: _PyFunction_Vectorcall + 0x7c (0x560129f8466c in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x18d3 (0x560129f6fb43 in /usr/bin/python)
frame #26: <unknown function> + 0x259f56 (0x56012a052f56 in /usr/bin/python)
frame #27: PyEval_EvalCode + 0x86 (0x56012a052e26 in /usr/bin/python)
frame #28: <unknown function> + 0x280808 (0x56012a079808 in /usr/bin/python)
frame #29: <unknown function> + 0x27b00f (0x56012a07400f in /usr/bin/python)
frame #30: PyRun_StringFlags + 0x81 (0x56012a06dd91 in /usr/bin/python)
frame #31: PyRun_SimpleStringFlags + 0x41 (0x56012a06dc41 in /usr/bin/python)
frame #32: Py_RunMain + 0x3d0 (0x56012a06cf70 in /usr/bin/python)
frame #33: Py_BytesMain + 0x2d (0x56012a046e6d in /usr/bin/python)
frame #34: <unknown function> + 0x29d90 (0x7f5b5774dd90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #35: __libc_start_main + 0x80 (0x7f5b5774de40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x25 (0x56012a046d65 in /usr/bin/python)
W0513 06:52:49.383000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188788 via signal SIGTERM
W0513 06:52:49.384000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188789 via signal SIGTERM
W0513 06:52:49.385000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188790 via signal SIGTERM
W0513 06:52:49.387000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188791 via signal SIGTERM
W0513 06:52:49.388000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188792 via signal SIGTERM
W0513 06:52:49.391000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188793 via signal SIGTERM
W0513 06:52:49.393000 188723 torch/multiprocessing/spawn.py:169] Terminating process 188794 via signal SIGTERM
Traceback (most recent call last):
File "/mnt/yscfs/linjunxian/DeepEP/tests/test_internode.py", line 247, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 215, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 7 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/mnt/yscfs/linjunxian/DeepEP/tests/test_internode.py", line 235, in test_loop
test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
File "/mnt/yscfs/linjunxian/DeepEP/tests/test_internode.py", line 179, in test_main
t = bench(lambda: buffer.dispatch(**tune_args))[0]
File "/mnt/yscfs/linjunxian/DeepEP/tests/utils.py", line 96, in bench
torch.cuda.synchronize()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 985, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
test_internode.py passed on 2 * 8 H20, but timed out on 4 * 8 H20. I have already modified `kNumWarpsPerGroup` and `kNumWarpGroups` in DeepEP/csrc/kernels/internode_ll.cu to 8 and 4, respectively (referring to #15 (comment)).

script:
logs on node 1: