Checklist
Describe the bug
There is a high chance of a segfault crash after 0.4.6.post2;
I did not observe the same issue before 0.4.6.post1.
Sorry for the misleading statement above: it was probably due to insufficient parallel testing on the older versions. After repeating the tests, I found that version 0.4.6 also has this issue.
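For anyone comparing versions, here is a minimal sketch of how repeated runs could be tallied from their logs. It is not the script used for the tests above. The marker strings are copied from the crash logs in this report; the function names and the `(version, log_text)` input shape are assumptions made for illustration. Hangs like Case 3 produce no marker, so they would be counted as "clean" here and need a separate check.

```python
from collections import Counter

# Marker strings copied from the crash logs in this report.
SEGFAULT_MARKERS = (
    "Caught signal 11",
    "Fatal Python error: Segmentation fault",
)


def has_segfault(log_text: str) -> bool:
    """Return True if the log contains any of the segfault markers."""
    return any(marker in log_text for marker in SEGFAULT_MARKERS)


def tally_by_version(runs):
    """Count segfaulting and clean runs per version.

    `runs` is an iterable of (version, log_text) pairs; how the logs are
    collected is left to the caller (hypothetical input shape).
    """
    tally = {}
    for version, log_text in runs:
        counts = tally.setdefault(version, Counter())
        counts["segfault" if has_segfault(log_text) else "clean"] += 1
    return tally


if __name__ == "__main__":
    # Made-up sample logs for illustration only, not real results.
    sample = [
        ("0.4.6", "startup...\nFatal Python error: Segmentation fault\n"),
        ("0.4.6", "startup...\nserver ready\n"),
    ]
    for version, counts in tally_by_version(sample).items():
        print(version, dict(counts))
```

With enough runs per version, the per-version counts give a rough crash rate, which is more reliable than a single pass or fail per version.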
Case 1:
Crash on DeepGEMM:
2025-05-21 16:54:04 | 56%|█████▋ | 9/16 [00:12<00:10, 1.43s/it] 100%|██████████| 16/16 [00:12<00:00, 1.26it/s]
2025-05-21 16:54:04 | 100%|██████████| 16/16 [00:14<00:00, 1.55it/s] 100%|██████████| 16/16 [00:14<00:00, 1.13it/s]
2025-05-21 16:54:04 | 100%|██████████| 16/16 [00:13<00:00, 1.58it/s] 100%|██████████| 16/16 [00:13<00:00, 1.19it/s]
2025-05-21 16:54:04 | [2025-05-21 07:54:04 TP0] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=272,N=128,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json.
2025-05-21 16:54:13 | [sgl-v3-deepseek-nextn-0-1:119 :0:119] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55555d5eec40)
2025-05-21 16:54:13 | ==== backtrace (tid: 119) ====
2025-05-21 16:54:13 | 0 0x0000000000042520 __sigaction() ???:0
2025-05-21 16:54:13 | =================================
2025-05-21 16:54:13 | Fatal Python error: Segmentation fault
2025-05-21 16:54:13 |
2025-05-21 16:54:13 | Thread 0x00007feec7fff640 (most recent call first):
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 324 in wait
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 607 in wait
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:54:13 |
2025-05-21 16:54:13 | Thread 0x00007feed3fff640 (most recent call first):
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 324 in wait
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 607 in wait
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:54:13 |
2025-05-21 16:54:13 | Thread 0x00007ffb88aff640 (most recent call first):
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 953 in run
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:54:13 |
2025-05-21 16:54:13 | Current thread 0x00007ffff7c5c480 (most recent call first):
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 163 in all_gather
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 474 in _all_gather_into_tensor
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 150 in reg_all_gather_into_tensor
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123 in __call__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 484 in all_gather_into_tensor
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 532 in all_gather
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 20 in tensor_model_parallel_all_gather
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 480 in _get_logits
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 333 in forward
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1533 in forward
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 488 in run_once
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 500 in capture_one_batch_size
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 379 in capture
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 295 in __init__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1028 in init_cuda_graphs
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 241 in initialize
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 192 in __init__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78 in __init__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 273 in __init__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2255 in run_scheduler_process
2025-05-21 16:54:13 | File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
2025-05-21 16:54:13 | File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
2025-05-21 16:54:13 | File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
2025-05-21 16:54:13 | File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
2025-05-21 16:54:13 | File "<string>", line 1 in <module>
2025-05-21 16:54:13 |
Case 2:
Crash on CUDA graph:
[2025-05-21 07:20:02 TP14] Capture draft cuda graph begin. This can take up to several minutes. avail mem=8.60 GB
2025-05-21 16:20:15 | [sgl-v3-deepseek-nextn-2-1:116 :0:116] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55555d5d7420)
2025-05-21 16:20:15 | ==== backtrace (tid: 116) ====
2025-05-21 16:20:15 | 0 0x0000000000042520 __sigaction() ???:0
2025-05-21 16:20:15 | =================================
2025-05-21 16:20:15 | Fatal Python error: Segmentation fault
2025-05-21 16:20:15 |
2025-05-21 16:20:15 | Thread 0x00007feec7fff640 (most recent call first):
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 324 in wait
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 607 in wait
2025-05-21 16:20:15 | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:20:15 |
2025-05-21 16:20:15 | Thread 0x00007feed3fff640 (most recent call first):
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 324 in wait
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 607 in wait
2025-05-21 16:20:15 | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:20:15 |
2025-05-21 16:20:15 | Thread 0x00007ffb88aff640 (most recent call first):
2025-05-21 16:20:15 | File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
2025-05-21 16:20:15 | File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 953 in run
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:20:15 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:20:15 |
2025-05-21 16:20:15 | Current thread 0x00007ffff7c5c480 (most recent call first):
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 163 in all_gather
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 474 in _all_gather_into_tensor
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 150 in reg_all_gather_into_tensor
2025-05-21 16:20:15 | File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123 in __call__
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 484 in all_gather_into_tensor
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 532 in all_gather
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 20 in tensor_model_parallel_all_gather
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 480 in _get_logits
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 333 in forward
2025-05-21 16:20:15 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
2025-05-21 16:20:15 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 156 in forward
2025-05-21 16:20:15 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 465 in draft_forward
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 156 in run_once
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 166 in capture_one_batch_size
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 379 in capture
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 103 in capture
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 82 in __init__
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 223 in init_cuda_graphs
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 148 in __init__
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 286 in __init__
2025-05-21 16:20:15 | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2255 in run_scheduler_process
2025-05-21 16:20:15 | File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
2025-05-21 16:20:15 | File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
2025-05-21 16:20:15 | File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
2025-05-21 16:20:15 | File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
2025-05-21 16:20:15 | File "<string>", line 1 in <module>
2025-05-21 16:20:15 |
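In the Case 1 and Case 2 traces above, the current thread's first twelve frames are identical, from `ncclAllGather` in pynccl_wrapper.py up through `forward` in logits_processor.py and the two module.py call wrappers. They diverge only at the model file (deepseek_v2.py versus deepseek_nextn.py). A minimal sketch of how that comparison could be automated on faulthandler-style dumps follows; the function names are made up for illustration, and the sample inputs are abbreviated excerpts of the traces, not the full logs.

```python
import re

# Matches faulthandler frame lines such as:
#   File "/path/to/file.py", line 413 in ncclAllGather
FRAME_RE = re.compile(r'File "([^"]+)", line (\d+) in (\S+)')


def current_thread_frames(log_text):
    """Return (path, line, function) tuples from the 'Current thread' section."""
    frames = []
    in_current = False
    for line in log_text.splitlines():
        if "Current thread" in line:
            in_current = True
            continue
        if in_current:
            if "Thread 0x" in line:
                break  # another thread's section begins
            match = FRAME_RE.search(line)
            if match:
                frames.append((match.group(1), int(match.group(2)), match.group(3)))
    return frames


def common_leading_frames(a, b):
    """Frames shared by both stacks, counted from the crashing call outward."""
    shared = []
    for frame_a, frame_b in zip(a, b):
        if frame_a != frame_b:
            break
        shared.append(frame_a)
    return shared


if __name__ == "__main__":
    # Abbreviated excerpts of the two traces, for illustration only.
    case1 = (
        "Current thread 0x00007ffff7c5c480 (most recent call first):\n"
        'File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 480 in _get_logits\n'
        'File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1533 in forward\n'
    )
    case2 = (
        "Current thread 0x00007ffff7c5c480 (most recent call first):\n"
        'File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 480 in _get_logits\n'
        'File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 156 in forward\n'
    )
    shared = common_leading_frames(
        current_thread_frames(case1), current_thread_frames(case2)
    )
    for path, lineno, func in shared:
        print(f"{path}:{lineno} in {func}")
```

The regex uses `search` rather than `match`, so it tolerates timestamp or table prefixes in front of each frame line.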
Case 3:
Both nodes hang:
Reproduction
I'm running on LWS with the following parameters:
spec:
  containers:
    - command:
        - /bin/sh
        - '-c'
        - >
          python3 -m sglang.launch_server --host 0.0.0.0 --port 8888
          --trust-remote-code --show-time-cost \
          --model-path /tmp/scratch-space/DeepSeek-V3-0324 --tp 16 \
          --dist-init-addr $(LWS_LEADER_ADDRESS):5000 --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) \
          --speculative-algo EAGLE --speculative-eagle-topk 1 --speculative-num-steps 3 --speculative-num-draft-tokens 4 \
          --max-running-requests 256 --tool-call-parser deepseekv3 \
          --chat-template /sgl-workspace/sglang/examples/chat_template/tool_chat_template_deepseekv3.jinja
      env:
        - name: LWS_WORKER_INDEX
          valueFrom:
            fieldRef:
              fieldPath: >-
                metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
      image: lmsysorg/sglang:v0.4.6.post4-cu124
Environment
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 570.133.20
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post4
sgl_kernel: 0.1.2.post1
flashinfer_python: 0.2.5+cu124torch2.6
triton: 3.2.0
transformers: 4.51.1
torchao: 0.11.0
numpy: 2.2.5
aiohttp: 3.11.18
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.31.1
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.4
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.75.0
tiktoken: 0.9.0
anthropic: 0.51.0
litellm: 1.69.1
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 NIC16 NIC17 NIC18 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE NODE NODE SYS SYS SYS SYS PIX SYS NODE NODE SYS NODE SYS SYS SYS 0-55,112-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE NODE NODE SYS SYS SYS SYS NODE SYS PIX NODE SYS NODE SYS SYS SYS 0-55,112-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE PIX NODE NODE NODE SYS SYS SYS SYS NODE SYS NODE PIX SYS NODE SYS SYS SYS 0-55,112-167 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PIX SYS SYS SYS SYS NODE SYS NODE NODE SYS PIX SYS SYS SYS 0-55,112-167 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PIX NODE NODE NODE SYS NODE SYS SYS NODE SYS PIX NODE NODE 56-111,168-223 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS NODE PIX NODE NODE SYS NODE SYS SYS NODE SYS NODE PIX NODE 56-111,168-223 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE PIX PIX SYS NODE SYS SYS NODE SYS NODE NODE NODE 56-111,168-223 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE SYS PIX SYS SYS PIX SYS NODE NODE NODE 56-111,168-223 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE NODE SYS SYS SYS SYS PIX SYS NODE NODE SYS NODE SYS SYS SYS
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE NODE NODE SYS SYS SYS SYS NODE SYS PIX NODE SYS NODE SYS SYS SYS
NIC2 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE NODE NODE SYS SYS SYS SYS NODE SYS NODE PIX SYS NODE SYS SYS SYS
NIC3 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE X PIX NODE SYS SYS SYS SYS NODE SYS NODE NODE SYS NODE SYS SYS SYS
NIC4 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE PIX X NODE SYS SYS SYS SYS NODE SYS NODE NODE SYS NODE SYS SYS SYS
NIC5 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS NODE SYS NODE NODE SYS PIX SYS SYS SYS
NIC6 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE SYS NODE SYS SYS NODE SYS PIX NODE NODE
NIC7 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS SYS SYS NODE X NODE NODE SYS NODE SYS SYS NODE SYS NODE PIX NODE
NIC8 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS SYS SYS NODE NODE X PIX SYS NODE SYS SYS NODE SYS NODE NODE NODE
NIC9 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS SYS SYS NODE NODE PIX X SYS NODE SYS SYS NODE SYS NODE NODE NODE
NIC10 PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE NODE NODE SYS SYS SYS SYS X SYS NODE NODE SYS NODE SYS SYS SYS
NIC11 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE SYS X SYS SYS PIX SYS NODE NODE NODE
NIC12 NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE NODE NODE SYS SYS SYS SYS NODE SYS X NODE SYS NODE SYS SYS SYS
NIC13 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE NODE NODE SYS SYS SYS SYS NODE SYS NODE X SYS NODE SYS SYS SYS
NIC14 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE SYS PIX SYS SYS X SYS NODE NODE NODE
NIC15 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE NODE PIX SYS SYS SYS SYS NODE SYS NODE NODE SYS X SYS SYS SYS
NIC16 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS PIX NODE NODE NODE SYS NODE SYS SYS NODE SYS X NODE NODE
NIC17 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS SYS SYS NODE PIX NODE NODE SYS NODE SYS SYS NODE SYS NODE X NODE
NIC18 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE SYS NODE SYS SYS NODE SYS NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
NIC12: mlx5_12
NIC13: mlx5_13
NIC14: mlx5_14
NIC15: mlx5_15
NIC16: mlx5_16
NIC17: mlx5_17
NIC18: mlx5_bond_0
ulimit soft: 1048576