[Bug] Crash/Hang during CUDA graph capture on H100*2

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.

### Describe the bug

The server is very likely to crash with a segfault on versions after 0.4.6.post2.
~I didn't observe the same issue before 0.4.6.post1.~
Correction: sorry for the misleading statement above. It may have been due to insufficient parallel testing on the older version. After repeating the tests, I found that version 0.4.6 also has this issue.

**Case 1:**

Crash on DeepGEMM:

```
2025-05-21 16:54:04 | 56%|█████▋    | 9/16 [00:12<00:10,  1.43s/it] 100%|██████████| 16/16 [00:12<00:00,  1.26it/s]
2025-05-21 16:54:04 | 100%|██████████| 16/16 [00:14<00:00,  1.55it/s] 100%|██████████| 16/16 [00:14<00:00,  1.13it/s]
2025-05-21 16:54:04 | 100%|██████████| 16/16 [00:13<00:00,  1.58it/s] 100%|██████████| 16/16 [00:13<00:00,  1.19it/s]
2025-05-21 16:54:04 | [2025-05-21 07:54:04 TP0] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=272,N=128,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json.
2025-05-21 16:54:13 | [sgl-v3-deepseek-nextn-0-1:119  :0:119] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55555d5eec40)
2025-05-21 16:54:13 | ==== backtrace (tid:    119) ====
2025-05-21 16:54:13 | 0 0x0000000000042520 __sigaction()  ???:0
2025-05-21 16:54:13 | =================================
2025-05-21 16:54:13 | Fatal Python error: Segmentation fault
2025-05-21 16:54:13 |
2025-05-21 16:54:13 | Thread 0x00007feec7fff640 (most recent call first):
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 324 in wait
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 607 in wait
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:54:13 |
2025-05-21 16:54:13 | Thread 0x00007feed3fff640 (most recent call first):
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 324 in wait
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 607 in wait
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:54:13 |
2025-05-21 16:54:13 | Thread 0x00007ffb88aff640 (most recent call first):
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 953 in run
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:54:13 | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:54:13 |
2025-05-21 16:54:13 | Current thread 0x00007ffff7c5c480 (most recent call first):
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 163 in all_gather
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 474 in _all_gather_into_tensor
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 150 in reg_all_gather_into_tensor
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123 in __call__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 484 in all_gather_into_tensor
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 532 in all_gather
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 20 in tensor_model_parallel_all_gather
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 480 in _get_logits
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 333 in forward
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1533 in forward
2025-05-21 16:54:13 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 488 in run_once
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 500 in capture_one_batch_size
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 379 in capture
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 295 in __init__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1028 in init_cuda_graphs
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 241 in initialize
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 192 in __init__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78 in __init__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 273 in __init__
2025-05-21 16:54:13 | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2255 in run_scheduler_process
2025-05-21 16:54:13 | File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
2025-05-21 16:54:13 | File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
2025-05-21 16:54:13 | File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
2025-05-21 16:54:13 | File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
2025-05-21 16:54:13 | File "<string>", line 1 in <module>
```

**Case 2:**

Crash on CUDA graph:

```
[2025-05-21 07:20:02 TP14] Capture draft cuda graph begin. This can take up to several minutes. avail mem=8.60 GB
2025-05-21 16:20:15 | [sgl-v3-deepseek-nextn-2-1:116  :0:116] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55555d5d7420)
2025-05-21 16:20:15 | ==== backtrace (tid:    116) ====
2025-05-21 16:20:15 |  0 0x0000000000042520 __sigaction()  ???:0
2025-05-21 16:20:15 | =================================
2025-05-21 16:20:15 | Fatal Python error: Segmentation fault
2025-05-21 16:20:15 |
2025-05-21 16:20:15 | Thread 0x00007feec7fff640 (most recent call first):
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 324 in wait
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 607 in wait
2025-05-21 16:20:15 |   File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:20:15 |
2025-05-21 16:20:15 | Thread 0x00007feed3fff640 (most recent call first):
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 324 in wait
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 607 in wait
2025-05-21 16:20:15 |   File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:20:15 |
2025-05-21 16:20:15 | Thread 0x00007ffb88aff640 (most recent call first):
2025-05-21 16:20:15 |   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
2025-05-21 16:20:15 |   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 953 in run
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
2025-05-21 16:20:15 |
2025-05-21 16:20:15 | Current thread 0x00007ffff7c5c480 (most recent call first):
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 163 in all_gather
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 474 in _all_gather_into_tensor
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 150 in reg_all_gather_into_tensor
2025-05-21 16:20:15 |   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123 in __call__
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 484 in all_gather_into_tensor
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 532 in all_gather
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 20 in tensor_model_parallel_all_gather
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 480 in _get_logits
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 333 in forward
2025-05-21 16:20:15 |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
2025-05-21 16:20:15 |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 156 in forward
2025-05-21 16:20:15 |   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 465 in draft_forward
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 156 in run_once
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 166 in capture_one_batch_size
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 379 in capture
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 103 in capture
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 82 in __init__
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 223 in init_cuda_graphs
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 148 in __init__
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 286 in __init__
2025-05-21 16:20:15 |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2255 in run_scheduler_process
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
2025-05-21 16:20:15 |   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
2025-05-21 16:20:15 |   File "<string>", line 1 in <module>
```
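
In both Case 1 and Case 2, the innermost frame of the `Current thread` section is `ncclAllGather` at `pynccl_wrapper.py` line 413, reached from `_get_logits` in `logits_processor.py`. The stacks only diverge above `module.py` (`deepseek_v2.py` in Case 1, `deepseek_nextn.py` in Case 2). To compare more dumps from repeated runs, a small helper along these lines might be useful. It is only a sketch: `current_thread_frames` and `FRAME_RE` are hypothetical names, not part of sglang, and it assumes the dump follows the Python `faulthandler` layout shown above.

```python
import re

# Matches one frame line of a faulthandler dump, e.g.
#   File "/path/to/file.py", line 413 in ncclAllGather
# search() is used so that log-viewer timestamp prefixes do not matter.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>\S+)')


def current_thread_frames(log_text):
    """Return (path, line, func) for each frame of the crashing thread, innermost first."""
    frames = []
    in_current = False
    for raw in log_text.splitlines():
        if "Current thread 0x" in raw:
            in_current = True
            continue
        if not in_current:
            continue
        if re.search(r"\bThread 0x", raw):
            break  # another thread's section starts here
        match = FRAME_RE.search(raw)
        if match:
            frames.append((match["path"], int(match["line"]), match["func"]))
    return frames


# Shortened excerpt of the Case 1 dump, used only as sample input.
sample = """\
Thread 0x00007feec7fff640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait

Current thread 0x00007ffff7c5c480 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 163 in all_gather
"""

frames = current_thread_frames(sample)
print(frames[0])
```

Running it over each crash log and comparing the first few tuples shows quickly whether a new crash is the same `ncclAllGather` failure or a different one.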

**Case 3:**

Both nodes hang:

<img width="1394" alt="Image" src="https://github.com/user-attachments/assets/8d64646d-c229-406c-a4bb-7d8518b48177" />

### Reproduction

I'm running on LWS (LeaderWorkerSet) with the following parameters:

```
      spec:
        containers:
          - command:
              - /bin/sh
              - '-c'
              - >
                python3 -m sglang.launch_server --host 0.0.0.0 --port 8888
                --trust-remote-code --show-time-cost \
                    --model-path /tmp/scratch-space/DeepSeek-V3-0324 --tp 16 \
                    --dist-init-addr $(LWS_LEADER_ADDRESS):5000 --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) \
                    --speculative-algo EAGLE --speculative-eagle-topk 1 --speculative-num-steps 3 --speculative-num-draft-tokens 4 \
                    --max-running-requests 256 --tool-call-parser deepseekv3 \
                    --chat-template /sgl-workspace/sglang/examples/chat_template/tool_chat_template_deepseekv3.jinja
            env:
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: >-
                      metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:v0.4.6.post4-cu124
```
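
The logs above are tagged with global TP ranks (`TP0` in Case 1, `TP14` in Case 2), while the launch uses `--tp 16` across two nodes with 8 GPUs each. A rough way to map a rank back to a node and local GPU is sketched below. It assumes ranks are laid out contiguously node by node, which I have not verified against the sglang source, and `locate_tp_rank` is a hypothetical helper name.

```python
def locate_tp_rank(tp_rank, tp_size=16, gpus_per_node=8):
    """Map a global TP rank to (node_rank, local_gpu_index).

    ASSUMPTION: ranks are assigned contiguously, node by node
    (ranks 0-7 on node 0, ranks 8-15 on node 1). Not verified against sglang.
    """
    if not 0 <= tp_rank < tp_size:
        raise ValueError(f"tp_rank {tp_rank} is outside [0, {tp_size})")
    # divmod gives (quotient, remainder) = (node_rank, local_gpu_index)
    return divmod(tp_rank, gpus_per_node)


node_rank, local_gpu = locate_tp_rank(14)
print(node_rank, local_gpu)
```

Under that assumption, `TP14` would be local GPU 6 on the node with `--node-rank 1`.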

### Environment

```
Python: 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 570.133.20
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post4
sgl_kernel: 0.1.2.post1
flashinfer_python: 0.2.5+cu124torch2.6
triton: 3.2.0
transformers: 4.51.1
torchao: 0.11.0
numpy: 2.2.5
aiohttp: 3.11.18
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.31.1
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.4
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.75.0
tiktoken: 0.9.0
anthropic: 0.51.0
litellm: 1.69.1
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11  NIC12    NIC13   NIC14   NIC15   NIC16   NIC17   NIC18   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     SYS    NODE     NODE    SYS     NODE    SYS     SYS     SYS     0-55,112-167    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS    PIX      NODE    SYS     NODE    SYS     SYS     SYS     0-55,112-167    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS    NODE     PIX     SYS     NODE    SYS     SYS     SYS     0-55,112-167    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    SYS    NODE     NODE    SYS     PIX     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     NODE   SYS      SYS     NODE    SYS     PIX     NODE    NODE    56-111,168-223  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     NODE   SYS      SYS     NODE    SYS     NODE    PIX     NODE    56-111,168-223  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PIX     PIX     SYS     NODE   SYS      SYS     NODE    SYS     NODE    NODE    NODE    56-111,168-223  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     PIX    SYS      SYS     PIX     SYS     NODE    NODE    NODE    56-111,168-223  1               N/A
NIC0    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     SYS    NODE     NODE    SYS     NODE    SYS     SYS     SYS
NIC1    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS    PIX      NODE    SYS     NODE    SYS     SYS     SYS
NIC2    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS    NODE     PIX     SYS     NODE    SYS     SYS     SYS
NIC3    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      PIX     NODE    SYS     SYS     SYS     SYS     NODE    SYS    NODE     NODE    SYS     NODE    SYS     SYS     SYS
NIC4    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX      X      NODE    SYS     SYS     SYS     SYS     NODE    SYS    NODE     NODE    SYS     NODE    SYS     SYS     SYS
NIC5    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     NODE    SYS    NODE     NODE    SYS     PIX     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     NODE   SYS      SYS     NODE    SYS     PIX     NODE    NODE
NIC7    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     NODE   SYS      SYS     NODE    SYS     NODE    PIX     NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      PIX     SYS     NODE   SYS      SYS     NODE    SYS     NODE    NODE    NODE
NIC9    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PIX      X      SYS     NODE   SYS      SYS     NODE    SYS     NODE    NODE    NODE
NIC10   PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      SYS    NODE     NODE    SYS     NODE    SYS     SYS     SYS
NIC11   SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS      X     SYS      SYS     PIX     SYS     NODE    NODE    NODE
NIC12   NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS     X       NODE    SYS     NODE    SYS     SYS     SYS
NIC13   NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS    NODE      X      SYS     NODE    SYS     SYS     SYS
NIC14   SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     PIX    SYS      SYS      X      SYS     NODE    NODE    NODE
NIC15   NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    SYS    NODE     NODE    SYS      X      SYS     SYS     SYS
NIC16   SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     NODE   SYS      SYS     NODE    SYS      X      NODE    NODE
NIC17   SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     NODE   SYS      SYS     NODE    SYS     NODE     X      NODE
NIC18   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     NODE   SYS      SYS     NODE    SYS     NODE    NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_15
  NIC16: mlx5_16
  NIC17: mlx5_17
  NIC18: mlx5_bond_0


ulimit soft: 1048576
```

[Bug] Crash/Hang during CUDA graph capture on H100*2 #6496
