[Bug] Start fails both in dp or tp mode #6297

@u4lr451

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Startup fails in both Data Parallelism (DP) mode and pure Tensor Parallelism (TP) mode.

Reproduction

main branch, commit ID: 3e350a9

tp mode

launch command (2 nodes)

TORCHINDUCTOR_FX_GRAPH_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=1 SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 64 128 130 --cuda-graph-max-bs 130 --attention-backend fa3 --model-path /sgl-workspace//DeepSeek-R1 --tp 16 --dist-init-addr ${HEAD_IP}:20000 --nnodes 2 --node-rank ${RANK} --trust-remote-code --host 0.0.0.0 --port 30000 --enable-ep-moe --max-running-requests 128 --mem-fraction-static 0.7 --speculative-draft /sgl-workspace/SGLang/DeepSeek-R1-NextN --speculative-num-steps 3 --speculative-eagle-topk 2 --speculative-num-draft-tokens 4 --chunked-prefill-size 8192 --moe-dense-tp-size 1 --disable-overlap-schedule --disable-cuda-graph

errors:

[2025-05-14 12:08:50 TP7] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2271, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 640, in event_loop_normal
    result = self.run_batch(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1528, in run_batch
    self.tp_worker.forward_batch_generation(model_worker_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 202, in forward_batch_generation
    logits_output, can_run_cuda_graph = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1111, in forward
    ret = self.forward_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1071, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1531, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1446, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1224, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1323, in forward_ffn_with_scattered_input
    forward_batch.gathered_buffer[: forward_batch.input_ids.shape[0]],
TypeError: 'NoneType' object is not subscriptable

[2025-05-14 12:08:50] Received sigquit from a child process. It usually means the child failed.
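
The last frame of the TP traceback shows `forward_batch.gathered_buffer` being sliced while it is `None`. The sketch below reproduces the same exception outside SGLang. `FakeForwardBatch`, `slice_gathered`, and `try_slice` are hypothetical stand-ins written for illustration, not SGLang code, and the sketch does not explain why the buffer was never allocated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeForwardBatch:
    """Hypothetical stand-in for the real forward batch; not SGLang code."""
    input_ids: List[int]
    gathered_buffer: Optional[List[float]] = None


def slice_gathered(batch: FakeForwardBatch):
    # Same shape as the failing expression: buffer[: number_of_input_ids]
    return batch.gathered_buffer[: len(batch.input_ids)]


def try_slice(batch: FakeForwardBatch) -> str:
    """Return "ok" if the slice works, else the TypeError message."""
    try:
        slice_gathered(batch)
        return "ok"
    except TypeError as exc:
        return str(exc)


# A batch whose buffer was never allocated fails like the TP run above.
print(try_slice(FakeForwardBatch(input_ids=[1, 2, 3])))
# A batch with an allocated buffer slices normally.
print(try_slice(FakeForwardBatch(input_ids=[1, 2], gathered_buffer=[0.0, 0.0, 0.0])))
```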

dp mode

launch command (2 nodes)

TORCHINDUCTOR_FX_GRAPH_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=1 SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 64 128 130 --cuda-graph-max-bs 130 --attention-backend fa3 --model-path /sgl-workspace//DeepSeek-R1 --tp 16 --dist-init-addr ${HEAD_IP}:20000 --nnodes 2 --node-rank ${RANK} --trust-remote-code --host 0.0.0.0 --port 30000 --enable-dp-attention --dp-size 16 --enable-ep-moe --max-running-requests 128 --mem-fraction-static 0.7 --speculative-draft /sgl-workspace/SGLang/DeepSeek-R1-NextN --speculative-num-steps 3 --speculative-eagle-topk 2 --speculative-num-draft-tokens 4 --chunked-prefill-size 8192 --moe-dense-tp-size 1 --disable-overlap-schedule --disable-cuda-graph
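
Compared token by token, the DP command differs from the TP command only by the added `--enable-dp-attention --dp-size 16`. The sketch below shows one way to check this with Python's standard `difflib` and `shlex`. The two command strings are shortened stand-ins for the full commands, and `inserted_tokens` is a hypothetical helper name.

```python
import difflib
import shlex
from typing import List


def inserted_tokens(base_cmd: str, other_cmd: str) -> List[str]:
    """Return the tokens that other_cmd adds relative to base_cmd, in order."""
    base, other = shlex.split(base_cmd), shlex.split(other_cmd)
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute tokens that exist only in other_cmd
        if tag in ("insert", "replace"):
            added.extend(other[j1:j2])
    return added


# Abbreviated versions of the two launch commands (most shared flags omitted).
TP_CMD = "python3 -m sglang.launch_server --tp 16 --port 30000 --enable-ep-moe"
DP_CMD = (
    "python3 -m sglang.launch_server --tp 16 --port 30000 "
    "--enable-dp-attention --dp-size 16 --enable-ep-moe"
)

print(inserted_tokens(TP_CMD, DP_CMD))
```

A sequence diff is used rather than a set difference because the value `16` also appears after `--tp`, so a set difference would drop it.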

errors:

[2025-05-14 12:00:03 DP2 TP2] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2271, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 640, in event_loop_normal
    result = self.run_batch(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1528, in run_batch
    self.tp_worker.forward_batch_generation(model_worker_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 202, in forward_batch_generation
    logits_output, can_run_cuda_graph = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1109, in forward
    ret = self.forward_decode(forward_batch, pp_proxy_tensors=pp_proxy_tensors)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1051, in forward_decode
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1531, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1446, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1224, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1331, in forward_ffn_with_scattered_input
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 657, in forward
    return self.forward_absorb(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 780, in forward_absorb
    attn_output = self.attn_mqa(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 101, in forward
    return forward_batch.attn_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 69, in forward
    return self.forward_decode(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashattention_backend.py", line 1099, in forward_decode
    page_table=self.forward_metadata_spec_decode_expand.page_table,
AttributeError: 'NoneType' object has no attribute 'page_table'

[2025-05-14 12:00:03 DP1 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2271, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 640, in event_loop_normal
    result = self.run_batch(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1528, in run_batch
    self.tp_worker.forward_batch_generation(model_worker_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 202, in forward_batch_generation
    logits_output, can_run_cuda_graph = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1109, in forward
    ret = self.forward_decode(forward_batch, pp_proxy_tensors=pp_proxy_tensors)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1051, in forward_decode
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1531, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1446, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1224, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1331, in forward_ffn_with_scattered_input
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 657, in forward
    return self.forward_absorb(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 780, in forward_absorb
    attn_output = self.attn_mqa(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 101, in forward
    return forward_batch.attn_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 69, in forward
    return self.forward_decode(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashattention_backend.py", line 1099, in forward_decode
    page_table=self.forward_metadata_spec_decode_expand.page_table,
AttributeError: 'NoneType' object has no attribute 'page_table'
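
Both DP ranks fail at the same frame (`flashattention_backend.py`, line 1099, in `forward_decode`), while the TP run fails earlier, in `deepseek_v2.py`. For triage across many ranks, a small log summarizer along the following lines may help. It is only a sketch: the regular expressions assume the log layout shown in this report, and `summarize` is a hypothetical helper, not part of SGLang.

```python
import re
from typing import List, Tuple

# Patterns assume the log layout shown in this report.
HEADER = re.compile(
    r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (?P<rank>[^\]]+)\] Scheduler hit an exception"
)
FRAME = re.compile(r'File "([^"]+)", line (\d+), in (\w+)')
ERROR = re.compile(r"^(?P<type>\w+Error): ")


def summarize(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (rank, exception type, innermost frame) for each traceback."""
    summaries = []
    rank, last_frame = None, None
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            rank, last_frame = header.group("rank"), None
            continue
        frames = FRAME.findall(line)
        if frames:
            # Keep the last frame seen; it is the innermost call.
            last_frame = frames[-1]
            continue
        error = ERROR.match(line)
        if error and rank is not None and last_frame is not None:
            path, lineno, func = last_frame
            location = f"{path.rsplit('/', 1)[-1]}:{lineno} in {func}"
            summaries.append((rank, error.group("type"), location))
            rank = None
    return summaries


# Shortened excerpt of the TP traceback from this report.
SAMPLE = """\
[2025-05-14 12:08:50 TP7] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1323, in forward_ffn_with_scattered_input
    forward_batch.gathered_buffer[: forward_batch.input_ids.shape[0]],
TypeError: 'NoneType' object is not subscriptable
"""

print(summarize(SAMPLE))
```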

Environment

Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.08
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post4
sgl_kernel: 0.1.2.post1
flashinfer_python: 0.2.5
triton: 3.2.0
transformers: 4.51.1
torchao: 0.10.0
numpy: 2.2.5
aiohttp: 3.11.18
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.16
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.11.2
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.76.0
tiktoken: 0.9.0
anthropic: 0.50.0
litellm: 1.67.4.post1
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 NIC16 NIC17 NIC18 NIC19 NIC20 NIC21 NIC22 NIC23 NIC24 NIC25 NIC26 NIC27 NIC28 NIC29 NIC30 NIC31 NIC32 NIC33 NIC34 NIC35 NIC36 NIC37 NIC38 NIC39 NIC40 NIC41 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS NODE NODE PHB PIX SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS NODE NODE PIX PHB SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE PIX NODE 96-191,288-383 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE 96-191,288-383 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS PHB NODE NODE PIX 96-191,288-383 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE PHB 96-191,288-383 1 N/A
NIC0 SYS SYS SYS SYS NODE NODE NODE NODE X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC1 SYS SYS SYS SYS NODE NODE NODE NODE PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC2 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC3 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC4 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC5 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC6 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC9 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC10 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC11 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC12 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC13 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC14 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC15 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC16 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC17 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC18 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC19 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC20 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC21 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC22 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC23 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC24 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC25 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC26 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC27 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC28 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC29 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIXPIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC30 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC31 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC32 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX X PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC33 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX PIX X SYS SYS SYS SYS NODE NODE NODE NODE
NIC34 PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS
NIC35 NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS
NIC36 NODE PHB PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS NODE NODE X PHB SYS SYS SYS SYS
NIC37 NODE PIX PHB NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS SYS SYS NODE NODE PHB X SYS SYS SYS SYS
NIC38 SYS SYS SYS SYS NODE NODE PHB PIX NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS X NODE NODE PHB
NIC39 SYS SYS SYS SYS NODE PIX NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS NODE X NODE NODE
NIC40 SYS SYS SYS SYS PIX NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE X NODE
NIC41 SYS SYS SYS SYS NODE NODE PIX PHB NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS PHB NODE NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: NIC41 CPU Affinity NUMA Affinity GPU NUMA ID
NIC1: MA ID
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
NIC12: mlx5_12
NIC13: mlx5_13
NIC14: mlx5_14
NIC15: mlx5_15
NIC16: mlx5_16
NIC17: mlx5_17
NIC18: mlx5_18
NIC19: mlx5_19
NIC20: mlx5_20
NIC21: mlx5_21
NIC22: mlx5_22
NIC23: mlx5_23
NIC24: mlx5_24
NIC25: mlx5_25
NIC26: mlx5_26
NIC27: mlx5_27
NIC28: mlx5_28
NIC29: mlx5_29
NIC30: mlx5_30
NIC31: mlx5_31
NIC32: mlx5_32
NIC33: mlx5_33
NIC34: mlx5_bond_1
NIC35: mlx5_bond_2
NIC36: mlx5_bond_3
NIC37: mlx5_bond_4
NIC38: mlx5_bond_5
NIC39: mlx5_bond_6
NIC40: mlx5_bond_7
NIC41: mlx5_bond_8

ulimit soft: 1000000
