```shell
Loading required modules: 100%|████████████████████████████████████████████████████████████████████████████████| 6/6 [03:10<00:00, 31.74s/it]
[12-13 07:41:25] Pipelines instantiated
[12-13 07:41:25] Worker 0: Initialized device, model, and distributed environment.
[12-13 07:41:25] Worker 0: Scheduler loop started.
[12-13 07:41:25] Rank 0 scheduler listening on tcp://*:5638
Loading required modules: 100%|████████████████████████████████████████████████████████████████████████████████| 6/6 [03:10<00:00, 31.77s/it]
Loading required modules: 100%|████████████████████████████████████████████████████████████████████████████████| 6/6 [03:11<00:00, 31.84s/it]
Loading required modules: 100%|████████████████████████████████████████████████████████████████████████████████| 6/6 [03:12<00:00, 32.07s/it]
[12-13 07:41:27] Sampling params:
width: -1
height: -1
num_frames: 125
prompt: A young woman with long blonde hair, wearing a white t-shirt and blue jeans, walking through a sunny park with green trees in the background, carrying a brown shoulder bag, followed by a smooth tracking shot that moves alongside her as she walks.
neg_prompt: Vivid tones, overexposed, static, blurry details, subtitles, style, artwork, painting, frame, still, overall grayish, worst quality, low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, many people in the background, walking backwards
seed: 1024
infer_steps: 40
num_outputs_per_prompt: 1
guidance_scale: 4.0
embedded_guidance_scale: 6.0
n_tokens: -1
flow_shift: 12.0
image_path: None
save_output: True
output_file_path: outputs/A_young_woman_with_long_blonde_hair_wearing_a_white_t-shirt_and_blue_jeans_walking_through_a_sunny_20251213-074127_4269cbf7.mp4
[12-13 07:41:27] Processing prompt 1/1: A young woman with long blonde hair, wearing a white t-shirt and blue jeans, walking through a sunny
/nas/bbuf/sglang/python/sglang/multimodal_gen/runtime/utils/distributed.py:34: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:206.)
tensor_data = torch.ByteTensor(
[12-13 07:41:27] Creating pipeline stages...
[12-13 07:41:27] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[12-13 07:41:27] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-13 07:41:27] Profiling request: mocked_fake_id_for_offline_generate for 5 steps...
[12-13 07:41:27] Starting Profiler...
[12-13 07:41:27] [InputValidationStage] started...
[12-13 07:41:27] [InputValidationStage] finished in 0.0043 seconds
[12-13 07:41:27] [TextEncodingStage] started...
WARNING 12-13 07:41:31 [block_adapters.py:131] pipe is None, use FakeDiffusionPipeline instead.
INFO 12-13 07:41:31 [block_adapters.py:229] Auto fill blocks_name: ['blocks', 'blocks'].
INFO 12-13 07:41:31 [block_adapters.py:162] Found transformer NOT from diffusers: sglang.multimodal_gen.runtime.models.dits.wanvideo disable check_forward_pattern by default.
INFO 12-13 07:41:31 [cache_interface.py:200] cache_config is None, using default DBCacheConfig
INFO 12-13 07:41:31 [cache_adapter.py:77] Adapting Cache Acceleration using custom BlockAdapter!
WARNING 12-13 07:41:31 [block_adapters.py:478] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
WARNING 12-13 07:41:31 [block_adapters.py:478] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [cache_adapter.py:134] Use custom 'enable_separate_cfg' from BlockAdapter: True. Pipeline: FakeDiffusionPipeline.
INFO 12-13 07:41:31 [cache_adapter.py:307] Collected Context Config: DBCache_F1B0_W4I1M0MC8_R0.24_SCM1111111111001110000011100000111000110001_dynamic, Calibrator Config: None
INFO 12-13 07:41:31 [cache_adapter.py:307] Collected Context Config: DBCache_F1B0_W2I1M0MC20_R0.24_SCM1111111111001110000011100000111000110001_dynamic, Calibrator Config: None
WARNING 12-13 07:41:31 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for blocks, cache_context: blocks_140418593588720, context_manager: FakeDiffusionPipeline_140417988320464.
WARNING 12-13 07:41:31 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for blocks, cache_context: blocks_140418588350976, context_manager: FakeDiffusionPipeline_140417988320464.
WARNING 12-13 07:41:31 [block_adapters.py:131] pipe is None, use FakeDiffusionPipeline instead.
INFO 12-13 07:41:31 [block_adapters.py:229] Auto fill blocks_name: ['blocks', 'blocks'].
INFO 12-13 07:41:31 [block_adapters.py:162] Found transformer NOT from diffusers: sglang.multimodal_gen.runtime.models.dits.wanvideo disable check_forward_pattern by default.
INFO 12-13 07:41:31 [cache_interface.py:200] cache_config is None, using default DBCacheConfig
INFO 12-13 07:41:31 [cache_adapter.py:77] Adapting Cache Acceleration using custom BlockAdapter!
WARNING 12-13 07:41:31 [block_adapters.py:478] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
WARNING 12-13 07:41:31 [block_adapters.py:478] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [cache_adapter.py:134] Use custom 'enable_separate_cfg' from BlockAdapter: True. Pipeline: FakeDiffusionPipeline.
INFO 12-13 07:41:31 [cache_adapter.py:307] Collected Context Config: DBCache_F1B0_W4I1M0MC8_R0.24_SCM1111111111001110000011100000111000110001_dynamic, Calibrator Config: None
INFO 12-13 07:41:31 [cache_adapter.py:307] Collected Context Config: DBCache_F1B0_W2I1M0MC20_R0.24_SCM1111111111001110000011100000111000110001_dynamic, Calibrator Config: None
WARNING 12-13 07:41:31 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for blocks, cache_context: blocks_140336518225488, context_manager: FakeDiffusionPipeline_140334574005872.
WARNING 12-13 07:41:31 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for blocks, cache_context: blocks_140334574005152, context_manager: FakeDiffusionPipeline_140334574005872.
WARNING 12-13 07:41:31 [block_adapters.py:131] pipe is None, use FakeDiffusionPipeline instead.
INFO 12-13 07:41:31 [block_adapters.py:229] Auto fill blocks_name: ['blocks', 'blocks'].
INFO 12-13 07:41:31 [block_adapters.py:162] Found transformer NOT from diffusers: sglang.multimodal_gen.runtime.models.dits.wanvideo disable check_forward_pattern by default.
INFO 12-13 07:41:31 [cache_interface.py:200] cache_config is None, using default DBCacheConfig
INFO 12-13 07:41:31 [cache_adapter.py:77] Adapting Cache Acceleration using custom BlockAdapter!
WARNING 12-13 07:41:31 [block_adapters.py:478] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
WARNING 12-13 07:41:31 [block_adapters.py:478] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [cache_adapter.py:134] Use custom 'enable_separate_cfg' from BlockAdapter: True. Pipeline: FakeDiffusionPipeline.
INFO 12-13 07:41:31 [cache_adapter.py:307] Collected Context Config: DBCache_F1B0_W4I1M0MC8_R0.24_SCM1111111111001110000011100000111000110001_dynamic, Calibrator Config: None
INFO 12-13 07:41:31 [cache_adapter.py:307] Collected Context Config: DBCache_F1B0_W2I1M0MC20_R0.24_SCM1111111111001110000011100000111000110001_dynamic, Calibrator Config: None
WARNING 12-13 07:41:31 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for blocks, cache_context: blocks_139832392396256, context_manager: FakeDiffusionPipeline_139831119788416.
WARNING 12-13 07:41:31 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for blocks, cache_context: blocks_139831119936208, context_manager: FakeDiffusionPipeline_139831119788416.
[12-13 07:41:31] [TextEncodingStage] finished in 3.5739 seconds
[12-13 07:41:31] [ConditioningStage] started...
[12-13 07:41:31] [ConditioningStage] finished in 0.0001 seconds
[12-13 07:41:31] [TimestepPreparationStage] started...
[12-13 07:41:31] [TimestepPreparationStage] finished in 0.0036 seconds
[12-13 07:41:31] [LatentPreparationStage] started...
[12-13 07:41:31] [LatentPreparationStage] finished in 0.0021 seconds
[12-13 07:41:31] [DenoisingStage] started...
[12-13 07:41:31] cache-dit is running in distributed environment (world_size=4). Using local caching strategy: each GPU caches its own activation shards. This may be less accurate than single-GPU caching but provides speedup.
[12-13 07:41:31] SCM: generated mask with 22 compute steps, 18 cache steps (preset=medium)
[12-13 07:41:31] Enabling cache-dit on wan2.2 dual transformers with BlockAdapter
[12-13 07:41:31] Primary (transformer): Fn=1, Bn=0, W=4, R=0.24, MC=8, TaylorSeer=False
[12-13 07:41:31] Secondary (transformer_2): Fn=1, Bn=0, W=2, R=0.24, MC=20, TaylorSeer=False
[12-13 07:41:31] SCM enabled: 22 compute steps, 18 cache steps, policy=dynamic
WARNING 12-13 07:41:31 [block_adapters.py:131] pipe is None, use FakeDiffusionPipeline instead.
INFO 12-13 07:41:31 [block_adapters.py:229] Auto fill blocks_name: ['blocks', 'blocks'].
INFO 12-13 07:41:31 [block_adapters.py:162] Found transformer NOT from diffusers: sglang.multimodal_gen.runtime.models.dits.wanvideo disable check_forward_pattern by default.
INFO 12-13 07:41:31 [cache_interface.py:200] cache_config is None, using default DBCacheConfig
INFO 12-13 07:41:31 [cache_adapter.py:77] Adapting Cache Acceleration using custom BlockAdapter!
WARNING 12-13 07:41:31 [block_adapters.py:478] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
WARNING 12-13 07:41:31 [block_adapters.py:478] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [cache_adapter.py:134] Use custom 'enable_separate_cfg' from BlockAdapter: True. Pipeline: FakeDiffusionPipeline.
INFO 12-13 07:41:31 [cache_adapter.py:307] Collected Context Config: DBCache_F1B0_W4I1M0MC8_R0.24_SCM1111111111001110000011100000111000110001_dynamic, Calibrator Config: None
INFO 12-13 07:41:31 [cache_adapter.py:307] Collected Context Config: DBCache_F1B0_W2I1M0MC20_R0.24_SCM1111111111001110000011100000111000110001_dynamic, Calibrator Config: None
WARNING 12-13 07:41:31 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for blocks, cache_context: blocks_140168208559296, context_manager: FakeDiffusionPipeline_140164652767888.
WARNING 12-13 07:41:31 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_2
INFO 12-13 07:41:31 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for blocks, cache_context: blocks_140165252596704, context_manager: FakeDiffusionPipeline_140164652767888.
[12-13 07:41:31] cache-dit enabled on dual transformers (steps=40)
0%| | 0/40 [00:29<?, ?it/s]
[12-13 07:42:01] [DenoisingStage] Error during execution after 29571.3200 ms: The size of tensor a (135) must match the size of tensor b (134) at non-singleton dimension 4
Traceback (most recent call last):
File "/nas/bbuf/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/executors/parallel_executor.py", line 90, in _execute
batch = stage(batch, server_args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nas/bbuf/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py", line 192, in __call__
result = self.forward(batch, server_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/nas/bbuf/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1010, in forward
latents = self.scheduler.step(
^^^^^^^^^^^^^^^^^^^^
File "/nas/bbuf/sglang/python/sglang/multimodal_gen/runtime/models/schedulers/scheduling_flow_unipc_multistep.py", line 745, in step
model_output_convert = self.convert_model_output(model_output, sample=sample)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nas/bbuf/sglang/python/sglang/multimodal_gen/runtime/models/schedulers/scheduling_flow_unipc_multistep.py", line 333, in convert_model_output
x0_pred = sample - sigma_t * model_output
~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (135) must match the size of tensor b (134) at non-singleton dimension 4
[12-13 07:42:01] Stopping Profiler...
[12-13 07:42:01] Saving profiler traces to: /nas/bbuf/sglang/logs/mocked_fake_id_for_offline_generate-5_steps-global-rank0.trace.json.gz
[12-13 07:42:01] Failed to generate output for prompt 1: Error executing request mocked_fake_id_for_offline_generate: The size of tensor a (135) must match the size of tensor b (134) at non-singleton dimension 4
Traceback (most recent call last):
File "/nas/bbuf/sglang/python/sglang/multimodal_gen/runtime/utils/logging_utils.py", line 495, in log_generation_timer
yield timer
File "/nas/bbuf/sglang/python/sglang/multimodal_gen/runtime/entrypoints/diffusion_generator.py", line 273, in generate
raise Exception(f"{output_batch.error}")
Exception: Error executing request mocked_fake_id_for_offline_generate: The size of tensor a (135) must match the size of tensor b (134) at non-singleton dimension 4
[12-13 07:42:01] Completed batch processing. Generated 0 outputs in 33.90 seconds.
[12-13 07:42:01] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
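The failure hits on the very first denoising step, inside `convert_model_output` (`x0_pred = sample - sigma_t * model_output`). Note that 135 is not divisible by the ulysses degree of 4: splitting a length-135 width dimension across 4 sequence-parallel ranks gives uneven shards (34, 34, 34, 33), so my guess is that the cache-dit path and the scheduler end up off by one element at dim 4 after a pad/split/gather. A minimal sketch of the failing broadcast (the 5-D `[batch, channels, frames, height, width]` layout and all sizes other than 135/134 are assumptions, not pulled from the pipeline):

```python
import torch

# Illustrative shapes only: assumed [batch, channels, frames, height, width]
# video-latent layout; sizes are made up except the mismatched widths 135/134.
sample = torch.randn(1, 16, 32, 68, 135)        # full width on this rank
model_output = torch.randn(1, 16, 32, 68, 134)  # shard off by one at dim 4
sigma_t = torch.tensor(0.5)

# 135 is not divisible by the ulysses degree 4: chunking yields uneven shards.
print([s.shape[-1] for s in sample.chunk(4, dim=4)])  # [34, 34, 34, 33]

# Reproduces the error from the log:
# RuntimeError: The size of tensor a (135) must match the size of tensor b
# (134) at non-singleton dimension 4
x0_pred = sample - sigma_t * model_output
```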
### Reproduction
```shell
SGLANG_CACHE_DIT_ENABLED=true SGLANG_CACHE_DIT_WARMUP=4 SGLANG_CACHE_DIT_MC=8 SGLANG_CACHE_DIT_RDT=0.24 SGLANG_CACHE_DIT_FN=1 SGLANG_CACHE_DIT_BN=0 SGLANG_CACHE_DIT_SECONDARY_WARMUP=2 SGLANG_CACHE_DIT_SECONDARY_MC=20 SGLANG_CACHE_DIT_SECONDARY_RDT=0.24 SGLANG_CACHE_DIT_SECONDARY_FN=1 SGLANG_CACHE_DIT_SECONDARY_BN=0 SGLANG_CACHE_DIT_SCM_PRESET=medium SGLANG_CACHE_DIT_SCM_POLICY=dynamic sglang generate --model-path /nas/bbuf/Wan2.2-T2V-A14B-Diffusers --text-encoder-cpu-offload --pin-cpu-memory --num-gpus 4 --ulysses-degree 4 --profile --prompt "A young woman with long blonde hair, wearing a white t-shirt and blue jeans, walking through a sunny park with green trees in the background, carrying a brown shoulder bag, followed by a smooth tracking shot that moves alongside her as she walks."
```
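For reference while triaging: a common way sequence-parallel implementations avoid this class of off-by-one is to pad the sharded dimension up to a multiple of the SP world size before splitting, then strip the padding after gathering. A hypothetical sketch of that idea (`shard_width` and `unshard_width` are made-up helpers, not sglang APIs):

```python
import torch
import torch.nn.functional as F

def shard_width(t: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # Pad dim 4 (width) to a multiple of world_size so every rank gets an
    # identically sized shard, e.g. 135 -> 136 -> 4 shards of 34.
    pad = (-t.shape[4]) % world_size
    if pad:
        t = F.pad(t, (0, pad))  # pads the last dimension on the right
    return t.chunk(world_size, dim=4)[rank]

def unshard_width(shards: list[torch.Tensor], orig_width: int) -> torch.Tensor:
    # Reassemble the equal-size shards and drop the padding again.
    return torch.cat(shards, dim=4)[..., :orig_width]
```

With equal shards, every rank's `model_output` slice matches its `sample` slice, so the dim-4 broadcast mismatch above cannot arise.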
### Environment

```shell
root@d896cfd062bd:/nas/bbuf/sglang# python3 -m sglang.check_env
Python: 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.48
CUDA Driver Version: 550.127.05
PyTorch: 2.9.1+cu128
sglang: 0.5.6.post2
sgl_kernel: 0.3.19
flashinfer_python: 0.5.3
flashinfer_cubin: 0.5.3
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.12.14
fastapi: 0.124.2
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.1
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE SYS PIX NODE NODE NODE SYS SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE SYS NODE PIX NODE NODE SYS SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE SYS NODE NODE PIX NODE SYS SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE SYS NODE NODE NODE PIX SYS SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS NODE SYS SYS SYS SYS PIX NODE NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS NODE SYS SYS SYS SYS NODE PIX NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS NODE SYS SYS SYS SYS NODE NODE PIX 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS PIX SYS SYS SYS SYS NODE NODE NODE 48-95,144-191 1 N/A
NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X PXB SYS NODE NODE NODE NODE SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS PXB X SYS NODE NODE NODE NODE SYS SYS SYS
NIC2 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS X SYS SYS SYS SYS NODE NODE NODE
NIC3 PIX NODE NODE NODE SYS SYS SYS SYS NODE NODE SYS X NODE NODE NODE SYS SYS SYS
NIC4 NODE PIX NODE NODE SYS SYS SYS SYS NODE NODE SYS NODE X NODE NODE SYS SYS SYS
NIC5 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE SYS NODE NODE X NODE SYS SYS SYS
NIC6 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE SYS NODE NODE NODE X SYS SYS SYS
NIC7 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS NODE SYS SYS SYS SYS X NODE NODE
NIC8 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS NODE SYS SYS SYS SYS NODE X NODE
NIC9 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS NODE SYS SYS SYS SYS NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_25_0
NIC1: mlx5_200_0
NIC2: mlx5_400_0
NIC3: mlx5_400_1
NIC4: mlx5_400_2
NIC5: mlx5_400_3
NIC6: mlx5_400_4
NIC7: mlx5_400_5
NIC8: mlx5_400_6
NIC9: mlx5_400_7
ulimit soft: 1048576
```