Describe the bug

When running SGLang disaggregation (prefill + decode) on GB300 (SM103) with the TRTLLM MLA attention backend and chunked-prefill-size=-1, the server intermittently crashes with repeated

Error: Failed to initialize the TMA descriptor 1

followed by a CUDA failure reported by the ProcessGroupNCCL watchdog:

CUDA error: an illegal instruction was encountered

This appears to be triggered during execution of a kernel that initializes a TMA descriptor (every dump shows the same globalDim/globalStrides/boxDim), and the error repeats multiple times before the process terminates. Log excerpt:
TMA Desc Addr: 0xffffffff5680
format 9
dim 3
gmem_address 0xfffd770e9c0c
globalDim (7168,119384,1,1,1)
globalStrides (2,14336,0,0,0)
boxDim (32,128,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 2
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 1
[... the same descriptor dump and error repeat 7 more times with identical parameters ...]
TMA Desc Addr: 0xffffffff5100
format 9
dim 3
gmem_address 0xfffd770e9c0c
globalDim (7168,119384,1,1,1)
globalStrides (2,14336,0,0,0)
boxDim (32,128,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 2
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 1
[... the dump with TMA Desc Addr 0xffffffff5100 repeats once more before the watchdog fires ...]
[rank0]:[E102 17:12:53.298703424 ProcessGroupNCCL.cpp:2057] [PG ID 5 PG GUID 21 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal instruction was encountered
Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xfffb4459c700 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x224 (0xfffb44653574 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x4c (0xfffb451ee0fc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x74 (0xfffb4520d404 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x770 (0xfffb45214460 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xc8 (0xfffb45215e18 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe1ae0 (0xfffdfc6d1ae0 in /usr/lib/aarch64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x8595c (0xfffdfece595c in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xebb0c (0xfffdfed4bb0c in /usr/lib/aarch64-linux-gnu/libc.so.6)
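For reference, the dumped fields line up with the parameters of the CUDA driver API cuTensorMapEncodeTiled. The sketch below is a hypothetical host-side re-encoding of the same descriptor, useful only for checking whether the parameter combination itself is legal. The enum mappings in the comments (format 9 as CU_TENSOR_MAP_DATA_TYPE_BFLOAT16, swizzle 2 as CU_TENSOR_MAP_SWIZZLE_64B, l2Promotion 2 as CU_TENSOR_MAP_L2_PROMOTION_L2_128B) are assumptions, since the dump is produced by the attention backend's own initializer and may use internal enums.

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxSetCurrent(ctx);

    // Values copied from the dump. The driver API takes rank-1 stride
    // entries in bytes; 14336 = 7168 * sizeof(bf16), and the leading "2"
    // in the dumped globalStrides is assumed to be the element size.
    cuuint64_t globalDim[3]      = {7168, 119384, 1};
    cuuint64_t globalStrides[2]  = {14336, 14336ULL * 119384};
    cuuint32_t boxDim[3]         = {32, 128, 1};   // 32 * 2 B = 64 B box row, fits a 64B swizzle
    cuuint32_t elementStrides[3] = {1, 1, 1};

    // cuTensorMapEncodeTiled requires globalAddress to be 16-byte aligned
    // when interleave is NONE; the logged gmem_address 0xfffd770e9c0c is
    // only 4-byte aligned (it is unclear whether the dump prints the raw
    // base pointer, so this may or may not be the actual problem).
    CUdeviceptr base;
    cuMemAlloc(&base, 14336ULL * 119384);

    CUtensorMap map;
    CUresult rc = cuTensorMapEncodeTiled(
        &map,
        CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,    // "format 9" (assumed mapping)
        /*tensorRank=*/3,                    // "dim 3"
        (void *)base,
        globalDim, globalStrides, boxDim, elementStrides,
        CU_TENSOR_MAP_INTERLEAVE_NONE,       // "interleave 0"
        CU_TENSOR_MAP_SWIZZLE_64B,           // "swizzle 2" (assumed mapping)
        CU_TENSOR_MAP_L2_PROMOTION_L2_128B,  // "l2Promotion 2" (assumed mapping)
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);  // "oobFill 0"
    printf("cuTensorMapEncodeTiled -> %d\n", (int)rc);
    cuMemFree(base);
    return rc != CUDA_SUCCESS;
}
```

If this host-side encode succeeds with a properly aligned base pointer, the failure is more likely in the device-side initialization path than in the parameter values themselves.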
Reproduction
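The exact launch commands were not captured. A hypothetical invocation consistent with the description above might look like the following; the model path, parallelism, and disaggregation transfer settings are placeholders, not the actual configuration:

```
# Hypothetical sketch only: $MODEL, --tp, and any required
# disaggregation transfer settings are placeholders.
python -m sglang.launch_server --model-path $MODEL --tp 4 \
  --attention-backend trtllm_mla --chunked-prefill-size -1 \
  --disaggregation-mode prefill

python -m sglang.launch_server --model-path $MODEL --tp 4 \
  --attention-backend trtllm_mla --chunked-prefill-size -1 \
  --disaggregation-mode decode
```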
Environment
Python: 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GB300
GPU 0,1,2,3 Compute Capability: 10.3
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.88
CUDA Driver Version: 580.95.05
PyTorch: 2.9.1+cu130
sglang: 0.5.6.post2
sgl_kernel: 0.3.19
flashinfer_python: 0.5.3
flashinfer_cubin: 0.5.3
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.2
fastapi: 0.124.4
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.5
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
        GPU0   GPU1   GPU2   GPU3   NIC0   NIC1   NIC2   NIC3   NIC4   NIC5   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0     X     NV16   NV16   NV16   NODE   NODE   SYS    SYS    SYS    SYS    0-71           0               2
GPU1    NV16    X     NV16   NV16   NODE   NODE   SYS    SYS    SYS    SYS    0-71           0               10
GPU2    NV16   NV16    X     NV16   SYS    SYS    NODE   NODE   NODE   NODE   72-143         1               18
GPU3    NV16   NV16   NV16    X     SYS    SYS    NODE   NODE   NODE   NODE   72-143         1               26
NIC0    NODE   NODE   SYS    SYS     X     NODE   SYS    SYS    SYS    SYS
NIC1    NODE   NODE   SYS    SYS    NODE    X     SYS    SYS    SYS    SYS
NIC2    SYS    SYS    NODE   NODE   SYS    SYS     X     NODE   NODE   NODE
NIC3    SYS    SYS    NODE   NODE   SYS    SYS    NODE    X     NODE   NODE
NIC4    SYS    SYS    NODE   NODE   SYS    SYS    NODE   NODE    X     PIX
NIC5    SYS    SYS    NODE   NODE   SYS    SYS    NODE   NODE   PIX     X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
ulimit soft: 1048576