
[Bug] TMA issue when running long context PD+MTP with chunked-prefill-size: -1 in Prefill worker #16327

@YAMY1234

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

When running SGLang disaggregation (prefill+decode) on GB300 (SM103) with the TRTLLM MLA attention backend and chunked-prefill-size=-1, the server intermittently crashes with repeated:

  • Error: Failed to initialize the TMA descriptor 1

followed by a CUDA failure:

  • CUDA error: an illegal instruction was encountered (reported by ProcessGroupNCCL watchdog)

This appears to be triggered during execution of a kernel that initializes a TMA descriptor (the dumps show the same globalDim/globalStrides/boxDim every time), and the error repeats several times before the process terminates.

TMA Desc Addr:   0xffffffff5680
format         9
dim            3
gmem_address   0xfffd770e9c0c
globalDim      (7168,119384,1,1,1)
globalStrides  (2,14336,0,0,0)
boxDim         (32,128,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
... (the same descriptor dump and "Error: Failed to initialize the TMA descriptor 1" repeat several more times; later repetitions show TMA Desc Addr 0xffffffff5100 with otherwise identical fields) ...
[rank0]:[E102 17:12:53.298703424 ProcessGroupNCCL.cpp:2057] [PG ID 5 PG GUID 21 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal instruction was encountered
Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xfffb4459c700 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x224 (0xfffb44653574 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x4c (0xfffb451ee0fc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x74 (0xfffb4520d404 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x770 (0xfffb45214460 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xc8 (0xfffb45215e18 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe1ae0 (0xfffdfc6d1ae0 in /usr/lib/aarch64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x8595c (0xfffdfece595c in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xebb0c (0xfffdfed4bb0c in /usr/lib/aarch64-linux-gnu/libc.so.6)
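
For reference, the CUDA driver API that creates these descriptors (`cuTensorMapEncodeTiled`) documents hard constraints on the fields printed in the dump: the global address must be 16-byte aligned, each `boxDim` entry must be at most 256, and `globalStrides` beyond dimension 0 must be multiples of 16 bytes. The dump can be checked against them mechanically. A minimal sketch, assuming the dumped fields map directly onto the `cuTensorMapEncodeTiled` parameters and that the printed strides are in bytes (both assumptions, since the dump format is TRTLLM-internal):

```python
# Hypothetical sanity check of the TMA descriptor fields from the log above,
# against the documented cuTensorMapEncodeTiled constraints:
#   - globalAddress must be 16-byte aligned
#   - each boxDim entry must be <= 256
#   - each globalStride past dim 0 must be a multiple of 16 (bytes)
gmem_address = 0xFFFD770E9C0C
global_dim = (7168, 119384, 1)
global_strides = (2, 14336, 0)  # as printed; byte vs. element units is an assumption
box_dim = (32, 128, 1)

checks = {
    "gmem 16B aligned": gmem_address % 16 == 0,
    "boxDim <= 256": all(d <= 256 for d in box_dim),
    "strides % 16 == 0": all(s % 16 == 0 for s in global_strides[1:] if s),
}
for name, ok in checks.items():
    print(f"{name}: {'OK' if ok else 'VIOLATED'}")
```

Note that the logged gmem_address ends in 0xc, i.e. it is not 16-byte aligned. Whether that particular requirement applies to this kernel's descriptor path is an assumption worth verifying, but it would be consistent with a descriptor-init failure that only appears for certain request shapes.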

Reproduction


  decode_environment:
    DYN_SKIP_SGLANG_LOG_FORMATTING: '1'
    FLASHINFER_DISABLE_VERSION_CHECK: '1'
    FLASHINFER_WORKSPACE_BASE: /configs/flashinfer-cache
    MC_FORCE_MNNVL: '1'
    NCCL_CUMEM_ENABLE: '1'
    NCCL_MNNVL_ENABLE: '1'
    PYTHONUNBUFFERED: '1'
    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000'
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1'
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000'
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000'
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000'
    SGLANG_ENABLE_JIT_DEEPGEMM: 'false'
    SGLANG_FLASHINFER_FP4_GEMM_BACKEND: cutlass
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True'
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0'
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800'
    SGLANG_ENABLE_SPEC_V2: '1'
    # DEBUG: Scale FutureMap buffer to test overwrite hypothesis (default=1, set 8/16 to test)
    SGLANG_FUTURE_MAP_SCALE: '8'

  prefill_environment:
    DYN_SKIP_SGLANG_LOG_FORMATTING: '1'
    FLASHINFER_DISABLE_VERSION_CHECK: '1'
    FLASHINFER_WORKSPACE_BASE: /configs/flashinfer-cache
    MC_FORCE_MNNVL: '1'
    NCCL_CUMEM_ENABLE: '1'
    NCCL_MNNVL_ENABLE: '1'
    PYTHONUNBUFFERED: '1'
    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000'
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1'
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000'
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000'
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000'
    SGLANG_ENABLE_JIT_DEEPGEMM: 'false'
    SGLANG_FLASHINFER_FP4_GEMM_BACKEND: cutlass
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True'
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0'
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800'
    SGLANG_ENABLE_SPEC_V2: '1'

  sglang_config:
    decode:
      attention-backend: trtllm_mla
      chunked-prefill-size: -1
      context-length: 136000
      cuda-graph-max-bs: 256
      data-parallel-size: 4
      disable-radix-cache: true
      disaggregation-bootstrap-port: 30001
      disaggregation-mode: decode
      disaggregation-transfer-backend: nixl
      enable-dp-attention: true
      enable-symm-mem: true
      expert-parallel-size: 4
      kv-cache-dtype: fp8_e4m3
      mem-fraction-static: 0.85
      model-path: /model/
      moe-dense-tp-size: 1
      moe-runner-backend: flashinfer_trtllm
      prefill-round-robin-balance: true
      quantization: modelopt_fp4
      scheduler-recv-interval: 1
      served-model-name: nvidia/DeepSeek-R1-0528-NVFP4-v2
      speculative-algorithm: "EAGLE"
      speculative-num-steps: 2
      speculative-eagle-topk: 1
      speculative-num-draft-tokens: 3
      stream-interval: 10
      tensor-parallel-size: 4
      trust-remote-code: true
      watchdog-timeout: 1000000

    prefill:
      attention-backend: trtllm_mla
      chunked-prefill-size: -1
      context-length: 136000
      data-parallel-size: 1
      disable-radix-cache: true
      disaggregation-bootstrap-port: 30001
      disaggregation-mode: prefill
      disaggregation-transfer-backend: nixl
      enable-symm-mem: true
      expert-parallel-size: 1
      kv-cache-dtype: fp8_e4m3
      load-balance-method: round_robin
      max-running-requests: 16
      mem-fraction-static: 0.72
      model-path: /model/
      moe-dense-tp-size: 1
      moe-runner-backend: flashinfer_trtllm
      pipeline-parallel-size: 1
      quantization: modelopt_fp4
      scheduler-recv-interval: 1
      served-model-name: nvidia/DeepSeek-R1-0528-NVFP4-v2
      stream-interval: 10
      tensor-parallel-size: 4
      trust-remote-code: true
      watchdog-timeout: 1000000
      speculative-algorithm: "EAGLE"
      speculative-num-steps: 2
      speculative-eagle-topk: 1
      speculative-num-draft-tokens: 3
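
For anyone reproducing this without the surrounding deployment tooling: the sglang_config keys above correspond one-to-one to `sglang.launch_server` CLI flags. A minimal sketch of that mapping, using a hypothetical `to_cli_args` helper and a trimmed subset of the prefill section (the exact invocation used in the original deployment is an assumption):

```python
# Build a launch command line from a config mapping like the prefill section above.
# Keys become --flags; boolean true values become bare switches.
def to_cli_args(cfg: dict) -> list[str]:
    args = []
    for key, value in cfg.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args

# Trimmed subset of the prefill config from this report.
prefill_cfg = {
    "attention-backend": "trtllm_mla",
    "chunked-prefill-size": -1,
    "disaggregation-mode": "prefill",
    "disable-radix-cache": True,
    "tensor-parallel-size": 4,
}
cmd = ["python", "-m", "sglang.launch_server"] + to_cli_args(prefill_cfg)
print(" ".join(cmd))
```

The key detail for this bug is that `chunked-prefill-size: -1` disables prefill chunking entirely, so a single long-context request drives the full sequence length into the attention kernel at once.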

Environment

Python: 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GB300
GPU 0,1,2,3 Compute Capability: 10.3
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.88
CUDA Driver Version: 580.95.05
PyTorch: 2.9.1+cu130
sglang: 0.5.6.post2
sgl_kernel: 0.3.19
flashinfer_python: 0.5.3
flashinfer_cubin: 0.5.3
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.2
fastapi: 0.124.4
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.5
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV16 NV16 NV16 NODE NODE SYS SYS SYS SYS 0-71 0 2
GPU1 NV16 X NV16 NV16 NODE NODE SYS SYS SYS SYS 0-71 0 10
GPU2 NV16 NV16 X NV16 SYS SYS NODE NODE NODE NODE 72-143 1 18
GPU3 NV16 NV16 NV16 X SYS SYS NODE NODE NODE NODE 72-143 1 26
NIC0 NODE NODE SYS SYS X NODE SYS SYS SYS SYS
NIC1 NODE NODE SYS SYS NODE X SYS SYS SYS SYS
NIC2 SYS SYS NODE NODE SYS SYS X NODE NODE NODE
NIC3 SYS SYS NODE NODE SYS SYS NODE X NODE NODE
NIC4 SYS SYS NODE NODE SYS SYS NODE NODE X PIX
NIC5 SYS SYS NODE NODE SYS SYS NODE NODE PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5

ulimit soft: 1048576
