Exception: Capture cuda graph failed: CUDA driver error (/deepgemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143): 700 (CUDA_ERROR_ILLEGAL_ADDRESS, an illegal memory access was encountered)
AI Summary
lmsysorg/sglang:v0.5.12-cu130 regresses DeepGemm on NVIDIA B300 (Blackwell, sm_103). GLM-5-FP8 inference aborts during CUDA graph capture with CUDA_ERROR_ILLEGAL_ADDRESS (error 700) in DeepGemm's TMA descriptor init. The same workload runs cleanly on v0.5.11-cu130.
Environment
- Image: lmsysorg/sglang:v0.5.12-cu130
- Hardware: 8x NVIDIA B300 SXM6 (Blackwell, sm_103)
- Driver / runtime: NVIDIA driver 580.126.09, CUDA 13.0
- Model: zai-org/GLM-5-FP8
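Blackwell-generation parts report different compute capabilities, so it can help to confirm the sm_ target on the failing host. The sketch below converts a compute capability string into an sm_ tag. cap_to_sm is a hypothetical helper name, and the compute_cap query field is assumed to be available in the installed nvidia-smi.

```shell
# Hypothetical helper: turn a compute capability such as "10.3" into "sm_103".
cap_to_sm() {
  printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# On the GPU host (assumes nvidia-smi supports the compute_cap query field):
#   cap_to_sm "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
```

On the B300 hosts listed above, the result should match the sm_103 given in the hardware line.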
Reproduction
Launch the SGLang server with these flags (this is the recipe that triggers it; same flags work on v0.5.11-cu130):
docker run --gpus all --shm-size=32g --rm \
  -v $HF_HUB_CACHE:/root/.cache/huggingface \
  lmsysorg/sglang:v0.5.12-cu130 \
  bash -c "
    pip install --no-deps 'transformers==5.2.0' 'huggingface-hub==1.4.1' && \
    export SGL_ENABLE_JIT_DEEPGEMM=1 && \
    python3 -m sglang.launch_server \
      --model-path=zai-org/GLM-5-FP8 \
      --host=0.0.0.0 --port=8888 --trust-remote-code \
      --tensor-parallel-size=8 \
      --data-parallel-size 1 --expert-parallel-size 1 \
      --tool-call-parser glm47 --reasoning-parser glm45 \
      --kv-cache-dtype fp8_e4m3 --quantization fp8 \
      --attention-backend nsa \
      --nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
      --moe-runner-backend flashinfer_trtllm \
      --cuda-graph-max-bs 128 --max-running-requests 128 \
      --mem-fraction-static 0.85 \
      --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
      --enable-flashinfer-allreduce-fusion --disable-radix-cache \
      --stream-interval 30 \
      --model-loader-extra-config '{\"enable_multithread_load\": true}'
  "
Server loads the model, then aborts at the first CUDA graph capture iteration.
Failing GitHub Action runs (full logs)
Symptom
All TP workers crash simultaneously during CUDA graph capture on the first batch size processed. No prompt is served.
RuntimeError: Error in function 'run' at
/deepgemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143
CUDA_ERROR_ILLEGAL_ADDRESS (700)
Call path
cuda_graph_runner.capture()
→ deepseek_v2.forward()
→ MoE layer forward_normal_dual_stream
→ _forward_shared_experts
→ shared_experts.gate_up_proj
→ fp8_kernel.deep_gemm_fp8_fp8_bf16_nt
→ deep_gemm_wrapper.gemm_nt_f8f8bf16
→ deep_gemm.fp8_gemm_nt
→ native fp8_fp4_gemm_nt
→ runtime_utils.hpp:143 ← CRASH
runtime_utils.hpp:143 is the TMA descriptor validation/creation site, which suggests the regression is in how the bundled DeepGemm builds TMA descriptors for Blackwell (sm_103) in the FP8 shared-experts path.
Working baseline
lmsysorg/sglang:v0.5.11-cu130 on the same recipe / hardware / model runs cleanly through CUDA graph capture and full inference.
Workarounds that unblock the workload
- --fp8-gemm-runner-backend cutlass: bypasses DeepGemm via the CUTLASS FP8 path; runs to completion.
- --disable-cuda-graph: avoids the capture path (large perf hit; smoke-test only).
- Pin to v0.5.11-cu130.
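The two flag-based workarounds can be switched on without hand-editing the launch recipe. This is a minimal sketch, assuming a hypothetical helper named workaround_flags. The flags it emits are the ones listed above; nothing else about the launch command changes.

```shell
# Hypothetical helper: map a workaround name to the extra server flag(s).
workaround_flags() {
  case "$1" in
    cutlass)  echo "--fp8-gemm-runner-backend cutlass" ;;  # route FP8 GEMMs away from DeepGemm
    no-graph) echo "--disable-cuda-graph" ;;               # skip CUDA graph capture entirely
    *)        echo "" ;;                                   # unknown name: add nothing
  esac
}

# Usage: append the result to the launch_server invocation from the recipe, e.g.
#   python3 -m sglang.launch_server <flags from the recipe> $(workaround_flags cutlass)
```

Leaving the expansion unquoted is intentional here, so that the flag and its value are passed as separate arguments.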
Suggested next step for maintainers
Diff v0.5.11 against v0.5.12 in either the bundled DeepGemm vendoring or the shared-experts FP8 GEMM dispatch on Blackwell. In deep_gemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp (TMA descriptor validation), the addresses or strides passed to the TMA descriptor are the likely regression site.
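One quick way to narrow the vendoring side of that comparison is to compare the Python packages pinned in the two images. The sketch below is an illustration only: changed_pkgs is a hypothetical helper, and it assumes pip freeze output from each image has been saved to a file (the recipe above already relies on pip being present in the image).

```shell
# Hypothetical helper: print the lines that differ between two `pip freeze` dumps.
changed_pkgs() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  sort "$1" > "$old_sorted"
  sort "$2" > "$new_sorted"
  # comm -3 suppresses lines common to both inputs; tr strips comm's tab indent.
  comm -3 "$old_sorted" "$new_sorted" | tr -d '\t'
  rm -f "$old_sorted" "$new_sorted"
}

# Usage (run on a host with Docker; file names are arbitrary):
#   docker run --rm lmsysorg/sglang:v0.5.11-cu130 pip freeze > v0511.txt
#   docker run --rm lmsysorg/sglang:v0.5.12-cu130 pip freeze > v0512.txt
#   changed_pkgs v0511.txt v0512.txt
```

Any GEMM-related package that shows up in the output is a candidate for a closer source-level diff. If DeepGemm is built into the image rather than installed as a pip package, this check will not surface it.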
Happy to provide additional traces or trigger more runs — let me know what's most useful.