
v0.5.12 DeepGemm regression on B300 (sm_103): CUDA_ERROR_ILLEGAL_ADDRESS in fp8_fp4_gemm_nt TMA descriptor init for shared-experts FP8 GEMM #25551

@functionstackx

Human

Exception: Capture cuda graph failed: CUDA driver error (/deepgemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143): 700 (CUDA_ERROR_ILLEGAL_ADDRESS, an illegal memory access was encountered)

AI Summary

lmsysorg/sglang:v0.5.12-cu130 regresses DeepGemm on NVIDIA B300 (Blackwell, sm_103). GLM-5-FP8 inference aborts during CUDA graph capture with CUDA_ERROR_ILLEGAL_ADDRESS (error 700) in DeepGemm's TMA descriptor init. The same workload runs cleanly on v0.5.11-cu130.

Environment

  • Image: lmsysorg/sglang:v0.5.12-cu130
  • Hardware: 8x NVIDIA B300 SXM6 (Blackwell, sm_103)
  • Driver / runtime: NVIDIA driver 580.126.09, CUDA 13.0
  • Model: zai-org/GLM-5-FP8
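
Because the architecture tag matters for triage, a short script can confirm what the GPUs in the container actually report. This is a minimal sketch, assuming PyTorch is importable inside the image; `sm_name` and `local_sm_names` are hypothetical helper names, not part of SGLang or DeepGemm. It formats the compute capability returned by `torch.cuda.get_device_capability` as an `sm_XY` tag, so capability 10.3 prints as `sm_103`.

```python
from typing import List


def sm_name(major: int, minor: int) -> str:
    """Format a CUDA compute capability (major, minor) as an sm_XY tag."""
    return f"sm_{major}{minor}"


def local_sm_names() -> List[str]:
    """Return the sm tag of every visible GPU.

    Returns an empty list when PyTorch is not installed or no CUDA device
    is visible, so the script is safe to run outside the container too.
    """
    try:
        import torch  # assumption: PyTorch is available in the image
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        sm_name(*torch.cuda.get_device_capability(index))
        for index in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    print(local_sm_names())
```

Attaching the printed list to a report removes any ambiguity about which architecture the crash was observed on.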

Reproduction

Launch the SGLang server with these flags (this is the recipe that triggers it; same flags work on v0.5.11-cu130):

docker run --gpus all --shm-size=32g --rm \
  -v $HF_HUB_CACHE:/root/.cache/huggingface \
  lmsysorg/sglang:v0.5.12-cu130 \
  bash -c "
    pip install --no-deps 'transformers==5.2.0' 'huggingface-hub==1.4.1' && \
    export SGL_ENABLE_JIT_DEEPGEMM=1 && \
    python3 -m sglang.launch_server \
      --model-path=zai-org/GLM-5-FP8 \
      --host=0.0.0.0 --port=8888 --trust-remote-code \
      --tensor-parallel-size=8 \
      --data-parallel-size 1 --expert-parallel-size 1 \
      --tool-call-parser glm47 --reasoning-parser glm45 \
      --kv-cache-dtype fp8_e4m3 --quantization fp8 \
      --attention-backend nsa \
      --nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
      --moe-runner-backend flashinfer_trtllm \
      --cuda-graph-max-bs 128 --max-running-requests 128 \
      --mem-fraction-static 0.85 \
      --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
      --enable-flashinfer-allreduce-fusion --disable-radix-cache \
      --stream-interval 30 \
      --model-loader-extra-config '{\"enable_multithread_load\": true}'
  "

The server loads the model, then aborts on the first CUDA graph capture iteration.

Failing GitHub Action runs (full logs)

Symptom

All TP workers crash simultaneously during CUDA graph capture on the first batch size processed. No prompt is served.

RuntimeError: Error in function 'run' at
/deepgemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143
CUDA_ERROR_ILLEGAL_ADDRESS (700)
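
When comparing many CI logs, it can help to extract the failing source location and driver error code mechanically. The sketch below parses the one-line "CUDA driver error" message quoted at the top of this issue. The regular expression is inferred from that single message, so the format is an assumption and may not match other DeepGemm versions; `DriverError` and `parse_driver_error` are hypothetical names.

```python
import posixpath
import re
from typing import NamedTuple, Optional


class DriverError(NamedTuple):
    path: str  # source file, with "apis/.." style segments collapsed
    line: int  # line number reported by the runtime
    code: int  # numeric CUDA driver error code
    name: str  # symbolic CUDA driver error name


# Assumed format, taken from the single log line quoted in this issue.
_DRIVER_ERROR = re.compile(
    r"CUDA driver error \((?P<file>[^():\s]+):(?P<line>\d+)\): "
    r"(?P<code>\d+) \((?P<name>CUDA_ERROR_\w+)"
)


def parse_driver_error(text: str) -> Optional[DriverError]:
    """Return the first driver error found in text, or None if there is none."""
    match = _DRIVER_ERROR.search(text)
    if match is None:
        return None
    return DriverError(
        path=posixpath.normpath(match["file"]),
        line=int(match["line"]),
        code=int(match["code"]),
        name=match["name"],
    )
```

Running this over each worker's log makes it easy to check whether every TP rank fails at the same file and line.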

Call path

cuda_graph_runner.capture()
  → deepseek_v2.forward()
      → MoE layer forward_normal_dual_stream
          → _forward_shared_experts
              → shared_experts.gate_up_proj
                  → fp8_kernel.deep_gemm_fp8_fp8_bf16_nt
                      → deep_gemm_wrapper.gemm_nt_f8f8bf16
                          → deep_gemm.fp8_gemm_nt
                              → native fp8_fp4_gemm_nt
                                  → runtime_utils.hpp:143  ← CRASH

runtime_utils.hpp:143 is the TMA descriptor validation/creation site, suggesting the regression is in how the bundled DeepGemm builds TMA descriptors for Blackwell (sm_103) in the FP8 shared-experts path.

Working baseline

lmsysorg/sglang:v0.5.11-cu130 on the same recipe / hardware / model runs cleanly through CUDA graph capture and full inference.

Workarounds that unblock the workload

  1. --fp8-gemm-runner-backend cutlass — bypasses DeepGemm via the CUTLASS FP8 path, runs to completion.
  2. --disable-cuda-graph — avoids the capture path (large perf hit; smoke-test only).
  3. Pin to v0.5.11-cu130.
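
For launch scripts that must keep running on v0.5.12, the first two workarounds can be applied conditionally. The following is a minimal sketch, not an SGLang API: `is_reported_config` and `apply_workaround` are hypothetical helpers, the flags are the ones listed above, and the version and capability constants cover only the configuration reported in this issue (other configurations may or may not be affected).

```python
from typing import Dict, List, Sequence, Tuple

# Only the configuration reported in this issue; an assumption, not a full list.
AFFECTED_VERSION = "0.5.12"
AFFECTED_CAPABILITY = (10, 3)  # B300, sm_103

# Flags taken from the workaround list in this issue.
WORKAROUND_FLAGS: Dict[str, List[str]] = {
    "cutlass": ["--fp8-gemm-runner-backend", "cutlass"],
    "no-cuda-graph": ["--disable-cuda-graph"],  # large perf hit; smoke-test only
}


def is_reported_config(version: str, capability: Tuple[int, int]) -> bool:
    """True when version and GPU capability match the reported failing setup."""
    return (
        version.lstrip("v") == AFFECTED_VERSION
        and tuple(capability) == AFFECTED_CAPABILITY
    )


def apply_workaround(argv: Sequence[str], name: str) -> List[str]:
    """Return a copy of argv with the named workaround's flags appended."""
    return list(argv) + WORKAROUND_FLAGS[name]


if __name__ == "__main__":
    base = ["python3", "-m", "sglang.launch_server", "--quantization", "fp8"]
    if is_reported_config("v0.5.12", (10, 3)):
        base = apply_workaround(base, "cutlass")
    print(" ".join(base))
```

Keeping the check explicit makes the workaround easy to delete once a fixed image is available.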

Suggested next step for maintainers

Review the diff between v0.5.11 and v0.5.12 in both the bundled DeepGemm vendoring and the shared-experts FP8 GEMM dispatch on Blackwell. The crash site, deep_gemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp (TMA descriptor validation), suggests that the addresses or strides passed to the TMA descriptor are the likely regression site.
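
To narrow down whether the operands themselves are malformed, the basic preconditions for a tiled tensor map can be checked on the host before the GEMM is called. The sketch below encodes two rules as I understand them from the CUDA driver documentation for `cuTensorMapEncodeTiled`: the global address must be 16-byte aligned, and each global stride (in bytes) must be a multiple of 16 and less than 2^40. It is an assumption that these are the relevant constraints here; DeepGemm may apply additional checks, and a passing result only narrows the search. `tma_precondition_violations` is a hypothetical helper.

```python
from typing import List, Sequence

TMA_ALIGN_BYTES = 16        # required alignment of the global address
TMA_STRIDE_LIMIT = 1 << 40  # strides in bytes must be below this value


def tma_precondition_violations(
    base_address: int, strides_bytes: Sequence[int]
) -> List[str]:
    """List violated preconditions for a tensor's address and byte strides.

    strides_bytes should hold the strides of the non-contiguous dimensions,
    expressed in bytes rather than elements.
    """
    problems: List[str] = []
    if base_address % TMA_ALIGN_BYTES != 0:
        problems.append(
            f"base address {base_address:#x} is not {TMA_ALIGN_BYTES}-byte aligned"
        )
    for dim, stride in enumerate(strides_bytes):
        if stride % TMA_ALIGN_BYTES != 0:
            problems.append(
                f"stride[{dim}]={stride} is not a multiple of {TMA_ALIGN_BYTES}"
            )
        if stride >= TMA_STRIDE_LIMIT:
            problems.append(f"stride[{dim}]={stride} is not below 2**40")
    return problems
```

With a PyTorch tensor `t`, the inputs would be `t.data_ptr()` and `t.stride(i) * t.element_size()` for each non-contiguous dimension, logged just before the shared-experts GEMM on both v0.5.11 and v0.5.12 so the two runs can be compared.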

Happy to provide additional traces or trigger more runs — let me know what's most useful.
