v0.5.12 DeepGemm regression on B300 (sm_103): CUDA_ERROR_ILLEGAL_ADDRESS in fp8_fp4_gemm_nt TMA descriptor init for shared-experts FP8 GEMM

## Human

```
Exception: Capture cuda graph failed: CUDA driver error (/deepgemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143): 700 (CUDA_ERROR_ILLEGAL_ADDRESS, an illegal memory access was encountered)
```

## AI Summary
`lmsysorg/sglang:v0.5.12-cu130` introduces a DeepGemm regression on NVIDIA B300 (Blackwell, sm_103). GLM-5-FP8 inference aborts during CUDA graph capture with `CUDA_ERROR_ILLEGAL_ADDRESS` (error 700) in DeepGemm's TMA descriptor init. The same workload runs cleanly on `v0.5.11-cu130`.

## Environment
- **Image**: `lmsysorg/sglang:v0.5.12-cu130`
- **Hardware**: 8x NVIDIA B300 SXM6 (Blackwell, **sm_103**)
- **Driver / runtime**: NVIDIA driver 580.126.09, CUDA 13.0
- **Model**: `zai-org/GLM-5-FP8`
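
The architecture string can be confirmed from inside the container. The snippet below is a minimal sketch, not part of SGLang or DeepGemm: `sm_name` and `local_sm_names` are hypothetical helper names, and the only library call used is PyTorch's `torch.cuda.get_device_capability`.

```python
from typing import List, Tuple


def sm_name(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def local_sm_names() -> List[str]:
    """Return the sm_XY string of every visible CUDA device.

    torch is imported lazily so that sm_name() stays usable on machines
    without PyTorch or without a GPU.
    """
    import torch

    return [
        sm_name(torch.cuda.get_device_capability(index))
        for index in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    # Pure formatting example; no GPU is needed for this line.
    print(sm_name((10, 3)))
```

On the affected nodes, `local_sm_names()` should return one entry per GPU. A value other than `sm_103` would suggest the report is describing different hardware than the B300 listed above.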

## Reproduction

Launch the SGLang server with these flags (this is the recipe that triggers it; same flags work on v0.5.11-cu130):

```bash
docker run --gpus all --shm-size=32g --rm \
  -v $HF_HUB_CACHE:/root/.cache/huggingface \
  lmsysorg/sglang:v0.5.12-cu130 \
  bash -c "
    pip install --no-deps 'transformers==5.2.0' 'huggingface-hub==1.4.1' && \
    export SGL_ENABLE_JIT_DEEPGEMM=1 && \
    python3 -m sglang.launch_server \
      --model-path=zai-org/GLM-5-FP8 \
      --host=0.0.0.0 --port=8888 --trust-remote-code \
      --tensor-parallel-size=8 \
      --data-parallel-size 1 --expert-parallel-size 1 \
      --tool-call-parser glm47 --reasoning-parser glm45 \
      --kv-cache-dtype fp8_e4m3 --quantization fp8 \
      --attention-backend nsa \
      --nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
      --moe-runner-backend flashinfer_trtllm \
      --cuda-graph-max-bs 128 --max-running-requests 128 \
      --mem-fraction-static 0.85 \
      --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
      --enable-flashinfer-allreduce-fusion --disable-radix-cache \
      --stream-interval 30 \
      --model-loader-extra-config '{\"enable_multithread_load\": true}'
  "
```

The server loads the model, then aborts at the first CUDA graph capture iteration.

## Failing GitHub Action runs (full logs)
- Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25984496952
- Failing jobs (identical crash on all TP ranks):
  - https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25984496952/job/76379310468 (8k1k, conc-128)
  - https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25984496952/job/76379310585 (1k1k, conc-128)
- Source PR (image bump v0.5.11-cu130 → v0.5.12-cu130): https://github.com/SemiAnalysisAI/InferenceX/pull/1421

## Symptom
All TP workers crash simultaneously during CUDA graph capture on the **first** batch size processed. No prompt is served.

```
RuntimeError: Error in function 'run' at
/deepgemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143
CUDA_ERROR_ILLEGAL_ADDRESS (700)
```

## Call path
```
cuda_graph_runner.capture()
  → deepseek_v2.forward()
      → MoE layer forward_normal_dual_stream
          → _forward_shared_experts
              → shared_experts.gate_up_proj
                  → fp8_kernel.deep_gemm_fp8_fp8_bf16_nt
                      → deep_gemm_wrapper.gemm_nt_f8f8bf16
                          → deep_gemm.fp8_gemm_nt
                              → native fp8_fp4_gemm_nt
                                  → runtime_utils.hpp:143  ← CRASH
```

`runtime_utils.hpp:143` is the TMA descriptor validation/creation site, which suggests the regression is in how the bundled DeepGemm builds TMA descriptors for Blackwell (sm_103) in the FP8 shared-experts path.

## Working baseline
`lmsysorg/sglang:v0.5.11-cu130` on the same recipe / hardware / model runs cleanly through CUDA graph capture and full inference.

## Workarounds that unblock the workload
1. `--fp8-gemm-runner-backend cutlass` — bypasses DeepGemm via the CUTLASS FP8 path, runs to completion.
2. `--disable-cuda-graph` — avoids the capture path (large perf hit; smoke-test only).
3. Pin to `v0.5.11-cu130`.

## Suggested next step for maintainers
Diff v0.5.11 against v0.5.12 in both the bundled DeepGemm vendoring and the shared-experts FP8 GEMM dispatch on Blackwell. In `deep_gemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp` (TMA descriptor validation), the addresses or strides passed to the TMA descriptor are the likely regression site.
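
To narrow down whether an address or a stride is at fault, the operands of the failing GEMM could be checked against the basic constraints that the CUDA driver documents for tiled tensor maps (`cuTensorMapEncodeTiled`): a 16-byte-aligned global address, and outer strides that are multiples of 16 bytes and below 2^40. The sketch below is an assumption-laden illustration, not DeepGemm's own validation: `TmaOperand` and `check_tma_operand` are hypothetical names, and the checks cover only the basic case (no interleaving or packed element types).

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TmaOperand:
    """Hypothetical description of one GEMM operand passed to a TMA descriptor."""
    name: str
    data_ptr: int            # device address of the first element
    shape: Sequence[int]     # sizes in elements, outermost dimension first
    strides: Sequence[int]   # strides in elements, outermost dimension first
    element_size: int        # bytes per element (1 for FP8, 2 for BF16)


def check_tma_operand(op: TmaOperand) -> List[str]:
    """Return human-readable violations of the basic tiled tensor-map rules."""
    problems: List[str] = []

    # The global address must be 16-byte aligned.
    if op.data_ptr % 16 != 0:
        problems.append(f"{op.name}: address {op.data_ptr:#x} is not 16-byte aligned")

    # The innermost dimension has no explicit stride in a tensor map,
    # so it has to be contiguous.
    if op.strides[-1] != 1:
        problems.append(f"{op.name}: innermost stride is {op.strides[-1]}, expected 1")

    # Every outer stride, in bytes, must be a multiple of 16 and below 2**40.
    for dim, stride in enumerate(op.strides[:-1]):
        stride_bytes = stride * op.element_size
        if stride_bytes % 16 != 0:
            problems.append(
                f"{op.name}: stride of dim {dim} is {stride_bytes} bytes, "
                "not a multiple of 16"
            )
        if stride_bytes >= 1 << 40:
            problems.append(f"{op.name}: stride of dim {dim} is 2**40 bytes or more")

    # Every dimension must be non-zero and at most 2**32 elements.
    for dim, size in enumerate(op.shape):
        if size <= 0 or size > 1 << 32:
            problems.append(f"{op.name}: size of dim {dim} is out of range ({size})")

    return problems
```

For a PyTorch tensor `t`, an operand could be built as `TmaOperand("lhs", t.data_ptr(), tuple(t.shape), t.stride(), t.element_size())` just before the `fp8_gemm_nt` call. An empty result does not rule out a descriptor bug; it only excludes these particular preconditions.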

Happy to provide additional traces or trigger more runs — let me know what's most useful.

