human
lower confidence that this is a bug tbh idk.
Error occurred when running GEMM! (numBatches: 256, GemmMNK: 128 1024 6144,
Kernel: bmm_Bfloat16_Bfloat16Bfloat16_Fp32_t128x8x128u2_s6_et128x8_m128x8x16
_c1x1x1_16dp256b_rM_BN_transOut_schPd2x1x2x3_bN_ldgsts_ldgstsSf
_rgTma_clmp_swiGlu_dynB_sm100f)
AI-generated below
Summary
On NVIDIA B300 (Blackwell Ultra, sm_103), lmsysorg/sglang:v0.5.12-cu130 consistently crashes during the draft model's CUDA graph capture at the largest batch size (bs=128) for GLM-5-NVFP4 with EAGLE speculative decoding. The target model loads and captures graphs cleanly; draft capture then throws Exception: Capture cuda graph failed: Error in function 'run' at /workspace/csrc/trtllm_batched_gemm_runner.cu:276. This is a trtllm batched-GEMM failure in which the dispatched kernel is suffixed sm100f (Blackwell base + feature flag) while running on sm_103 hardware. Full kernel name and analysis below.
The same recipe pinned to lmsysorg/sglang:v0.5.11-cu130 works fine; only the v0.5.12 image regressed.
Environment
| Setting | Value |
|---|---|
| sglang image | lmsysorg/sglang:v0.5.12-cu130 |
| Hardware | NVIDIA B300 (sm_103), 4× GPU per node |
| Model | nvidia/GLM-5-NVFP4 (DeepseekV3ForCausalLMNextN, NVFP4) |
| Speculative decoding | EAGLE, --speculative-num-steps=3 --speculative-eagle-topk=1 --speculative-num-draft-tokens=4 |
| Tensor parallelism | TP=4, EP=1 |
| Attention backend | NSA (--attention-backend=nsa --nsa-decode-backend=trtllm --nsa-prefill-backend=trtllm) |
| MoE backend | --moe-runner-backend=flashinfer_trtllm |
| KV cache dtype | fp8_e4m3 |
| Quantization | fp8 (target), nvfp4 weights, modelopt_fp4 quant algo |
| Trigger batch size | --cuda-graph-max-bs=128 --max-running-requests=128 |
Previously known-good image: lmsysorg/sglang:v0.5.11-cu130. v0.5.10.post1-cu130 was also fine.
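To confirm that a node really reports sm_103 before comparing against the kernel's suffix, something like the following could be run inside the container. This is a minimal sketch, not part of the original recipe: it assumes `nvidia-smi` is on the PATH and that the installed driver supports the `compute_cap` query field. The helper names `parse_compute_caps` and `query_gpus` are hypothetical.

```python
import subprocess


def parse_compute_caps(csv_text):
    """Parse 'name, compute_cap' CSV rows into (name, (major, minor)) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a GPU name containing commas stays intact.
        name, cap = (field.strip() for field in line.rsplit(",", 1))
        major, minor = (int(part) for part in cap.split("."))
        gpus.append((name, (major, minor)))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and compute capability."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compute_caps(result.stdout)
```

Calling `query_gpus()` on the affected node and formatting each entry as `f"sm_{major}{minor}"` gives a value that can be pasted straight into the Environment table.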
Repro recipe
python3 -m sglang.launch_server \
--model-path=nvidia/GLM-5-NVFP4 \
--host=0.0.0.0 --port=8888 \
--trust-remote-code \
--tensor-parallel-size=4 --data-parallel-size 1 --expert-parallel-size 1 \
--tool-call-parser glm47 --reasoning-parser glm45 \
--kv-cache-dtype fp8_e4m3 --quantization fp8 \
--attention-backend nsa \
--nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
--moe-runner-backend flashinfer_trtllm \
--cuda-graph-max-bs 128 --max-running-requests 128 \
--mem-fraction-static 0.85 \
--chunked-prefill-size 32768 --max-prefill-tokens 32768 \
--enable-flashinfer-allreduce-fusion \
--disable-radix-cache --stream-interval 30 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--model-loader-extra-config '{"enable_multithread_load": true}'
Observed timeline (per-TP rank, condensed from the worker log)
[ TP0..3] Capture cuda graph end. Time elapsed: 25.85 s. mem usage=2.52 GB. avail mem=34.62 GB. ← target model OK
[ TP0..3] Load weight end. elapsed=4.92 s, type=DeepseekV3ForCausalLMNextN, quant=modelopt_fp4,
quant_algo=NVFP4, avail mem=28.94 GB, mem usage=5.68 GB. ← draft model loads OK
[ TP0..3] KV Cache is allocated. #tokens: 2317696, KV size: 1.53 GB
[ TP0..3] Memory pool end. avail mem=27.41 GB
[ TP0] Capture draft cuda graph begin. This can take up to several minutes. avail mem=28.30 GB
Capturing batches (bs=128 avail_mem=28.11 GB): 0%| | 0/35 [00:00<?, ?it/s]
[ TP* ] Exception: Capture cuda graph failed: Error in function 'run' at
/workspace/csrc/trtllm_batched_gemm_runner.cu:276:
Error occurred when running GEMM! (numBatches: 256, GemmMNK: 128 1024 6144,
Kernel: bmm_Bfloat16_Bfloat16Bfloat16_Fp32_t128x8x128u2_s6_et128x8_m128x8x16
_c1x1x1_16dp256b_rM_BN_transOut_schPd2x1x2x3_bN_ldgsts_ldgstsSf
_rgTma_clmp_swiGlu_dynB_sm100f)
benchmark_lib.sh: line 97: 1767290 Killed python3 -m sglang.launch_server ...
Server died before becoming healthy. Exiting.
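For comparing failures across images or ranks, the GEMM error above can be turned into structured fields. The sketch below is illustrative only: the regular expression is written against the message text exactly as it appears in this log, and the function name `parse_gemm_error` is hypothetical. It treats the final underscore-separated token of the kernel name as the architecture suffix, which is an assumption about the naming scheme rather than documented behavior.

```python
import re

# Matches the "(numBatches: ..., GemmMNK: M N K, Kernel: ...)" part of the message.
GEMM_ERROR_RE = re.compile(
    r"numBatches:\s*(?P<batches>\d+),\s*"
    r"GemmMNK:\s*(?P<m>\d+)\s+(?P<n>\d+)\s+(?P<k>\d+),\s*"
    r"Kernel:\s*(?P<kernel>[^)]+)\)"
)


def parse_gemm_error(log_text):
    """Return the GEMM shape and kernel name from a log excerpt, or None."""
    match = GEMM_ERROR_RE.search(log_text)
    if match is None:
        return None
    # The log wraps the kernel name across lines; drop all whitespace to rejoin it.
    kernel = re.sub(r"\s+", "", match.group("kernel"))
    return {
        "num_batches": int(match.group("batches")),
        "mnk": (int(match.group("m")), int(match.group("n")), int(match.group("k"))),
        "kernel": kernel,
        # Assumption: the last "_"-separated token names the target architecture.
        "arch_suffix": kernel.rsplit("_", 1)[-1],
    }
```

Running this over the v0.5.11 and v0.5.12 logs for the same batch size would show whether the two images dispatch the same kernel for this shape.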
Workarounds we've tried
| Workaround | Result |
|---|---|
| Pin to lmsysorg/sglang:v0.5.11-cu130 | ✅ works |
| --cuda-graph-max-bs 64 --max-running-requests 64 (cap below bs=128) | Not yet tested, but likely sidesteps the trigger; suggested per KLAUD_DEBUG playbook |
| Disable EAGLE (drop speculative decoding) | Likely works (only the draft graph capture crashes), but defeats the recipe's purpose |
| SGL_ENABLE_JIT_DEEPGEMM=0 | Not tried; different code path |
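To try the untested rows without hand-editing the launch command each time, the variants can be derived from one argument list. This is a rough sketch under stated assumptions: `BASE_ARGS` is an abbreviated copy of the repro recipe (the real command has more flags), the helper names are hypothetical, and the helpers only handle flags written as `--flag value`, not `--flag=value`. Whether either variant actually avoids the crash is still unverified.

```python
# Abbreviated subset of the repro recipe; extend with the remaining flags as needed.
BASE_ARGS = [
    "python3", "-m", "sglang.launch_server",
    "--model-path=nvidia/GLM-5-NVFP4",
    "--cuda-graph-max-bs", "128",
    "--max-running-requests", "128",
    "--speculative-algorithm", "EAGLE",
    "--speculative-num-steps", "3",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
]

SPECULATIVE_FLAGS = {
    "--speculative-algorithm",
    "--speculative-num-steps",
    "--speculative-eagle-topk",
    "--speculative-num-draft-tokens",
}


def set_flag(args, flag, value):
    """Return a copy of args with `flag` set to `value`, replacing or appending."""
    out = list(args)
    if flag in out:
        out[out.index(flag) + 1] = str(value)
    else:
        out += [flag, str(value)]
    return out


def cap_batch_size(args, bs):
    """Workaround variant: cap both CUDA graph batch size and running requests."""
    capped = set_flag(args, "--cuda-graph-max-bs", bs)
    return set_flag(capped, "--max-running-requests", bs)


def disable_eagle(args):
    """Workaround variant: drop every speculative-decoding flag and its value."""
    out, skip_next = [], False
    for arg in args:
        if skip_next:
            skip_next = False
        elif arg in SPECULATIVE_FLAGS:
            skip_next = True  # also skip the value that follows the flag
        else:
            out.append(arg)
    return out
```

For example, `" ".join(cap_batch_size(BASE_ARGS, 64))` yields the command for the second row of the table, and `disable_eagle(BASE_ARGS)` the third.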
Source
glm5-fp4-b300-sglang/glm5-fp4-b300-sglang-mtp in https://github.com/SemiAnalysisAI/InferenceX/blob/main/.github/configs/nvidia-master.yaml
Happy to attach a full server.log if useful.