[Bug] GLM-5-NVFP4 + EAGLE on B300 (sm_103): trtllm_batched_gemm_runner.cu:276 dispatches sm100f kernel — crashes at bs=128 draft graph capture (v0.5.12-cu130; v0.5.11 works) #25563

@functionstackx

Description

human

Lower confidence that this is a bug tbh, idk.

Error occurred when running GEMM! (numBatches: 256, GemmMNK: 128 1024 6144,
          Kernel: bmm_Bfloat16_Bfloat16Bfloat16_Fp32_t128x8x128u2_s6_et128x8_m128x8x16
                  _c1x1x1_16dp256b_rM_BN_transOut_schPd2x1x2x3_bN_ldgsts_ldgstsSf
                  _rgTma_clmp_swiGlu_dynB_sm100f)

AI-generated below

Summary

On NVIDIA B300 (Blackwell Ultra, sm_103), lmsysorg/sglang:v0.5.12-cu130 consistently crashes during the draft model's CUDA graph capture at the largest batch size (bs=128) for GLM-5-NVFP4 with EAGLE speculative decoding. The target model loads and captures its graphs cleanly. Draft capture then throws Exception: Capture cuda graph failed: Error in function 'run' at /workspace/csrc/trtllm_batched_gemm_runner.cu:276. This is a trtllm batched-GEMM failure in which the dispatched kernel carries the suffix sm100f (Blackwell base + feature flag) while running on sm_103 hardware. Full kernel name and analysis below.

The same recipe pinned to lmsysorg/sglang:v0.5.11-cu130 works fine; only the v0.5.12 image regressed.

Environment

| Component | Value |
|---|---|
| sglang image | lmsysorg/sglang:v0.5.12-cu130 |
| Hardware | NVIDIA B300 (sm_103), 4× GPU per node |
| Model | nvidia/GLM-5-NVFP4 (DeepseekV3ForCausalLMNextN, NVFP4) |
| Speculative decoding | EAGLE, --speculative-num-steps=3 --speculative-eagle-topk=1 --speculative-num-draft-tokens=4 |
| Tensor parallelism | TP=4, EP=1 |
| Attention backend | NSA (--attention-backend=nsa --nsa-decode-backend=trtllm --nsa-prefill-backend=trtllm) |
| MoE backend | --moe-runner-backend=flashinfer_trtllm |
| KV cache dtype | fp8_e4m3 |
| Quantization | fp8 (target), nvfp4 weights, modelopt_fp4 quant algo |
| Trigger batch size | --cuda-graph-max-bs=128 --max-running-requests=128 |

Previously known-good image: lmsysorg/sglang:v0.5.11-cu130. v0.5.10.post1-cu130 was also fine.
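
The report depends on the GPUs actually being reported as sm_103 inside the container. A minimal sketch for checking what the runtime sees, assuming PyTorch is importable in the image; `sm_tag` and `visible_sm_tags` are hypothetical helper names, not part of sglang:

```python
def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an sm tag, e.g. (10, 3) -> 'sm_103'."""
    return f"sm_{major}{minor}"


def visible_sm_tags() -> list:
    """Return one sm tag per visible GPU, or [] if torch or CUDA is unavailable."""
    try:
        import torch  # assumption: PyTorch is installed in the container
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        sm_tag(*torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]
```

Calling `visible_sm_tags()` on the node described above should list one tag per GPU if the hardware is as stated. This only confirms the device capability; it does not show which kernel variants the batched-GEMM runner considers eligible.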

Repro recipe

python3 -m sglang.launch_server \
  --model-path=nvidia/GLM-5-NVFP4 \
  --host=0.0.0.0 --port=8888 \
  --trust-remote-code \
  --tensor-parallel-size=4 --data-parallel-size 1 --expert-parallel-size 1 \
  --tool-call-parser glm47 --reasoning-parser glm45 \
  --kv-cache-dtype fp8_e4m3 --quantization fp8 \
  --attention-backend nsa \
  --nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
  --moe-runner-backend flashinfer_trtllm \
  --cuda-graph-max-bs 128 --max-running-requests 128 \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
  --enable-flashinfer-allreduce-fusion \
  --disable-radix-cache --stream-interval 30 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --model-loader-extra-config '{"enable_multithread_load": true}'

Observed timeline (per-TP rank, condensed from the worker log)

[ TP0..3] Capture cuda graph end. Time elapsed: 25.85 s. mem usage=2.52 GB. avail mem=34.62 GB.   ← target model OK
[ TP0..3] Load weight end. elapsed=4.92 s, type=DeepseekV3ForCausalLMNextN, quant=modelopt_fp4,
          quant_algo=NVFP4, avail mem=28.94 GB, mem usage=5.68 GB.                                ← draft model loads OK
[ TP0..3] KV Cache is allocated. #tokens: 2317696, KV size: 1.53 GB
[ TP0..3] Memory pool end. avail mem=27.41 GB
[ TP0]    Capture draft cuda graph begin. This can take up to several minutes. avail mem=28.30 GB
          Capturing batches (bs=128 avail_mem=28.11 GB):   0%|          | 0/35 [00:00<?, ?it/s]
[ TP*  ] Exception: Capture cuda graph failed: Error in function 'run' at
          /workspace/csrc/trtllm_batched_gemm_runner.cu:276:
          Error occurred when running GEMM! (numBatches: 256, GemmMNK: 128 1024 6144,
          Kernel: bmm_Bfloat16_Bfloat16Bfloat16_Fp32_t128x8x128u2_s6_et128x8_m128x8x16
                  _c1x1x1_16dp256b_rM_BN_transOut_schPd2x1x2x3_bN_ldgsts_ldgstsSf
                  _rgTma_clmp_swiGlu_dynB_sm100f)
benchmark_lib.sh: line 97: 1767290 Killed   python3 -m sglang.launch_server ...
Server died before becoming healthy. Exiting.
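
For comparing failures across ranks or image versions, the GEMM error text can be reduced to a few fields. A rough sketch using only the message format visible in the log above; `parse_gemm_error` is a hypothetical helper, and the individual kernel-name fields are deliberately left uninterpreted apart from the trailing architecture suffix:

```python
import re

_PATTERN = re.compile(
    r"numBatches:\s*(?P<num_batches>\d+),\s*"
    r"GemmMNK:\s*(?P<m>\d+)\s+(?P<n>\d+)\s+(?P<k>\d+),\s*"
    r"Kernel:\s*(?P<kernel>\w+)"
)


def parse_gemm_error(text: str) -> dict:
    """Extract batch count, M/N/K, kernel name, and arch suffix from the error."""
    # Rejoin kernel-name fragments that the log wrapped just before an underscore.
    joined = re.sub(r"\s*\n\s*(?=_)", "", text)
    match = _PATTERN.search(joined)
    if match is None:
        raise ValueError("no trtllm batched-GEMM error found")
    kernel = match.group("kernel")
    return {
        "num_batches": int(match.group("num_batches")),
        "mnk": tuple(int(match.group(g)) for g in ("m", "n", "k")),
        "kernel": kernel,
        # The last underscore-separated token is the architecture suffix.
        "arch_suffix": kernel.rsplit("_", 1)[-1],
    }


SAMPLE = """Error occurred when running GEMM! (numBatches: 256, GemmMNK: 128 1024 6144,
          Kernel: bmm_Bfloat16_Bfloat16Bfloat16_Fp32_t128x8x128u2_s6_et128x8_m128x8x16
                  _c1x1x1_16dp256b_rM_BN_transOut_schPd2x1x2x3_bN_ldgsts_ldgstsSf
                  _rgTma_clmp_swiGlu_dynB_sm100f)"""

info = parse_gemm_error(SAMPLE)
print(info["mnk"], info["arch_suffix"])  # (128, 1024, 6144) sm100f
```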

Workarounds we've tried

| Workaround | Result |
|---|---|
| Pin to lmsysorg/sglang:v0.5.11-cu130 | ✅ works |
| --cuda-graph-max-bs 64 --max-running-requests 64 (cap below bs=128) | Not yet tested, but likely sidesteps the trigger; suggested per KLAUD_DEBUG playbook |
| Disable EAGLE (drop speculative decoding) | Likely works, since only the draft graph capture crashes, but defeats the recipe's purpose |
| SGL_ENABLE_JIT_DEEPGEMM=0 | Not tried; different code path |
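
To try the untested workarounds systematically, the launch command can be generated with the batch-size cap and the EAGLE flags as parameters. This is a sketch only: `build_launch_command` is a hypothetical helper, every flag is copied from the repro recipe above, and nothing here has been run against the failing setup.

```python
import shlex

BASE_ARGS = [
    "python3", "-m", "sglang.launch_server",
    "--model-path=nvidia/GLM-5-NVFP4",
    "--host=0.0.0.0", "--port=8888",
    "--trust-remote-code",
    "--tensor-parallel-size=4",
    "--data-parallel-size", "1", "--expert-parallel-size", "1",
    "--tool-call-parser", "glm47", "--reasoning-parser", "glm45",
    "--kv-cache-dtype", "fp8_e4m3", "--quantization", "fp8",
    "--attention-backend", "nsa",
    "--nsa-decode-backend", "trtllm", "--nsa-prefill-backend", "trtllm",
    "--moe-runner-backend", "flashinfer_trtllm",
    "--mem-fraction-static", "0.85",
    "--chunked-prefill-size", "32768", "--max-prefill-tokens", "32768",
    "--enable-flashinfer-allreduce-fusion",
    "--disable-radix-cache", "--stream-interval", "30",
    "--model-loader-extra-config", '{"enable_multithread_load": true}',
]

EAGLE_ARGS = [
    "--speculative-algorithm", "EAGLE",
    "--speculative-num-steps", "3",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
]


def build_launch_command(max_bs: int, eagle: bool = True) -> list:
    """Return the repro argv with a given batch-size cap, optionally without EAGLE."""
    cmd = list(BASE_ARGS)
    cmd += ["--cuda-graph-max-bs", str(max_bs),
            "--max-running-requests", str(max_bs)]
    if eagle:
        cmd += EAGLE_ARGS
    return cmd


# Print the two untested variants: capped at bs=64, and bs=128 without EAGLE.
print(shlex.join(build_launch_command(64)))
print(shlex.join(build_launch_command(128, eagle=False)))
```

Keeping the base flags in one list makes it easier to see that only the batch-size cap or the speculative-decoding flags changed between runs.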

Source

Happy to attach a full server.log if useful.
