[Bug] GLM-5-NVFP4 + EAGLE on B300 (sm_103): trtllm_batched_gemm_runner.cu:276 dispatches sm100f kernel — crashes at bs=128 draft graph capture (v0.5.12-cu130; v0.5.11 works)

# human 

Lower confidence that this is a bug, tbh; idk.
```
Error occurred when running GEMM! (numBatches: 256, GemmMNK: 128 1024 6144,
          Kernel: bmm_Bfloat16_Bfloat16Bfloat16_Fp32_t128x8x128u2_s6_et128x8_m128x8x16
                  _c1x1x1_16dp256b_rM_BN_transOut_schPd2x1x2x3_bN_ldgsts_ldgstsSf
                  _rgTma_clmp_swiGlu_dynB_sm100f)
```


# AI generated below

## Summary

On NVIDIA B300 (Blackwell Ultra, `sm_103`), `lmsysorg/sglang:v0.5.12-cu130` consistently crashes during the **draft model's CUDA graph capture at the largest batch size** for GLM-5-NVFP4 with EAGLE speculative decoding. The target model loads and captures graphs cleanly; draft capture then throws `Exception: Capture cuda graph failed: Error in function 'run' at /workspace/csrc/trtllm_batched_gemm_runner.cu:276`. This is a trtllm batched-GEMM failure in which the dispatched kernel is suffixed `sm100f` (Blackwell base + feature flag) while running on `sm_103` hardware. The full kernel name appears in the observed timeline below.

The same recipe pinned to `lmsysorg/sglang:v0.5.11-cu130` works; only the v0.5.12 image regressed.
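To confirm which architecture the runtime actually sees on the node, a minimal sketch like the following could be run inside the container. It assumes PyTorch is importable in the image and only prints what the driver reports; it makes no claim about whether an `sm100f` kernel is or is not valid on that device. The helper names are hypothetical.

```python
def capability_to_sm(capability):
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def local_gpu_archs():
    """Return [(index, device name, sm string)] for visible GPUs.

    Returns an empty list if torch is missing or no CUDA device is visible.
    """
    try:
        import torch  # assumption: the sglang image ships PyTorch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (i, torch.cuda.get_device_name(i),
         capability_to_sm(torch.cuda.get_device_capability(i)))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for index, name, sm in local_gpu_archs():
        print(f"GPU {index}: {name} ({sm})")
```

On the B300 nodes described here, each GPU would be expected to report `sm_103`, which can then be set against the `sm100f` suffix in the failing kernel name.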

## Environment

| | |
|---|---|
| sglang image | `lmsysorg/sglang:v0.5.12-cu130` |
| Hardware | NVIDIA B300 (`sm_103`), 4× GPU per node |
| Model | `nvidia/GLM-5-NVFP4` (DeepseekV3ForCausalLMNextN, NVFP4) |
| Speculative decoding | EAGLE, `--speculative-num-steps=3 --speculative-eagle-topk=1 --speculative-num-draft-tokens=4` |
| Tensor parallelism | TP=4, EP=1 |
| Attention backend | NSA (`--attention-backend=nsa --nsa-decode-backend=trtllm --nsa-prefill-backend=trtllm`) |
| MoE backend | `--moe-runner-backend=flashinfer_trtllm` |
| KV cache dtype | `fp8_e4m3` |
| Quantization | `--quantization fp8` (target); NVFP4 weights (`quant=modelopt_fp4`, `quant_algo=NVFP4` per the load log) |
| Trigger batch size | `--cuda-graph-max-bs=128 --max-running-requests=128` |

Previously known-good image: `lmsysorg/sglang:v0.5.11-cu130`. v0.5.10.post1-cu130 was also fine.
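Since the only variable between the working and failing runs is the image, it may help to diff the installed package versions between the two containers. The sketch below is one way to collect them. The distribution names in `CANDIDATE_PACKAGES` are assumptions about what the images contain, not confirmed contents; packages that are absent are reported as `None`.

```python
from importlib import metadata

# Assumed distribution names; adjust to whatever `pip list` shows in the image.
CANDIDATE_PACKAGES = ("sglang", "flashinfer-python", "sgl-kernel", "torch")


def installed_versions(names=CANDIDATE_PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

Running this once in `v0.5.11-cu130` and once in `v0.5.12-cu130` and comparing the output would show which kernel-providing dependency changed between the two images.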

## Repro recipe

```
python3 -m sglang.launch_server \
  --model-path=nvidia/GLM-5-NVFP4 \
  --host=0.0.0.0 --port=8888 \
  --trust-remote-code \
  --tensor-parallel-size=4 --data-parallel-size 1 --expert-parallel-size 1 \
  --tool-call-parser glm47 --reasoning-parser glm45 \
  --kv-cache-dtype fp8_e4m3 --quantization fp8 \
  --attention-backend nsa \
  --nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
  --moe-runner-backend flashinfer_trtllm \
  --cuda-graph-max-bs 128 --max-running-requests 128 \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
  --enable-flashinfer-allreduce-fusion \
  --disable-radix-cache --stream-interval 30 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --model-loader-extra-config '{"enable_multithread_load": true}'
```

## Observed timeline (per-TP rank, condensed from the worker log)

```
[ TP0..3] Capture cuda graph end. Time elapsed: 25.85 s. mem usage=2.52 GB. avail mem=34.62 GB.   ← target model OK
[ TP0..3] Load weight end. elapsed=4.92 s, type=DeepseekV3ForCausalLMNextN, quant=modelopt_fp4,
          quant_algo=NVFP4, avail mem=28.94 GB, mem usage=5.68 GB.                                ← draft model loads OK
[ TP0..3] KV Cache is allocated. #tokens: 2317696, KV size: 1.53 GB
[ TP0..3] Memory pool end. avail mem=27.41 GB
[ TP0]    Capture draft cuda graph begin. This can take up to several minutes. avail mem=28.30 GB
          Capturing batches (bs=128 avail_mem=28.11 GB):   0%|          | 0/35 [00:00<?, ?it/s]
[ TP*  ] Exception: Capture cuda graph failed: Error in function 'run' at
          /workspace/csrc/trtllm_batched_gemm_runner.cu:276:
          Error occurred when running GEMM! (numBatches: 256, GemmMNK: 128 1024 6144,
          Kernel: bmm_Bfloat16_Bfloat16Bfloat16_Fp32_t128x8x128u2_s6_et128x8_m128x8x16
                  _c1x1x1_16dp256b_rM_BN_transOut_schPd2x1x2x3_bN_ldgsts_ldgstsSf
                  _rgTma_clmp_swiGlu_dynB_sm100f)
benchmark_lib.sh: line 97: 1767290 Killed   python3 -m sglang.launch_server ...
Server died before becoming healthy. Exiting.
```
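For comparing this failure across runs, the GEMM error in the log above can be reduced to a few fields. The following is a rough sketch that parses only the message format shown in this issue; the function name and the returned keys are hypothetical, and other versions may format the message differently.

```python
import re

# Error text copied from the worker log, including its line wrapping.
ERROR_TEXT = """
Error occurred when running GEMM! (numBatches: 256, GemmMNK: 128 1024 6144,
          Kernel: bmm_Bfloat16_Bfloat16Bfloat16_Fp32_t128x8x128u2_s6_et128x8_m128x8x16
                  _c1x1x1_16dp256b_rM_BN_transOut_schPd2x1x2x3_bN_ldgsts_ldgstsSf
                  _rgTma_clmp_swiGlu_dynB_sm100f)
"""

PATTERN = re.compile(
    r"numBatches: (\d+), GemmMNK: (\d+) (\d+) (\d+), Kernel: ([^)]+)\)"
)


def parse_gemm_error(text):
    """Extract batch count, M/N/K and kernel name; return None if not found."""
    flat = re.sub(r"\s+", " ", text)  # undo the log's line wrapping
    match = PATTERN.search(flat)
    if match is None:
        return None
    kernel = match.group(5).replace(" ", "")  # rejoin the wrapped kernel name
    return {
        "num_batches": int(match.group(1)),
        "mnk": tuple(int(match.group(i)) for i in (2, 3, 4)),
        "kernel": kernel,
        "arch_suffix": kernel.rsplit("_", 1)[-1],  # last "_"-separated token
    }


if __name__ == "__main__":
    print(parse_gemm_error(ERROR_TEXT))
```

The `arch_suffix` field is simply the last underscore-separated token of the kernel name, which for this log is `sm100f`.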


## Workarounds we've tried

| Workaround | Result |
|---|---|
| Pin to `lmsysorg/sglang:v0.5.11-cu130` | ✅ works |
| `--cuda-graph-max-bs 64 --max-running-requests 64` (cap below bs=128) | Not yet tested, but likely sidesteps the trigger; suggested per KLAUD_DEBUG playbook |
| Disable EAGLE (drop speculative decoding) | Likely works — only the draft graph capture crashes — but defeats the recipe's purpose |
| `SGL_ENABLE_JIT_DEEPGEMM=0` | Not tried — different code path |
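
The untested batch-size cap from the table could be tried by rewriting the two flags in the repro command. This is a small sketch of one way to do that; `override_flags` is a hypothetical helper, and `RECIPE` is a shortened stand-in for the full command in the repro recipe. Whether capping at 64 actually avoids the crash is still unverified.

```python
import shlex


def override_flags(argv, overrides):
    """Return a copy of argv with the given flags set to new values.

    Handles both "--flag value" and "--flag=value"; flags that are missing
    from argv are appended at the end.
    """
    out, seen, i = [], set(), 0
    while i < len(argv):
        name, eq, _ = argv[i].partition("=")
        if name in overrides:
            out.extend([name, str(overrides[name])])
            seen.add(name)
            i += 1 if eq else 2  # also skip the separate value token
        else:
            out.append(argv[i])
            i += 1
    for name, value in overrides.items():
        if name not in seen:
            out.extend([name, str(value)])
    return out


# Shortened stand-in for the repro command above.
RECIPE = (
    "python3 -m sglang.launch_server --model-path=nvidia/GLM-5-NVFP4 "
    "--cuda-graph-max-bs 128 --max-running-requests 128"
)

if __name__ == "__main__":
    capped = override_flags(
        shlex.split(RECIPE),
        {"--cuda-graph-max-bs": 64, "--max-running-requests": 64},
    )
    print(shlex.join(capped))
```

The printed command can be pasted back into the benchmark script, leaving every other flag untouched so the comparison with the failing run stays like for like.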

## Source

- Failing CI run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25984497319
- Reproducer recipe (script): https://github.com/SemiAnalysisAI/InferenceX/blob/main/benchmarks/single_node/glm5_fp4_b300_mtp.sh
- Master config entry: `glm5-fp4-b300-sglang` / `glm5-fp4-b300-sglang-mtp` in https://github.com/SemiAnalysisAI/InferenceX/blob/main/.github/configs/nvidia-master.yaml
- Tracking PR (image bump that exposed the regression): https://github.com/SemiAnalysisAI/InferenceX/pull/1420

Happy to attach a full `server.log` if useful.




Issue: #25563