Checklist
Describe the bug
Launching DeepSeek V3.2 on B200 hits the following error during CUDA graph capture:
[2025-11-17 20:37:08 DP3 TP3] Scheduler hit an exception: Traceback (most recent call last):
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 395, in __init__
self.capture()
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 553, in capture
_capture_one_stream()
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 537, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 743, in capture_one_batch_size
run_once()
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 730, in run_once
logits_output_or_pp_proxy_tensors = forward(
^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 3365, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 3175, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 2888, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 1435, in forward
return self.forward_core(s)
^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 1534, in forward_core
return self.forward_absorb_core(*inner_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 2007, in forward_absorb_core
torch.bmm(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int
)ldb, strideb, (void*)&fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
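The failure surfaces in the `torch.bmm` call inside `forward_absorb_core`, which dispatches to `cublasGemmStridedBatchedEx` for BF16 inputs. A minimal standalone sketch that exercises the same dispatch path (shapes here are illustrative, not the actual model dimensions) can help confirm whether a given sgl-kernel/torch build is affected:

```python
import torch

def bmm_bf16(batch: int = 4, m: int = 128, k: int = 512, n: int = 128):
    """Minimal batched BF16 matmul approximating the failing call.

    On CUDA this dispatches to cublasGemmStridedBatchedEx; on an affected
    build it raises CUBLAS_STATUS_EXECUTION_FAILED during graph capture.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(batch, m, k, dtype=torch.bfloat16, device=device)
    b = torch.randn(batch, k, n, dtype=torch.bfloat16, device=device)
    return torch.bmm(a, b)

if __name__ == "__main__":
    out = bmm_bf16()
    print(out.shape)  # torch.Size([4, 128, 128])
```

On a healthy build this returns a `[batch, m, n]` BF16 tensor; on the broken wheel it should reproduce the cuBLAS execution failure without launching the full server.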
Building sgl-kernel locally resolves this temporarily.
Potential cause: The kernel build workflow in #12969 uses torch 2.9, which contaminated the ccache used for the kernel build in the latest sgl-kernel release.
Solution: Avoid using ccache in the kernel release workflow.
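As a sketch of what "avoid ccache" could look like in the release job (the exact workflow steps and build entrypoint are assumptions, not the actual CI config), ccache can be disabled via its standard environment variable, or the cache can be wiped before the build so stale objects cannot leak into the wheel:

```shell
# Option 1: disable ccache entirely for this build via its standard env var.
export CCACHE_DISABLE=1

# Option 2 (alternative): wipe the cache before building so objects
# compiled against a different torch cannot be reused.
# ccache -C   # destructive: clears the whole cache

echo "CCACHE_DISABLE=${CCACHE_DISABLE}"
```

Disabling ccache makes release builds slower but fully hermetic; clearing the cache per release keeps incremental speed within a run while dropping state from earlier torch versions.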
Reproduction
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8
Environment
sglang 0.5.5.post2, sgl-kernel 0.3.17.post1, B200