Checklist
Describe the bug
Launching DeepSeek V3.2 on B200 hits the following error during CUDA graph capture:
[2025-11-17 20:37:08 DP3 TP3] Scheduler hit an exception: Traceback (most recent call last):
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 395, in __init__
self.capture()
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 553, in capture
_capture_one_stream()
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 537, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 743, in capture_one_batch_size
run_once()
File "/raid/data/hlu/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 730, in run_once
logits_output_or_pp_proxy_tensors = forward(
^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 3365, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 3175, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 2888, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 1435, in forward
return self.forward_core(s)
^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 1534, in forward_core
return self.forward_absorb_core(*inner_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/data/hlu/sglang/python/sglang/srt/models/deepseek_v2.py", line 2007, in forward_absorb_core
torch.bmm(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int
)ldb, strideb, (void*)&fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
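The failure surfaces in the `torch.bmm` call inside `forward_absorb_core`, which dispatches to `cublasGemmStridedBatchedEx` for BF16 inputs. A minimal standalone sketch that exercises the same dispatch path (shapes here are illustrative, not the actual model dimensions) can help confirm whether a given sgl-kernel/torch build is affected:

```python
import torch

def bmm_bf16(batch: int = 4, m: int = 128, k: int = 512, n: int = 128):
    """Minimal batched BF16 matmul approximating the failing call.

    On CUDA this dispatches to cublasGemmStridedBatchedEx; on an affected
    build it raises CUBLAS_STATUS_EXECUTION_FAILED during graph capture.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(batch, m, k, dtype=torch.bfloat16, device=device)
    b = torch.randn(batch, k, n, dtype=torch.bfloat16, device=device)
    return torch.bmm(a, b)

if __name__ == "__main__":
    out = bmm_bf16()
    print(out.shape)  # torch.Size([4, 128, 128])
```

On a healthy build this returns a `[batch, m, n]` BF16 tensor; on the broken wheel it should reproduce the cuBLAS execution failure without launching the full server.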
Building sgl-kernel locally resolves this temporarily.
Potential cause: The kernel build workflow in #12969 uses torch 2.9, which contaminated the ccache used for the kernel build in the latest sgl-kernel release.
Solution: Avoid using ccache in the kernel release workflow.
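As a sketch of what "avoid ccache" could look like in the release job (the exact workflow steps and build entrypoint are assumptions, not the actual CI config), ccache can be disabled via its standard environment variable, or the cache can be wiped before the build so stale objects cannot leak into the wheel:

```shell
# Option 1: disable ccache entirely for this build via its standard env var.
export CCACHE_DISABLE=1

# Option 2 (alternative): wipe the cache before building so objects
# compiled against a different torch cannot be reused.
# ccache -C   # destructive: clears the whole cache

echo "CCACHE_DISABLE=${CCACHE_DISABLE}"
```

Disabling ccache makes release builds slower but fully hermetic; clearing the cache per release keeps incremental speed within a run while dropping state from earlier torch versions.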
Reproduction
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8
Environment
sglang 0.5.5.post2, sgl-kernel 0.3.17.post1, B200