
fallback to triton mm_persistent kernel when deepGemm fail#12911

Merged
zhyncs merged 3 commits into sgl-project:main from zminglei:fix-mm-persistent
Nov 9, 2025

Conversation


@zminglei zminglei commented Nov 9, 2025

Motivation

Launching the Qwen3-Next model with deterministic inference enabled fails without this fix:
python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-Next-80B-A3B-Instruct/ --tp 4 --context-length 262144 --mem-fraction-static 0.7 --enable-deterministic-inference

Without the fix:

    return forward_call(*args, **kwargs)
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/batch_invariant_ops/batch_invariant_ops.py", line 534, in mm_batch_invariant
    return matmul_persistent(a, b)
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/batch_invariant_ops/batch_invariant_ops.py", line 279, in matmul_persistent
    return _matmul_persistent_deepgemm(a=a, b=b, bias=bias)
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/batch_invariant_ops/batch_invariant_ops.py", line 244, in _matmul_persistent_deepgemm
    deep_gemm.bf16_gemm_nn(a, b, out)
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/deep_gemm/__init__.py", line 50, in _fn
    return func(*args, **kwargs)
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/_ops.py", line 1243, in __call__
    return self._op(*args, **kwargs)
RuntimeError: CUDA driver error (_deps/repo-deepgemm-src/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:108): 1 (CUDA_ERROR_INVALID_VALUE, invalid argument)

Failure reason: DeepGEMM is optimized for large-scale GEMM operations. Its design assumes matrix dimensions large enough to accommodate block sizes of at least 64-128 in both the M and N dimensions for efficient GPU utilization; smaller shapes fail with CUDA_ERROR_INVALID_VALUE, as in the traceback above.
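A hypothetical guard illustrating the constraint described above. The threshold value and the function name are assumptions for illustration, not the actual SGLang code:

```python
# Illustrative only: DeepGEMM assumes M and N can each accommodate a block
# size of at least 64; the exact threshold here is an assumption.
MIN_BLOCK = 64  # assumed minimum tile size along M and N

def can_use_deepgemm(m: int, n: int) -> bool:
    """Return True only when both GEMM dimensions fit the assumed block size."""
    return m >= MIN_BLOCK and n >= MIN_BLOCK

print(can_use_deepgemm(4096, 4096))  # large GEMM: DeepGEMM path is fine
print(can_use_deepgemm(32, 4096))    # small M: would need the Triton fallback
```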

With fix:

[2025-11-09 06:46:32] INFO:     Application startup complete.
[2025-11-09 06:46:32] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-11-09 06:46:33] INFO:     127.0.0.1:48508 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-09 06:46:33 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-09 06:46:34] INFO:     127.0.0.1:48522 - "POST /generate HTTP/1.1" 200 OK
[2025-11-09 06:46:34] The server is fired up and ready to roll!

Modifications
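
A minimal, self-contained sketch of the fallback pattern this PR adds to `matmul_persistent` in `batch_invariant_ops.py`. Both kernel paths are stubbed out here; only the control flow reflects the change, and the stub behaviors are assumptions for illustration:

```python
# Sketch of the fallback control flow. Both kernel paths are stubs: the real
# code calls deep_gemm.bf16_gemm_nn and a Triton mm_persistent kernel.

def _matmul_persistent_deepgemm(a, b, bias=None):
    # Stub: the real DeepGEMM path raises RuntimeError
    # (CUDA_ERROR_INVALID_VALUE) when dims are too small for its block sizes.
    raise RuntimeError("CUDA driver error: invalid argument")

def _matmul_persistent_triton(a, b, bias=None):
    # Stub: the real Triton mm_persistent kernel also handles small shapes.
    return "triton-result"

def matmul_persistent(a, b, bias=None):
    try:
        return _matmul_persistent_deepgemm(a=a, b=b, bias=bias)
    except RuntimeError:
        # Fall back to the Triton kernel instead of failing the server.
        return _matmul_persistent_triton(a, b, bias=bias)

print(matmul_persistent(None, None))  # → triton-result
```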

Accuracy Tests

Benchmarking and Profiling

Checklist

@zminglei zminglei marked this pull request as ready for review November 9, 2025 07:20
@hebiao064 hebiao064 self-assigned this Nov 9, 2025
@hebiao064
Collaborator

@fzyzcjy pls review

@zhyncs zhyncs merged commit 8a821af into sgl-project:main Nov 9, 2025
28 of 66 checks passed
-    return _matmul_persistent_deepgemm(a=a, b=b, bias=bias)
+    try:
+        return _matmul_persistent_deepgemm(a=a, b=b, bias=bias)
+    except RuntimeError:
@fzyzcjy (Collaborator) Nov 9, 2025

nit: hmm I do hope not to do such try-catch, since it can lead to weird issues :/ What about

except RuntimeError:
  raise Exception('err, you should change the if condition above to use triton in this case, blahblha')

@zminglei (Collaborator, Author) Nov 9, 2025

Trying to understand more here. I'm thinking of DeepGEMM as an optimized path we can always try first; if it has an issue, we can fall back to the Triton one, which should work for almost all cases, instead of failing the server. What kind of issues could falling back to the Triton kernel lead to?

@fzyzcjy (Collaborator) Nov 9, 2025

my personal thought is, we may fall back in unexpected ways :/


also, this will waste cpu time

