
fallback to triton mm_persistent kernel when deepGemm fail#12911

Merged
zhyncs merged 3 commits into sgl-project:main from zminglei:fix-mm-persistent
Nov 9, 2025

Conversation


@zminglei zminglei commented Nov 9, 2025

Motivation

Launching the Qwen3-Next model with deterministic inference enabled fails without this fix:
python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-Next-80B-A3B-Instruct/ --tp 4 --context-length 262144 --mem-fraction-static 0.7 --enable-deterministic-inference

Without the fix:

    return forward_call(*args, **kwargs)
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/batch_invariant_ops/batch_invariant_ops.py", line 534, in mm_batch_invariant
    return matmul_persistent(a, b)
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/batch_invariant_ops/batch_invariant_ops.py", line 279, in matmul_persistent
    return _matmul_persistent_deepgemm(a=a, b=b, bias=bias)
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/batch_invariant_ops/batch_invariant_ops.py", line 244, in _matmul_persistent_deepgemm
    deep_gemm.bf16_gemm_nn(a, b, out)
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/deep_gemm/__init__.py", line 50, in _fn
    return func(*args, **kwargs)
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/_ops.py", line 1243, in __call__
    return self._op(*args, **kwargs)
RuntimeError: CUDA driver error (_deps/repo-deepgemm-src/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:108): 1 (CUDA_ERROR_INVALID_VALUE, invalid argument)

Failure reason: DeepGEMM is optimized for large-scale GEMM operations. Its design assumes matrix dimensions large enough to accommodate block sizes of at least 64-128 in both the M and N dimensions for efficient GPU utilization; smaller shapes fail with CUDA_ERROR_INVALID_VALUE, as in the traceback above.
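A hypothetical guard illustrating the constraint described above. The threshold value and the function name are assumptions for illustration, not the actual SGLang code:

```python
# Illustrative only: DeepGEMM assumes M and N can each accommodate a block
# size of at least 64; the exact threshold here is an assumption.
MIN_BLOCK = 64  # assumed minimum tile size along M and N

def can_use_deepgemm(m: int, n: int) -> bool:
    """Return True only when both GEMM dimensions fit the assumed block size."""
    return m >= MIN_BLOCK and n >= MIN_BLOCK

print(can_use_deepgemm(4096, 4096))  # large GEMM: DeepGEMM path is fine
print(can_use_deepgemm(32, 4096))    # small M: would need the Triton fallback
```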

With fix:

[2025-11-09 06:46:32] INFO:     Application startup complete.
[2025-11-09 06:46:32] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-11-09 06:46:33] INFO:     127.0.0.1:48508 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-09 06:46:33 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-09 06:46:34] INFO:     127.0.0.1:48522 - "POST /generate HTTP/1.1" 200 OK
[2025-11-09 06:46:34] The server is fired up and ready to roll!

Modifications
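
A minimal, self-contained sketch of the fallback pattern this PR adds to `matmul_persistent` in `batch_invariant_ops.py`. Both kernel paths are stubbed out here; only the control flow reflects the change, and the stub behaviors are assumptions for illustration:

```python
# Sketch of the fallback control flow. Both kernel paths are stubs: the real
# code calls deep_gemm.bf16_gemm_nn and a Triton mm_persistent kernel.

def _matmul_persistent_deepgemm(a, b, bias=None):
    # Stub: the real DeepGEMM path raises RuntimeError
    # (CUDA_ERROR_INVALID_VALUE) when dims are too small for its block sizes.
    raise RuntimeError("CUDA driver error: invalid argument")

def _matmul_persistent_triton(a, b, bias=None):
    # Stub: the real Triton mm_persistent kernel also handles small shapes.
    return "triton-result"

def matmul_persistent(a, b, bias=None):
    try:
        return _matmul_persistent_deepgemm(a=a, b=b, bias=bias)
    except RuntimeError:
        # Fall back to the Triton kernel instead of failing the server.
        return _matmul_persistent_triton(a, b, bias=bias)

print(matmul_persistent(None, None))  # → triton-result
```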

Accuracy Tests

Benchmarking and Profiling

Checklist

@zminglei zminglei marked this pull request as ready for review November 9, 2025 07:20
@hebiao064 hebiao064 self-assigned this Nov 9, 2025
@hebiao064
Collaborator

@fzyzcjy pls review

@zhyncs zhyncs merged commit 8a821af into sgl-project:main Nov 9, 2025
28 of 66 checks passed
-    return _matmul_persistent_deepgemm(a=a, b=b, bias=bias)
+    try:
+        return _matmul_persistent_deepgemm(a=a, b=b, bias=bias)
+    except RuntimeError:
@fzyzcjy (Collaborator) Nov 9, 2025

nit: hmm I do hope not to do such try-catch, since it can lead to weird issues :/ What about

except RuntimeError:
  raise Exception('err, you should change the if condition above to use triton in this case, blahblha')

@zminglei (Collaborator, Author) Nov 9, 2025

Trying to understand more here. I'm thinking of DeepGEMM as an optimized path we can always try first; if it has an issue, we can fall back to the Triton one, which should work for almost all cases, instead of failing the server. What kind of issues could falling back to the Triton kernel lead to?

@fzyzcjy (Collaborator) Nov 9, 2025

my personal thought is, we may fall back in unexpected ways :/


also, this will waste cpu time

