re-submit 12911 but relax the requirement for deepgemm#13226

Merged
fzyzcjy merged 3 commits into sgl-project:main from zminglei:fix-mm
Nov 15, 2025

Conversation

@zminglei
Collaborator

@zminglei zminglei commented Nov 13, 2025

Motivation

Re-submit #12911 but relax the requirement for deepgemm: only fall back to the Triton kernel when it is actually needed. When N < 16, deepgemm throws a CUDA exception because its minimum block_n is 16.

  1. Verified that Qwen3-Next now launches successfully. Without the fix it would fail, as deepgemm errors out for b.shape=[2048, 1] where N = 1.
  2. Verified no perf regression for models like Qwen3-8B and Qwen3-4B with deterministic inference enabled, since they still use deepgemm: all of their matmuls have N >= 16 (the batch size is M, not N).
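A minimal sketch of the dispatch described above, for illustration only: fall back to a Triton-style kernel when the output's N dimension is below deepgemm's minimum block_n. The names (`MIN_BLOCK_N`, `deterministic_mm`, the kernel callables) are hypothetical, not the actual sglang symbols.

```python
MIN_BLOCK_N = 16  # deepgemm's minimum block_n; smaller N raises a CUDA exception


def deterministic_mm(a, b, deepgemm_mm, triton_mm):
    """Pick a kernel by the N dimension of b (b is [K, N]).

    Only falls back to the (slower) Triton kernel when deepgemm
    cannot handle the shape, i.e. N < MIN_BLOCK_N.
    """
    n = b.shape[1]
    if n < MIN_BLOCK_N:
        return triton_mm(a, b)  # fallback path, e.g. b.shape = [2048, 1]
    return deepgemm_mm(a, b)    # fast path, N >= 16
```

This keeps the fast deepgemm path for common shapes (which is why the benchmarks below show no regression) while avoiding the crash for tiny N.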

python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-8B/2069b3fae1114555f3c020c81410e51fa0f656f2/ --mem-fraction-static 0.8 --enable-deterministic-inference

Main branch:

python3 -m sglang.test.send_one

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    4.503    |  512   |   1.000    |     113.71      |
+-------------+--------+------------+-----------------+

Current change:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    4.502    |  512   |   1.000    |     113.72      |
+-------------+--------+------------+-----------------+

python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-4B/9e1b55c76f4b5bf0d14d37da8010110060f512e0/ --enable-deterministic-inference
Main branch:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    3.509    |  512   |   1.000    |     145.90      |
+-------------+--------+------------+-----------------+

Current change:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    3.508    |  512   |   1.000    |     145.94      |
+-------------+--------+------------+-----------------+

With the old fix (much slower):

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    8.352    |  512   |   1.000    |      61.30      |
+-------------+--------+------------+-----------------+


@zminglei zminglei marked this pull request as ready for review November 13, 2025 21:49
Collaborator

@fzyzcjy fzyzcjy left a comment


code LGTM, w/ some extra tests this is ready to merge


@zminglei
Collaborator Author

code LGTM, w/ some extra tests this is ready to merge


Thanks, I just added the test results in the PR description.

@fzyzcjy
Collaborator

fzyzcjy commented Nov 15, 2025

LGTM

@fzyzcjy fzyzcjy merged commit 8a43734 into sgl-project:main Nov 15, 2025
43 of 48 checks passed
