re-submit 12911 but relax the requirement for deepgemm#13226

Merged
fzyzcjy merged 3 commits into sgl-project:main from zminglei:fix-mm
Nov 15, 2025

Conversation

@zminglei
Collaborator

@zminglei zminglei commented Nov 13, 2025

Motivation

Re-submit #12911 but relax the requirement for deepgemm: only fall back to the Triton kernel when it is actually needed. When N < 16, deepgemm throws a CUDA exception because its minimum block_n is 16.

  1. Verified that Qwen3-Next now launches successfully. Without the fix it would fail, as deepgemm errors out for b.shape=[2048, 1] where N = 1.
  2. Verified no perf regression for models like Qwen3-8B and Qwen3-4B with deterministic inference enabled, since they still use deepgemm: all of their matmuls have N >= 16 (the batch size is M, not N).
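A minimal sketch of the dispatch described above, for illustration only: fall back to a Triton-style kernel when the output's N dimension is below deepgemm's minimum block_n. The names (`MIN_BLOCK_N`, `deterministic_mm`, the kernel callables) are hypothetical, not the actual sglang symbols.

```python
MIN_BLOCK_N = 16  # deepgemm's minimum block_n; smaller N raises a CUDA exception


def deterministic_mm(a, b, deepgemm_mm, triton_mm):
    """Pick a kernel by the N dimension of b (b is [K, N]).

    Only falls back to the (slower) Triton kernel when deepgemm
    cannot handle the shape, i.e. N < MIN_BLOCK_N.
    """
    n = b.shape[1]
    if n < MIN_BLOCK_N:
        return triton_mm(a, b)  # fallback path, e.g. b.shape = [2048, 1]
    return deepgemm_mm(a, b)    # fast path, N >= 16
```

This keeps the fast deepgemm path for common shapes (which is why the benchmarks below show no regression) while avoiding the crash for tiny N.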

python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-8B/2069b3fae1114555f3c020c81410e51fa0f656f2/ --mem-fraction-static 0.8 --enable-deterministic-inference

Main branch:

python3 -m sglang.test.send_one

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    4.503    |  512   |   1.000    |     113.71      |
+-------------+--------+------------+-----------------+

Current change:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    4.502    |  512   |   1.000    |     113.72      |
+-------------+--------+------------+-----------------+

python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-4B/9e1b55c76f4b5bf0d14d37da8010110060f512e0/ --enable-deterministic-inference
Main branch:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    3.509    |  512   |   1.000    |     145.90      |
+-------------+--------+------------+-----------------+

Current change:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    3.508    |  512   |   1.000    |     145.94      |
+-------------+--------+------------+-----------------+

With the old fix (much slower):

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    8.352    |  512   |   1.000    |      61.30      |
+-------------+--------+------------+-----------------+


@zminglei zminglei marked this pull request as ready for review November 13, 2025 21:49
Collaborator

@fzyzcjy fzyzcjy left a comment


code LGTM, w/ some extra tests this is ready to merge


@zminglei
Collaborator Author

code LGTM, w/ some extra tests this is ready to merge


Thanks, I just added the test results in the PR description.

@fzyzcjy
Collaborator

fzyzcjy commented Nov 15, 2025

LGTM

@fzyzcjy fzyzcjy merged commit 8a43734 into sgl-project:main Nov 15, 2025
43 of 48 checks passed
