Support moe topk sigmoid kernel #13049
Merged
ispobock merged 13 commits into sgl-project:main from Nov 19, 2025
Conversation
Signed-off-by: xuebi <xuebi@minimaxi.com>
Collaborator
Cool
Collaborator
Thanks a lot, I have some questions about
Contributor
Author
I ran the comparison against the lmsysorg/sglang:dev image. On GSM8K, the accuracy was 0.9249 with the dev image and 0.9295 with this patch. For AIME2025, my measured accuracy is 0.803, while the official MiniMax-M2 report is 0.78.

lm_eval --model local-completions \
    --model_args base_url=http://localhost:8000/v1/completions,tokenizer=/model,model=/model \
    --tasks gsm8k_cot \
    --batch_size 128 \
    --num_fewshot 5
# topk_sigmoid
local-completions (base_url=http://localhost:8000/v1/completions,tokenizer=/model,model=/model), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 128
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 5|exact_match|↑ |0.9295|± |0.0071|
| | |strict-match | 5|exact_match|↑ |0.9158|± |0.0076|
# lmsysorg/sglang:dev
local-completions (base_url=http://localhost:8000/v1/completions,tokenizer=/model,model=/model), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 128
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 5|exact_match|↑ |0.9249|± |0.0073|
| | |strict-match | 5|exact_match|↑ |0.9113|± |0.0078|
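As a quick sanity check on the GSM8K numbers, the 0.9295 vs 0.9249 gap is well within the reported standard errors, so the kernel change looks accuracy-neutral. A rough two-sample z-score (assuming the two runs are independent) bears this out:

```python
import math

# Flexible-extract accuracies and standard errors from the two runs above.
acc_patch, se_patch = 0.9295, 0.0071
acc_dev, se_dev = 0.9249, 0.0073

# Two-sample z-score for the difference, assuming independent runs.
z = (acc_patch - acc_dev) / math.sqrt(se_patch**2 + se_dev**2)
print(f"z = {z:.2f}")  # well below 1.96, so the difference is not significant
```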
Signed-off-by: xuebi <xuebi@minimaxi.com>
FlamingoPg
approved these changes
Nov 13, 2025
Collaborator
Please add a kernel micro benchmark, refer to https://github.com/sgl-project/sglang/tree/main/sgl-kernel/benchmark
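The sgl-kernel benchmarks time each configuration with `triton.testing.do_bench` on GPU; as a simplified, CPU-runnable sketch of the same sweep structure (the function and helper names here are illustrative, not the actual benchmark code), one can iterate over (num_tokens, num_experts, topk) shapes and time a torch reference:

```python
import itertools
import time

import torch


def torch_topk_sigmoid(logits: torch.Tensor, topk: int):
    # Reference op: sigmoid gating followed by per-token top-k selection.
    scores = torch.sigmoid(logits.float())
    return torch.topk(scores, k=topk, dim=-1)


def bench(fn, *args, iters=20):
    # Wall-clock timing in microseconds; the real benchmarks use
    # triton.testing.do_bench with CUDA events instead.
    fn(*args)  # warmup
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters * 1e6


results = []
for num_tokens, num_experts, topk in itertools.product([128, 512], [32, 64], [1, 2]):
    logits = torch.randn(num_tokens, num_experts)
    results.append((num_tokens, num_experts, topk,
                    bench(torch_topk_sigmoid, logits, topk)))

for row in results:
    print(row)
```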
BBuf
approved these changes
Nov 16, 2025
Contributor
Author
I added micro-benchmarks, and below are the test results:

python sgl-kernel/benchmark/bench_moe_topk_sigmoid.py
✅ Torch and SGLang topk_sigmoid implementations match
✅ Torch and SGLang topk_sigmoid implementations match
✅ Torch and SGLang topk_sigmoid implementations match
✅ Torch and SGLang topk_sigmoid implementations match
✅ Torch and SGLang topk_sigmoid implementations match
✅ Torch and SGLang topk_sigmoid implementations match
topk-sigmoid-performance:
num_tokens num_experts topk SGLang Torch
0 128.0 32.0 1.0 1.606173 14.432703
1 128.0 32.0 2.0 1.979544 18.997198
2 128.0 32.0 4.0 2.210692 20.242306
3 128.0 32.0 8.0 3.096960 21.853833
4 128.0 64.0 1.0 1.695568 14.660788
5 128.0 64.0 2.0 2.082407 19.980444
6 128.0 64.0 4.0 2.433674 19.924131
7 128.0 64.0 8.0 3.535085 21.541494
8 128.0 128.0 1.0 1.832434 14.898603
9 128.0 128.0 2.0 2.137683 19.689093
10 128.0 128.0 4.0 2.564278 21.878372
11 128.0 128.0 8.0 3.888136 23.623317
12 128.0 256.0 1.0 2.091580 15.850833
13 128.0 256.0 2.0 2.345080 21.415036
14 128.0 256.0 4.0 2.740070 21.677535
15 128.0 256.0 8.0 4.214144 24.138816
16 128.0 12.0 1.0 3.363476 11.810906
17 128.0 12.0 2.0 4.086299 19.232911
18 128.0 12.0 4.0 5.196735 19.951743
19 128.0 12.0 8.0 7.932992 21.698050
20 128.0 512.0 1.0 3.917015 19.575853
21 128.0 512.0 2.0 4.752946 24.443684
22 128.0 512.0 4.0 6.529157 25.821924
23 128.0 512.0 8.0 10.555618 28.067657
24 512.0 32.0 1.0 1.638882 17.125492
25 512.0 32.0 2.0 1.999735 23.120132
26 512.0 32.0 4.0 2.247950 23.036299
27 512.0 32.0 8.0 3.127355 23.925367
28 512.0 64.0 1.0 1.740753 15.580275
29 512.0 64.0 2.0 2.179911 23.328847
30 512.0 64.0 4.0 2.509109 24.131689
31 512.0 64.0 8.0 3.622789 24.281412
32 512.0 128.0 1.0 2.239927 18.465516
33 512.0 128.0 2.0 2.488771 26.375712
34 512.0 128.0 4.0 2.808165 26.076528
35 512.0 128.0 8.0 4.503539 26.507244
36 512.0 256.0 1.0 2.379934 24.298310
37 512.0 256.0 2.0 2.612163 31.943957
38 512.0 256.0 4.0 3.031515 32.432928
39 512.0 256.0 8.0 4.867126 32.959052
40 512.0 12.0 1.0 4.009180 15.938749
41 512.0 12.0 2.0 4.951622 22.493005
42 512.0 12.0 4.0 6.883740 22.376390
43 512.0 12.0 8.0 10.963751 22.862425
44 512.0 512.0 1.0 4.885562 36.050339
45 512.0 512.0 2.0 6.499027 42.411182
46 512.0 512.0 4.0 9.952232 45.588691
47 512.0 512.0 8.0 18.722834 46.178278
48 1024.0 32.0 1.0 1.679261 18.880756
49 1024.0 32.0 2.0 2.042411 24.436893
50 1024.0 32.0 4.0 2.285168 23.827322
51 1024.0 32.0 8.0 3.163615 24.795837
52 1024.0 64.0 1.0 1.885573 19.821977
53 1024.0 64.0 2.0 2.291389 25.667794
54 1024.0 64.0 4.0 2.631438 26.108969
55 1024.0 64.0 8.0 3.871169 26.612751
56 1024.0 128.0 1.0 2.505455 25.683555
57 1024.0 128.0 2.0 2.839051 31.938334
58 1024.0 128.0 4.0 3.360038 32.290966
59 1024.0 128.0 8.0 5.098808 32.825842
60 1024.0 256.0 1.0 2.739490 36.194292
61 1024.0 256.0 2.0 3.116406 41.909735
62 1024.0 256.0 4.0 3.740577 44.474649
63 1024.0 256.0 8.0 5.827517 44.637636
64 1024.0 12.0 1.0 5.902994 18.819952
65 1024.0 12.0 2.0 7.453297 23.324255
66 1024.0 12.0 4.0 10.854953 23.461952
67 1024.0 12.0 8.0 18.510209 23.939933
68 1024.0 512.0 1.0 7.175567 58.191771
69 1024.0 512.0 2.0 9.891369 64.094858
70 1024.0 512.0 4.0 16.385534 66.769785
71 1024.0 512.0 8.0 32.351776 69.287520
72 2048.0 32.0 1.0 1.829207 21.068421
73 2048.0 32.0 2.0 2.278805 26.714469
74 2048.0 32.0 4.0 2.528243 27.339196
75 2048.0 32.0 8.0 3.466488 27.450389
76 2048.0 64.0 1.0 2.053316 25.465890
77 2048.0 64.0 2.0 2.552917 31.511882
78 2048.0 64.0 4.0 3.000597 31.757535
79 2048.0 64.0 8.0 4.563727 33.727956
80 2048.0 128.0 1.0 2.777304 35.636103
81 2048.0 128.0 2.0 3.413895 43.372510
82 2048.0 128.0 4.0 4.428421 43.499526
83 2048.0 128.0 8.0 7.382649 44.146279
84 2048.0 256.0 1.0 3.026781 52.738649
85 2048.0 256.0 2.0 3.894977 64.701569
86 2048.0 256.0 4.0 5.129523 67.221177
87 2048.0 256.0 8.0 8.545173 67.414403
88 2048.0 12.0 1.0 9.092818 18.424893
89 2048.0 12.0 2.0 12.036246 26.188230
90 2048.0 12.0 4.0 18.287509 26.539472
91 2048.0 12.0 8.0 31.996807 26.236377
92 2048.0 512.0 1.0 11.394274 93.481671
93 2048.0 512.0 2.0 16.339157 112.272819
94 2048.0 512.0 4.0 28.238972 111.998075
95 2048.0 512.0 8.0 57.703463 117.038518
96 4096.0 32.0 1.0 1.909912 24.866872
97 4096.0 32.0 2.0 2.409515 34.820380
98 4096.0 32.0 4.0 2.832310 34.684684
99 4096.0 32.0 8.0 4.041181 34.478628
100 4096.0 64.0 1.0 2.338186 36.132783
101 4096.0 64.0 2.0 3.062580 44.554263
102 4096.0 64.0 4.0 4.054049 45.675747
103 4096.0 64.0 8.0 6.600514 44.666798
104 4096.0 128.0 1.0 3.378300 53.901834
105 4096.0 128.0 2.0 4.670523 65.184914
106 4096.0 128.0 4.0 6.755106 68.121627
107 4096.0 128.0 8.0 11.770620 68.404000
108 4096.0 256.0 1.0 4.146615 94.787196
109 4096.0 256.0 2.0 5.715296 108.005861
110 4096.0 256.0 4.0 8.249024 112.639070
111 4096.0 256.0 8.0 14.230330 114.896427
112 4096.0 12.0 1.0 14.855152 22.875291
113 4096.0 12.0 2.0 20.332816 32.244339
114 4096.0 12.0 4.0 32.310028 33.572924
115 4096.0 12.0 8.0 58.242789 33.343229
116 4096.0 512.0 1.0 19.630381 157.883435
117 4096.0 512.0 2.0 29.160358 165.134546
118 4096.0 512.0 4.0 51.915062 166.312485
119 4096.0 512.0 8.0 108.370604 166.565762
120 8192.0 32.0 1.0 2.202498 31.234844
121 8192.0 32.0 2.0 2.978675 51.025664
122 8192.0 32.0 4.0 3.775391 49.964385
123 8192.0 32.0 8.0 5.863031 50.203381
124 8192.0 64.0 1.0 2.968514 57.305209
125 8192.0 64.0 2.0 4.251871 68.014053
126 8192.0 64.0 4.0 6.086570 69.243657
127 8192.0 64.0 8.0 10.536131 71.471305
128 8192.0 128.0 1.0 5.247310 93.226201
129 8192.0 128.0 2.0 7.593929 111.472821
130 8192.0 128.0 4.0 11.353238 117.777406
131 8192.0 128.0 8.0 20.460854 121.017956
132 8192.0 256.0 1.0 6.502393 172.724941
133 8192.0 256.0 2.0 9.149885 187.754140
134 8192.0 256.0 4.0 13.808958 208.527293
135 8192.0 256.0 8.0 24.761531 207.168873
136 8192.0 12.0 1.0 26.413523 33.157307
137 8192.0 12.0 2.0 37.361001 48.213042
138 8192.0 12.0 4.0 60.033314 47.779363
139 8192.0 12.0 8.0 110.748711 46.231309
140 8192.0 512.0 1.0 36.897926 280.690457
141 8192.0 512.0 2.0 55.704671 292.949430
142 8192.0 512.0 4.0 100.482323 293.560653
143 8192.0 512.0 8.0 212.114406 295.395413
144 16384.0 32.0 1.0 2.796739 48.714223
145 16384.0 32.0 2.0 3.976485 77.360224
146 16384.0 32.0 4.0 5.612998 79.568379
147 16384.0 32.0 8.0 9.495054 80.312729
148 16384.0 64.0 1.0 4.511705 94.415055
149 16384.0 64.0 2.0 6.764167 114.123263
150 16384.0 64.0 4.0 10.206392 125.649587
151 16384.0 64.0 8.0 18.366090 117.413979
152 16384.0 128.0 1.0 8.200936 166.785992
153 16384.0 128.0 2.0 12.688144 204.176349
154 16384.0 128.0 4.0 20.394364 203.288089
155 16384.0 128.0 8.0 37.700544 220.648766
156 16384.0 256.0 1.0 10.139636 330.431987
157 16384.0 256.0 2.0 15.292813 376.930517
158 16384.0 256.0 4.0 24.716366 384.972801
159 16384.0 256.0 8.0 45.595276 413.776336
160 16384.0 12.0 1.0 49.194846 34.271523
161 16384.0 12.0 2.0 70.368763 76.725981
162 16384.0 12.0 4.0 116.357078 74.920762
163 16384.0 12.0 8.0 215.335563 71.006188
164 16384.0 512.0 1.0 90.523115 543.311063
165 16384.0 512.0 2.0 125.340486 559.867136
166 16384.0 512.0 4.0 212.382089 562.431008
167 16384.0 512.0 8.0 432.574569 579.504967
168 32768.0 32.0 1.0 4.206019 96.088441
169 32768.0 32.0 2.0 6.262916 138.751874
170 32768.0 32.0 4.0 9.389842 136.735505
171 32768.0 32.0 8.0 16.656640 143.173997
172 32768.0 64.0 1.0 7.315950 176.322663
173 32768.0 64.0 2.0 11.443072 223.520559
174 32768.0 64.0 4.0 18.351364 225.893009
175 32768.0 64.0 8.0 33.928980 227.137949
176 32768.0 128.0 1.0 14.121456 289.869092
177 32768.0 128.0 2.0 22.853408 372.613327
178 32768.0 128.0 4.0 38.377109 414.211812
179 32768.0 128.0 8.0 72.507660 420.528702
180 32768.0 256.0 1.0 17.926437 663.191171
181 32768.0 256.0 2.0 27.857530 755.398127
182 32768.0 256.0 4.0 46.754758 780.346909
183 32768.0 256.0 8.0 87.830342 821.121382
184 32768.0 12.0 1.0 95.320391 75.039072
185 32768.0 12.0 2.0 138.106252 131.030634
186 32768.0 12.0 4.0 227.255174 122.265509
187 32768.0 12.0 8.0 425.842783 127.552931
188 32768.0 512.0 1.0 183.148865 1143.182137
189 32768.0 512.0 2.0 249.909328 1172.065973
190 32768.0 512.0 4.0 422.945727 1174.436033
191 32768.0 512.0 8.0 859.151301 1187.168956
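For reference, the torch baseline that the ✅ checks above validate against presumably looks like the sketch below (whether the fused kernel renormalizes the selected top-k weights is an assumption here, not confirmed by this thread):

```python
import torch


def topk_sigmoid_ref(gating_logits: torch.Tensor, topk: int, renormalize: bool = True):
    # Sigmoid gating scores each expert independently (no cross-expert
    # softmax competition), then the top-k experts are selected per token.
    scores = torch.sigmoid(gating_logits.float())
    topk_weights, topk_ids = torch.topk(scores, k=topk, dim=-1)
    if renormalize:
        # Renormalize so each token's selected expert weights sum to 1.
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


logits = torch.tensor([[1.0, 2.0, 0.5, 3.0]])
weights, ids = topk_sigmoid_ref(logits, topk=2)
print(ids)      # experts 3 and 1 have the largest sigmoid scores
print(weights)  # weights sum to 1 after renormalization
```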
yukavio pushed a commit to yukavio/sglang that referenced this pull request on Nov 25, 2025
* Support moe topk sigmoid kernel (sgl-project#13049) Co-authored-by: xuebi <xuebi@minimaxi.com>
liusy58 <xiehang.lsy@alibaba-inc.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: Hao Chen <cighao@gmail.com> Co-authored-by: Morpheus Guo <yuechao.guo@amd.com> Co-authored-by: yuechguo <yuechguo@amd.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: sglang-bot <sglangbot@gmail.com> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: Glen Liu <62917497+glenliu21@users.noreply.github.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com> Co-authored-by: Baidu-AIAK <Baidu_AIAK@163.com> Co-authored-by: Chen Haozhe <c-34@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: ykwd <oneday117@qq.com> Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> Co-authored-by: Even Zhou <even.y.zhou@outlook.com> Co-authored-by: Roger Young <42564206+rogeryoungh@users.noreply.github.com> Co-authored-by: xuebi <xuebi@minimaxi.com> Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Sehoon Kim <sehoon@x.ai> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: Yuhao Yang <yhyang201@gmail.com> Co-authored-by: StonyPort <157573149+zhooooong@users.noreply.github.com> Co-authored-by: qiuxuan.lzw <qiuxuan.lzw@alibaba-inc.com> Co-authored-by: Zeyu Li <li_zeyu@pku.edu.cn> Co-authored-by: iLeGend <824040212@qq.com> Co-authored-by: joesun <shauntajoesph@gmail.com> Co-authored-by: Thomas Wang <1am9trash@gmail.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com> Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com> Co-authored-by: yctseng0211 <yctseng@amd.com> Co-authored-by: root <root@smci355-ccs-aus-m12-17.cs-aus.dcgpu> Co-authored-by: jacky.cheng <yichiche@amd.com> Co-authored-by: 
Lzhang-hub <57925599+Lzhang-hub@users.noreply.github.com> Co-authored-by: YanbingJiang <yanbing.jiang@intel.com> Co-authored-by: Fan Yin <1106310035@qq.com> Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com> Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Oasis-Git <ayw.sirius19@gmail.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: 赵晨阳 <zhaochen20@outlook.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: zyksir <zhuyikai.zyk@gmail.com> Co-authored-by: Zhuqi Li <zhli@x.ai> Co-authored-by: Michele Marzollo <37903931+michelemarzollo@users.noreply.github.com> Co-authored-by: ZeldaHuang <hzm414167@alibaba-inc.com> Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com> Co-authored-by: weibingo <weibing_lai@163.com> Co-authored-by: Jiajun Li <48857426+guapisolo@users.noreply.github.com> Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: roikoren755 <26850796+roikoren755@users.noreply.github.com> Co-authored-by: Shu Wang <shuw@nvidia.com> Co-authored-by: cctry <shiyang@x.ai> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Yijie Zhu <762412795@qq.com> Co-authored-by: ZhengdQin <zhengdqin@gmail.com> Co-authored-by: richhuan <huan_rz@qq.com> Co-authored-by: ZhengdQin <46387172+ZhengdQin@users.noreply.github.com> Co-authored-by: yinghui <32845984+cicirori@users.noreply.github.com> Co-authored-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Co-authored-by: ErsongWang <158176536+ErsongWang@users.noreply.github.com> Co-authored-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com> Co-authored-by: Swipe4057 <106391009+Swipe4057@users.noreply.github.com> Co-authored-by: liuhuijiayou <46172426+liuhuijiayou@users.noreply.github.com> 
Co-authored-by: Tiance Wang <wangtiance@gmail.com> Co-authored-by: wangtiance <tiancew@qq.com> Co-authored-by: Xu Yongfei <xuyongfei.xyf@antgroup.com> Co-authored-by: gaopengff <pengfei.gao@intel.com> Co-authored-by: ant-yy <vito.yy@antgroup.com> Co-authored-by: Zhi Yiliu <2584074296@qq.com> Co-authored-by: lzy <tomlzy213@gmail.com> Co-authored-by: Xinyue Zhang <xinyue.zhang@oracle.com> Co-authored-by: Yuhao Yao <37280700+yuhyao@users.noreply.github.com> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> Co-authored-by: Hanming Lu <hanming@x.ai> Co-authored-by: c30031083 <chenxu140@huawei.com> Co-authored-by: Nicolas Castet <26874160+nvcastet@users.noreply.github.com> Co-authored-by: Sam Li <lsam@nvidia.com> Co-authored-by: jackeyhua <jackeyhuasjtu@gmail.com> Co-authored-by: Siyuan Chen <41201609+SYChen123@users.noreply.github.com> Co-authored-by: Yibo Cai <cyb70289@gmail.com> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: Zaili Wang <109502517+ZailiWang@users.noreply.github.com> Co-authored-by: josephyou <josephyou@tencent.com>
Motivation
This PR introduces a `topk_sigmoid` CUDA kernel to support models such as MiniMax-M2 that require sigmoid-based expert routing. Our previous workaround was to use `grouped_topk` with `group_size=1`.

- Previous: 8 kernels were launched.
- Now: only 1 kernel is launched.
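As a rough illustration of what the fused kernel computes, here is a minimal NumPy reference (function name and shapes are illustrative, not the actual kernel API): apply a sigmoid gate to the router logits, then select the top-k experts per token in a single pass.

```python
import numpy as np

def topk_sigmoid_ref(logits, k):
    # Sigmoid gate over router logits: [num_tokens, num_experts]
    scores = 1.0 / (1.0 + np.exp(-logits))
    # Indices of the k highest-scoring experts per token
    topk_ids = np.argsort(-scores, axis=-1)[:, :k]
    # Gather the corresponding routing weights
    topk_weights = np.take_along_axis(scores, topk_ids, axis=-1)
    return topk_weights, topk_ids

logits = np.array([[0.0, 2.0, -1.0, 1.0]], dtype=np.float32)
weights, ids = topk_sigmoid_ref(logits, 2)  # picks experts 1 and 3
```

The fused CUDA kernel performs the gate, selection, and gather in one launch instead of the eight launches the `grouped_topk(group_size=1)` path required.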
Modifications
Accuracy Tests
We have validated the correctness of this change on MiniMax-M2, achieving an accuracy of 0.93 on GSM8K and 0.803 on AIME2025.
Benchmarking and Profiling
Performance benchmarks on the MiniMax-M2 model confirm that this optimization improves overall throughput by approximately 10%.
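For reference, a micro-benchmark of the routing step has roughly the following shape (this is a CPU-side NumPy sketch for illustration only; the actual GPU benchmark lives in `sgl-kernel/benchmark/bench_moe_topk_sigmoid.py`, and the batch/expert sizes below are arbitrary):

```python
import time
import numpy as np

def sigmoid_topk(logits, k):
    # Reference routing: sigmoid gate, then top-k experts per token.
    s = 1.0 / (1.0 + np.exp(-logits))
    ids = np.argsort(-s, axis=-1)[:, :k]
    return np.take_along_axis(s, ids, axis=-1), ids

def bench(fn, *args, iters=20):
    fn(*args)  # warm-up call before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters * 1e6  # mean microseconds per call

logits = np.random.randn(1024, 256).astype(np.float32)  # tokens x experts
print(f"sigmoid+topk reference: {bench(sigmoid_topk, logits, 8):.1f} us/iter")
```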
Checklist