moe fused_gate_kernel support n_share_experts_fusion param #5125
Conversation
```diff
  for (int ii = 0; ii < topk; ++ii) {
    int64_t const idx = topk * thread_row + ii;
-   output_ptr[idx] = static_cast<float>(static_cast<T>(output_ptr[idx]) / static_cast<T>(output_sum));
+   output_ptr[idx] = output_ptr[idx] / output_sum;
```
wondering why we remove this static cast
`output_ptr` itself is of type `float`, and `output_sum` is also of type `float`. This `static_cast` is completely unnecessary.
I see, then I wonder why the original code was like that...
My guess is they wanted to emulate some accuracy loss, but that looks pretty weird as well...
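For what it's worth, the effect of the removed round-trip is easy to see in isolation (a standalone sketch, assuming `T` is e.g. `half`):

```python
import torch

x = torch.rand(8, dtype=torch.float32)
s = x.sum()

new_way = x / s                            # the new code: pure fp32 division
old_way = (x.half() / s.half()).float()    # the old code: round-trip through T = half

# the round-trip loses mantissa bits, so the results differ slightly
print((new_way - old_way).abs().max())
```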
fzyzcjy left a comment:
briefly checked and only some nits
```diff
  num_expert_group: int = 0,
  topk_group: int = 0,
+ n_share_experts_fusion: int = 0,
+ routed_scaling_factor: float = 2.5,
```
nit: wondering whether we should make it default to `nan` or `None`, because 2.5 is very specific to DeepSeek. If some place in the code accidentally forgets to pass in the variable, we will silently get a 2.5 that is hard to debug. On the other hand, if we set it to `None`, then if one day another model enables fusion but we forget to pass in a value, we will get a big error and immediately know it is a bug and where it came from.
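i.e., something like the following sketch (the loud `ValueError` check is a hypothetical illustration of the suggestion, not code from the PR):

```python
from typing import Optional

def select_experts(
    # ... other routing args elided ...
    n_share_experts_fusion: int = 0,
    routed_scaling_factor: Optional[float] = None,  # instead of a DeepSeek-specific 2.5
):
    if n_share_experts_fusion > 0 and routed_scaling_factor is None:
        # fail loudly at the call site instead of silently scaling by 2.5
        raise ValueError(
            "routed_scaling_factor must be set when n_share_experts_fusion > 0"
        )
```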
```diff
+ n_share_experts_fusion = 0
+ if global_server_args_dict["n_share_experts_fusion"] is not None:
+     n_share_experts_fusion = global_server_args_dict["n_share_experts_fusion"]
+ routed_scaling_factor = global_server_args_dict["routed_scaling_factor"]
```
hmm, wondering whether it is good to have it as a global variable... what about passing it in as a normal variable, especially because `select_experts` already adds this arg
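For illustration, the shape of the suggestion (a minimal sketch; `forward_gate` and the stub `select_experts` body are hypothetical stand-ins for the real call sites):

```python
import torch

def select_experts(router_logits, top_k, n_share_experts_fusion=0):
    # stand-in for the real routing entry point, which already accepts this arg
    return router_logits.topk(top_k, dim=-1)

def forward_gate(router_logits, top_k, server_args: dict):
    # resolve the config once at the boundary and hand it down explicitly,
    # rather than having select_experts read global_server_args_dict itself
    fusion = server_args.get("n_share_experts_fusion") or 0
    return select_experts(router_logits, top_k, n_share_experts_fusion=fusion)
```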
```diff
  warmups: Optional[str] = None
- n_share_experts_fusion: int = 0
+ n_share_experts_fusion: Optional[int] = None
+ routed_scaling_factor: Optional[float] = None
```
nit: it seems that the scaling factor will be picked up from the model config, so maybe we do not need to pass it in as an arg
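e.g., roughly (a hypothetical helper; DeepSeek-V3 HF configs do define `routed_scaling_factor`, and the 1.0 fallback is an assumption of this sketch):

```python
def resolve_routed_scaling_factor(hf_config) -> float:
    # DeepSeek-V3 configs define routed_scaling_factor (2.5); falling back to
    # 1.0 (no extra scaling) for other models is an assumption of this sketch
    return getattr(hf_config, "routed_scaling_factor", 1.0)
```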
```diff
  )
+ elif n_share_experts_fusion is None:
+     n_share_experts_fusion = tp_size
+     global_server_args_dict["n_share_experts_fusion"] = tp_size
```
nit: it seems this function is impure (indeed half pure and half impure); what about making it a pure function, i.e. not doing mutations inside it, to reduce cognitive overhead
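A minimal sketch of the idea, covering just the defaulting branch shown above (the helper name is hypothetical):

```python
def resolve_n_share_experts_fusion(arg_value, tp_size):
    # pure: just compute and return; no write to global_server_args_dict here
    return tp_size if arg_value is None else arg_value

# the caller then performs the single mutation in one obvious place
server_args_dict = {"n_share_experts_fusion": resolve_n_share_experts_fusion(None, tp_size=8)}
```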
```diff
+ } else if (thread_group_idx == 0) {
+   // If not using shared experts, add the last weight to output_sum
+   output_sum += output_ptr[topk * thread_row + (topk - 1)];
+ }
```
About the `////////////////////// Topk //////////////////////` section: it seems we are still spending time computing `real_topk + num_fusion` things. For example, for llama4, it may be top2 instead of top1.

Thus what about:

- Suppose we remove the name `topk`, and have `topk_excluding_share_expert_fusion` or `topk_with_shared_expert_fusion` etc. (to avoid users getting confused about which one is which)
- In the `////////////////////// Topk //////////////////////` section above, we use `topk_excluding_share_expert_fusion` to compute everything (thus only top1 in llama4)
- Then here we do not need this else-if as well (see the sketch below)
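A plain-PyTorch sketch of the suggested structure (illustrative only; the shared-slot weight follows the DeepSeek-style convention of routed-sum divided by `routed_scaling_factor`, which is an assumption about the kernel's semantics):

```python
import torch

def fused_gate_ref(scores, topk_excluding_share_expert_fusion,
                   n_share_experts_fusion, num_experts, routed_scaling_factor):
    # Top-k runs over routed experts only (so top1 for llama4, not top2),
    # and the normalization loop never has to special-case the last slot.
    weights, ids = scores.topk(topk_excluding_share_expert_fusion, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    if n_share_experts_fusion > 0:
        n = scores.size(0)
        # append one slot pointing at a shared expert; no else-if needed here
        shared_ids = torch.randint(num_experts, num_experts + n_share_experts_fusion, (n, 1))
        shared_w = weights.sum(dim=-1, keepdim=True) / routed_scaling_factor
        ids = torch.cat([ids, shared_ids], dim=-1)
        weights = torch.cat([weights, shared_w], dim=-1)
    return weights, ids
```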
```diff
+ valid_max = num_experts + n_share_experts_fusion
+ shared_indices = original_indices[:, -1]
+ shared_ref_indices = original_ref_indices[:, -1]
+ if shared_indices is not None:
```
nit: wondering when it will be `None`, since it looks like `original_indices[:, -1]` is always a tensor
```diff
@@ -57,6 +59,8 @@ __device__ void moe_fused_gate_impl(
      int64_t num_rows,
      int64_t topk_group,
```
question: is it mathematically incorrect, for the shared expert fusion thing, if we do `topk += n_share_experts_fusion` but keep `topk_group` unchanged? I have not dug into how it is computed, so just a quick question. If so, the quickest way to fix it may be applying the nits above, i.e. letting the topk logic compute `topk_excluding_share_expert_fusion` instead of the one including it.
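For context, a plain-PyTorch reference of what `topk_group` gates (illustrative only, following the common DeepSeek-style group-limited routing in which a group is scored by its max logit; not the kernel's exact implementation):

```python
import torch

def group_limited_topk(scores, topk, num_expert_group, topk_group):
    # topk_group limits WHICH expert groups may contribute; the final top-k
    # (plus any fused shared slot appended afterwards) then selects within
    # the surviving groups, so it is a different knob from topk itself.
    n_tokens, n_experts = scores.shape
    group_scores = scores.view(n_tokens, num_expert_group, -1).amax(dim=-1)
    top_groups = group_scores.topk(topk_group, dim=-1).indices
    mask = torch.zeros_like(group_scores, dtype=torch.bool)
    mask.scatter_(1, top_groups, True)
    mask = mask.repeat_interleave(n_experts // num_expert_group, dim=1)
    return scores.masked_fill(~mask, float("-inf")).topk(topk, dim=-1)
```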
Using this latest submission, combined with my added NextN (fused shared experts opened), on a dual-node A800 environment, the accuracy scores are as follows:
Motivation
moe `fused_gate_kernel` support `n_share_experts_fusion` param

UnitTest
GSM8K benchmark
Benchmark on H200
main branch (fused shared experts closed):

```
python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001 --disable-shared-experts-fusion
python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 4 --port 30001
```

```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.0
Max reqeuest concurrency:                not set
Successful requests:                     300
Benchmark duration (s):                  128.59
Total input tokens:                      95293
Total generated tokens:                  60411
Total generated tokens (retokenized):    60143
Request throughput (req/s):              2.33
Input token throughput (tok/s):          741.08
Output token throughput (tok/s):         469.81
Total token throughput (tok/s):          1210.89
Concurrency:                             62.43
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   26757.12
Median E2E Latency (ms):                 20392.68
---------------Time to First Token----------------
Mean TTFT (ms):                          811.22
Median TTFT (ms):                        588.20
P99 TTFT (ms):                           3221.68
---------------Inter-Token Latency----------------
Mean ITL (ms):                           129.83
Median ITL (ms):                         51.89
P95 ITL (ms):                            620.79
P99 ITL (ms):                            1170.19
Max ITL (ms):                            2894.62
==================================================
```

main branch (fused shared experts opened):

```
python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001
python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 4 --port 30001
```

```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.0
Max reqeuest concurrency:                not set
Successful requests:                     300
Benchmark duration (s):                  125.97
Total input tokens:                      95293
Total generated tokens:                  60411
Total generated tokens (retokenized):    60149
Request throughput (req/s):              2.38
Input token throughput (tok/s):          756.50
Output token throughput (tok/s):         479.59
Total token throughput (tok/s):          1236.09
Concurrency:                             64.55
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   27102.78
Median E2E Latency (ms):                 20786.24
---------------Time to First Token----------------
Mean TTFT (ms):                          3336.97
Median TTFT (ms):                        569.67
P99 TTFT (ms):                           18203.60
---------------Inter-Token Latency----------------
Mean ITL (ms):                           118.92
Median ITL (ms):                         51.17
P95 ITL (ms):                            559.23
P99 ITL (ms):                            652.11
Max ITL (ms):                            22040.79
==================================================
```

pr (fused shared experts opened and use `moe_fused_gate` kernel):

```
python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001
python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 4 --port 30001
```

```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.0
Max reqeuest concurrency:                not set
Successful requests:                     300
Benchmark duration (s):                  121.25
Total input tokens:                      95293
Total generated tokens:                  60411
Total generated tokens (retokenized):    60140
Request throughput (req/s):              2.47
Input token throughput (tok/s):          785.92
Output token throughput (tok/s):         498.23
Total token throughput (tok/s):          1284.15
Concurrency:                             58.32
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23570.69
Median E2E Latency (ms):                 18271.27
---------------Time to First Token----------------
Mean TTFT (ms):                          2831.89
Median TTFT (ms):                        420.14
P99 TTFT (ms):                           19307.38
---------------Inter-Token Latency----------------
Mean ITL (ms):                           103.78
Median ITL (ms):                         49.22
P95 ITL (ms):                            253.46
P99 ITL (ms):                            441.31
Max ITL (ms):                            18028.75
==================================================
```

Total token throughput (tok/s):

| Configuration | Total token throughput (tok/s) |
| --- | --- |
| main branch (fused shared experts closed) | 1210.89 |
| main branch (fused shared experts opened) | 1236.09 |
| pr (fused shared experts opened, `moe_fused_gate` kernel) | 1284.15 |