Optimize moe align block size kernel#7794

Merged
ispobock merged 6 commits into sgl-project:main from ispobock:opt-moe-align
Jul 7, 2025

Conversation

Collaborator

@ispobock ispobock commented Jul 5, 2025

Motivation

  • Optimize the prefix sum with a Blelloch scan (O(log N) depth)
  • Reduce global memory accesses
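For reference, the Blelloch scan mentioned above is the standard work-efficient exclusive prefix sum. The sketch below is a serial Python model of the algorithm (not the PR's CUDA code); on a GPU each iteration of the inner loops runs in parallel across threads, so the depth is O(log N) rather than the O(N) of a serial scan.

```python
def blelloch_exclusive_scan(data):
    """Work-efficient exclusive prefix sum; len(data) assumed a power of two."""
    n = len(data)
    a = list(data)

    # Up-sweep (reduce) phase: build partial sums in a binary tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2

    # Down-sweep phase: clear the root, then push prefixes down the tree.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += t
        d //= 2
    return a

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# exclusive prefix sum: [0, 3, 4, 11, 11, 15, 16, 22]
```

In the kernel, the two phases run over a shared-memory buffer with a `__syncthreads()` barrier between tree levels, which is what removes the serial dependency of the previous prefix-sum loop.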

main:

    num_tokens  num_experts  topk        SGL  SGL Fusion      Triton
0          1.0        128.0   1.0  22.688000   20.624001   32.384001
1          1.0        128.0   2.0  22.688000   20.800000   32.512002
2          1.0        128.0   4.0  22.688000   20.768000   32.543998
3          1.0        128.0   8.0  22.528000   20.640001   32.416001
4          1.0        256.0   1.0  25.823999   23.808001   46.208002
5          1.0        256.0   2.0  25.855999   23.936000   46.239998
6          1.0        256.0   4.0  25.888000   23.903999   46.144001
7          1.0        256.0   8.0  25.760001   23.903999   46.239998
8          8.0        128.0   1.0  22.528000   20.640001   32.543998
9          8.0        128.0   2.0  22.560000   20.640001   32.639999
10         8.0        128.0   4.0  22.560000   20.608000   32.671999
11         8.0        128.0   8.0  22.624001   20.959999   32.960001
12         8.0        256.0   1.0  25.855999   24.192000   46.303999
13         8.0        256.0   2.0  25.792001   24.127999   46.271998
14         8.0        256.0   4.0  25.599999   24.064001   46.399999
15         8.0        256.0   8.0  25.823999   24.160000   46.528000
16        16.0        128.0   1.0  22.560000   20.703999   32.607999
17        16.0        128.0   2.0  22.560000   20.640001   32.639999
18        16.0        128.0   4.0  22.624001   20.864001   32.864001
19        16.0        128.0   8.0  22.560000   21.120001   32.896001
20        16.0        256.0   1.0  25.760001   24.032000   46.367999
21        16.0        256.0   2.0  25.760001   24.127999   46.496000
22        16.0        256.0   4.0  26.016001   24.288001   46.560001
23        16.0        256.0   8.0  26.016001   24.704000   46.688002
24        32.0        128.0   1.0  22.560000   20.640001   32.575998
25        32.0        128.0   2.0  22.655999   20.864001   32.864001
26        32.0        128.0   4.0  22.592001   21.056000   32.928001
27        32.0        128.0   8.0  22.784000   21.504000   34.208000
28        32.0        256.0   1.0  25.792001   24.127999   46.239998
29        32.0        256.0   2.0  25.664000   24.192000   46.496000
30        32.0        256.0   4.0  25.920000   24.607999   46.432000
31        32.0        256.0   8.0  26.016001   25.376000   45.568001
32        64.0        128.0   1.0  22.624001   20.864001   33.008000
33        64.0        128.0   2.0  22.655999   21.183999   32.896001
34        64.0        128.0   4.0  22.848001   21.600001   34.143999
35        64.0        128.0   8.0  22.592001   21.600001   35.232000
36        64.0        256.0   1.0  25.823999   24.160000   46.464000
37        64.0        256.0   2.0  25.920000   24.576001   46.528000
38        64.0        256.0   4.0  25.855999   25.152000   45.600001
39        64.0        256.0   8.0  25.855999   25.696000  109.407999
40       128.0        128.0   1.0  22.528000   21.024000   33.728000
41       128.0        128.0   2.0  22.720000   21.536000   34.400001
42       128.0        128.0   4.0  22.560000   21.600001   35.615999
43       128.0        128.0   8.0  22.528000   21.472000   36.448002
44       128.0        256.0   1.0  25.920000   24.607999   46.560001
45       128.0        256.0   2.0  26.048001   25.248000   45.600001
46       128.0        256.0   4.0  25.855999   25.696000   46.432000
47       128.0        256.0   8.0  26.016001   25.952000   48.191998
48       256.0        128.0   1.0  22.879999   21.504000   34.208000
49       256.0        128.0   2.0  22.560000   21.504000   35.135999
50       256.0        128.0   4.0  22.592001   21.472000   36.288001
51       256.0        128.0   8.0  23.552001   22.496000   39.584000
52       256.0        256.0   1.0  26.048001   25.312001   45.600001
53       256.0        256.0   2.0  25.855999   25.599999   46.319999
54       256.0        256.0   4.0  25.760001   25.823999   48.128001
55       256.0        256.0   8.0  27.008001   26.944000   49.568001
56       512.0        128.0   1.0  22.592001   21.568000   35.135999
57       512.0        128.0   2.0  22.592001   21.568000   36.479998
58       512.0        128.0   4.0  23.536000   22.528000   39.391998
59       512.0        128.0   8.0  24.672000   23.488000   43.136001
60       512.0        256.0   1.0  26.048001   25.760001   46.528000
61       512.0        256.0   2.0  25.984000   25.984000   48.896000
62       512.0        256.0   4.0  27.008001   27.008001   49.663998
63       512.0        256.0   8.0  28.016000   28.160000   53.824000
64      1024.0        128.0   1.0  22.560000   21.504000   36.384001
65      1024.0        128.0   2.0  23.552001   22.496000   39.519999
66      1024.0        128.0   4.0  24.639999   23.615999   43.072000
67      1024.0        128.0   8.0  26.944000   25.920000   49.472000
68      1024.0        256.0   1.0  25.888000   25.984000   47.839999
69      1024.0        256.0   2.0  26.976001   26.976001   49.695998
70      1024.0        256.0   4.0  28.192000   28.192000   53.952001
71      1024.0        256.0   8.0  30.624000   30.688001   59.583999
72      2048.0        128.0   1.0  23.552001   22.464000   39.648000
73      2048.0        128.0   2.0  24.576001   23.520000   43.488000
74      2048.0        128.0   4.0  27.071999   26.048001   49.568001
75      2048.0        128.0   8.0  31.711999   31.008000   61.087999
76      2048.0        256.0   1.0  26.976001   26.944000   49.568001
77      2048.0        256.0   2.0  28.192000   28.160000   53.920001
78      2048.0        256.0   4.0  30.624000   30.784000   59.551999
79      2048.0        256.0   8.0  35.936002   36.095999   68.768002
80      4096.0        128.0   1.0  24.672000   23.615999   42.975999
81      4096.0        128.0   2.0  27.008001   25.984000   49.440000
82      4096.0        128.0   4.0  31.711999   31.199999   61.184000
83      4096.0        128.0   8.0  41.536000   41.855998   83.711997
84      4096.0        256.0   1.0  28.192000   28.192000   53.984001
85      4096.0        256.0   2.0  30.592000   30.784000   59.776001
86      4096.0        256.0   4.0  36.224000   36.160000   68.752002
87      4096.0        256.0   8.0  42.495999   43.680001   80.735996
88      8192.0        128.0   1.0  27.008001   25.888000   49.279999
89      8192.0        128.0   2.0  31.615999   31.199999   61.184000
90      8192.0        128.0   4.0  41.503999   41.760001   84.576003
91      8192.0        128.0   8.0  59.967998   62.560000  130.736001
92      8192.0        256.0   1.0  30.560000   30.751999   59.615999
93      8192.0        256.0   2.0  35.840001   36.160000   68.383999
94      8192.0        256.0   4.0  42.399999   43.488000   80.895998
95      8192.0        256.0   8.0  58.816001   62.208001  105.632000

This PR:

    num_tokens  num_experts  topk        SGL  SGL Fusion      Triton
0          1.0        128.0   1.0  20.576000   18.751999   31.776000
1          1.0        128.0   2.0  20.671999   18.751999   31.711999
2          1.0        128.0   4.0  20.479999   18.719999   31.776000
3          1.0        128.0   8.0  20.544000   18.751999   31.744000
4          1.0        256.0   1.0  20.896001   19.168001   45.536000
5          1.0        256.0   2.0  20.864001   19.168001   45.472000
6          1.0        256.0   4.0  20.864001   19.200001   45.440000
7          1.0        256.0   8.0  20.896001   19.168001   45.664001
8          8.0        128.0   1.0  20.544000   18.751999   31.776000
9          8.0        128.0   2.0  20.576000   18.880000   32.063998
10         8.0        128.0   4.0  20.671999   18.912001   32.352000
11         8.0        128.0   8.0  20.640001   18.912001   32.320000
12         8.0        256.0   1.0  20.896001   19.264000   45.616001
13         8.0        256.0   2.0  20.864001   19.168001   45.600001
14         8.0        256.0   4.0  20.864001   19.296000   45.728002
15         8.0        256.0   8.0  20.896001   19.520000   45.887999
16        16.0        128.0   1.0  20.544000   18.848000   31.936001
17        16.0        128.0   2.0  20.544000   18.880000  377.920002
18        16.0        128.0   4.0  20.608000   19.072000   32.288000
19        16.0        128.0   8.0  20.560000   19.264000   32.384001
20        16.0        256.0   1.0  20.896001   19.168001   45.696001
21        16.0        256.0   2.0  20.992000   19.328000   45.855999
22        16.0        256.0   4.0  21.008000   19.552000   46.048000
23        16.0        256.0   8.0  20.992000   19.840000   46.048000
24        32.0        128.0   1.0  20.576000   18.816000   32.143999
25        32.0        128.0   2.0  20.608000   18.912001   32.288000
26        32.0        128.0   4.0  20.576000   19.264000   32.480001
27        32.0        128.0   8.0  20.800000   19.487999   33.599999
28        32.0        256.0   1.0  20.959999   19.296000   45.855999
29        32.0        256.0   2.0  20.896001   19.584000   45.936000
30        32.0        256.0   4.0  20.992000   19.808000   45.919999
31        32.0        256.0   8.0  21.152001   20.352000   45.216002
32        64.0        128.0   1.0  20.703999   19.072000   32.224000
33        64.0        128.0   2.0  20.608000   19.200001   32.352000
34        64.0        128.0   4.0  20.800000   19.680001   33.504002
35        64.0        128.0   8.0  20.640001   19.648001   34.527998
36        64.0        256.0   1.0  20.959999   19.552000   45.823999
37        64.0        256.0   2.0  21.056000   19.840000   45.919999
38        64.0        256.0   4.0  21.120001   20.384001   45.024000
39        64.0        256.0   8.0  20.992000   20.640001   45.696001
40       128.0        128.0   1.0  20.671999   19.200001   32.352000
41       128.0        128.0   2.0  20.736000   19.487999   33.599999
42       128.0        128.0   4.0  21.024000   19.600000   34.559999
43       128.0        128.0   8.0  20.832000   19.552000   35.999998
44       128.0        256.0   1.0  21.152001   19.936001   46.016000
45       128.0        256.0   2.0  21.152001   20.288000   45.024000
46       128.0        256.0   4.0  21.056000   20.703999   45.504000
47       128.0        256.0   8.0  21.024000   20.800000   47.456000
48       256.0        128.0   1.0  20.800000   19.487999   33.440001
49       256.0        128.0   2.0  20.640001   19.584000   34.720000
50       256.0        128.0   4.0  20.640001   19.648001   35.904001
51       256.0        128.0   8.0  21.695999   20.512000   39.104000
52       256.0        256.0   1.0  21.152001   20.352000   45.184001
53       256.0        256.0   2.0  21.120001   20.736000   46.239998
54       256.0        256.0   4.0  21.136001   20.640001   77.087998
55       256.0        256.0   8.0  22.175999   98.272003   49.263999
56       512.0        128.0   1.0  20.608000   19.552000   34.591999
57       512.0        128.0   2.0  20.703999   19.584000   35.872001
58       512.0        128.0   4.0  21.663999   20.544000   39.487999
59       512.0        128.0   8.0  22.752000   21.632001   42.447999
60       512.0        256.0   1.0  21.120001   20.608000   45.728002
61       512.0        256.0   2.0  21.120001   20.832000   47.504000
62       512.0        256.0   4.0  22.175999   22.112001   49.152002
63       512.0        256.0   8.0  23.360001   23.167999   53.360000
64      1024.0        128.0   1.0  20.671999   19.648001   35.936002
65      1024.0        128.0   2.0  21.568000   20.512000   38.943999
66      1024.0        128.0   4.0  22.720000   21.600001   42.399999
67      1024.0        128.0   8.0  24.992000   23.936000   48.512001
68      1024.0        256.0   1.0  21.024000   20.800000   47.488000
69      1024.0        256.0   2.0  22.143999   21.919999   48.960000
70      1024.0        256.0   4.0  23.328001   23.200000   53.344000
71      1024.0        256.0   8.0  25.760001   25.664000   58.975998
72      2048.0        128.0   1.0  21.648000   20.703999   39.071999
73      2048.0        128.0   2.0  22.816001   21.663999   42.624000
74      2048.0        128.0   4.0  25.024001   23.968000   48.799999
75      2048.0        128.0   8.0  29.664000   28.896000   60.479999
76      2048.0        256.0   1.0  22.175999   22.016000   48.928000
77      2048.0        256.0   2.0  23.328001   23.167999   53.376000
78      2048.0        256.0   4.0  25.792001   25.680000   58.688000
79      2048.0        256.0   8.0  31.072000   31.136001   67.936003
80      4096.0        128.0   1.0  22.736000   21.728000   42.495999
81      4096.0        128.0   2.0  24.960000   23.936000   48.815999
82      4096.0        128.0   4.0  29.727999   28.991999   59.840001
83      4096.0        128.0   8.0  39.360002   39.680000   83.584003
84      4096.0        256.0   1.0  23.328001   23.264000   53.536002
85      4096.0        256.0   2.0  25.920000   25.855999   59.360001
86      4096.0        256.0   4.0  31.199999   31.008000   67.263998
87      4096.0        256.0   8.0  37.439998   38.336001   80.256000
88      8192.0        128.0   1.0  24.960000   24.000000   48.735999
89      8192.0        128.0   2.0  29.568000   28.960001   60.160000
90      8192.0        128.0   4.0  39.296001   39.680000   83.839998
91      8192.0        128.0   8.0  57.760000   60.192000  129.664004
92      8192.0        256.0   1.0  25.792001   25.696000   58.848001
93      8192.0        256.0   2.0  31.072000   31.072000   68.127997
94      8192.0        256.0   4.0  37.503999   38.400002   79.903997
95      8192.0        256.0   8.0  54.496001   57.567999  106.527999

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Summary of Changes

Hello @ispobock, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a substantial optimization to the moe_align_block_size_kernel, a critical component for Mixture-of-Experts (MoE) models. The core improvement involves replacing the existing prefix sum computation with a highly efficient Blelloch scan algorithm, which is designed to minimize global memory access and enhance parallel processing on the GPU. These changes result in significant performance improvements across a range of configurations, making the MoE alignment process faster and more efficient.

Highlights

  • Kernel Optimization: The moe_align_block_size_kernel has been significantly optimized by replacing its prefix sum calculation with a more efficient Blelloch scan algorithm. This change aims to reduce global memory access and improve performance.
  • Shared Memory Management: The kernel's shared memory usage has been re-architected to support the new Blelloch scan, including dedicated buffers for counts, prefixes, and scan operations, leading to more efficient resource utilization.
  • Data Type Simplification: The custom AlignedArray struct for vector types has been removed in favor of directly using int4, simplifying the code and potentially leveraging native vector operations for better performance.
  • Performance Improvements: The changes yield notable performance gains, particularly for the 'SGL Fusion' and 'SGL' metrics across various num_tokens, num_experts, and topk configurations, as evidenced by the provided benchmarks.
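To make the summary concrete, here is a simplified Python model of what a moe_align_block_size computation produces (the function name and shape here are illustrative, not the kernel's actual signature): count tokens per expert, pad each count up to a multiple of block_size, then take an exclusive prefix sum of the padded counts to get each expert's block-aligned write offset. The prefix-sum step is exactly what this PR parallelizes with the Blelloch scan.

```python
def align_block_size(topk_ids, num_experts, block_size):
    """Toy model: per-expert padded counts and block-aligned offsets."""
    counts = [0] * num_experts
    for e in topk_ids:              # histogram of expert assignments
        counts[e] += 1

    # Round each expert's token count up to a multiple of block_size.
    padded = [(c + block_size - 1) // block_size * block_size for c in counts]

    # Exclusive prefix sum -> start offset of each expert's segment.
    offsets, total = [], 0
    for p in padded:
        offsets.append(total)
        total += p
    return offsets, total           # total == number of padded token slots

offs, total = align_block_size([0, 1, 1, 2, 2, 2], num_experts=4, block_size=4)
# counts=[1,2,3,0] -> padded=[4,4,4,0] -> offsets=[0,4,8,12], total=12
```

In the real kernel these counts and offsets live in the shared-memory buffers the summary describes, and tokens are then scattered into their expert's padded segment.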

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes the moe_align_block_size_kernel with a parallel Blelloch scan and improved load balancing. A minor issue was identified in the shared memory calculation.

Collaborator

@BBuf BBuf left a comment


Great job. Please add end-to-end accuracy testing for DeepSeek V3.

@Alcanderian
Collaborator

A block_scan_warp_scan implementation might also be better for specific num_experts values; we can try it next time.
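For readers unfamiliar with the term, a "block scan via warp scans" decomposes the block-wide prefix sum hierarchically: scan each 32-lane warp independently, scan the per-warp totals, then add each warp's base offset back to its lanes. The sketch below is a serial Python model of that data flow (the names are illustrative, not from this PR); on a GPU the per-warp scans would use shuffle intrinsics with no shared-memory traffic.

```python
WARP = 32  # lanes per warp on NVIDIA GPUs

def inclusive_scan(xs):
    """Serial inclusive prefix sum, used here as the ground truth."""
    out, run = [], 0
    for x in xs:
        run += x
        out.append(run)
    return out

def block_scan_warp_scan(data):
    """Two-level scan: per-warp scans, then a scan over warp totals."""
    warps = [data[i:i + WARP] for i in range(0, len(data), WARP)]
    warp_scans = [inclusive_scan(w) for w in warps]        # step 1: scan each warp
    warp_totals = [s[-1] for s in warp_scans]
    warp_offsets = [0] + inclusive_scan(warp_totals)[:-1]  # step 2: scan the totals
    # Step 3: add each warp's base offset to its lanes.
    return [v + off for s, off in zip(warp_scans, warp_offsets) for v in s]

result = block_scan_warp_scan(list(range(64)))  # matches inclusive_scan(range(64))
```

The appeal for small, fixed num_experts (128 or 256, i.e. 4 or 8 warps) is that step 2 degenerates to a scan over a handful of values, whereas the Blelloch tree still pays a barrier per level.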

@ispobock
Collaborator Author

ispobock commented Jul 6, 2025

Tested on Qwen/Qwen3-235B-A22B-FP8, the accuracy is correct:

Accuracy: 0.952
Invalid: 0.000
Latency: 106.039 s
Output throughput: 1731.616 token/s

@ispobock ispobock enabled auto-merge (squash) July 6, 2025 16:50
@ispobock ispobock merged commit a3398d8 into sgl-project:main Jul 7, 2025
92 of 98 checks passed
@ispobock ispobock deleted the opt-moe-align branch July 7, 2025 01:20
@yuan-luo
Collaborator

yuan-luo commented Jul 9, 2025

And maybe block_scan_warp_scan implementation will be better for specify num_experts, we can try it next time

#7884

chenxijun1029 pushed a commit to chenxijun1029/sglang that referenced this pull request Jul 17, 2025