Add moe topk softmax templated from vllm#4302
Conversation
Once this PR #4432 is merged you can use that instead. Does it sound good to you?
max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 4));
max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 2));
max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 1));
max_value = fmaxf(max_value, SGLANG_SHFL_XOR_SYNC(0xffffffff, max_value, 16));

#else
#define SGLANG_SHFL_XOR_SYNC(mask, var, lane_mask) __shfl_xor((var), (lane_mask))
#define SGLANG_SHFL_XOR_SYNC_WIDTH(mask, var, lane_mask, width) __shfl_xor((var), (lane_mask), (width))
#endif
I would have kept these lines, since you use the ROCm-specific macro in many places (the plain CUDA operation is no longer safe with this approach). But #4432 is merged, so the macro is no longer needed.
Thank you! I'll change those and remove the definition.
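For reference, the full conditional presumably looked something like the sketch below; only the ROCm #else branch appears in the diff above, so the CUDA branch (and the USE_ROCM guard name) is an assumption:

#ifndef USE_ROCM  // guard name assumed; only the #else branch is shown in the diff above
#define SGLANG_SHFL_XOR_SYNC(mask, var, lane_mask) __shfl_xor_sync((mask), (var), (lane_mask))
#define SGLANG_SHFL_XOR_SYNC_WIDTH(mask, var, lane_mask, width) __shfl_xor_sync((mask), (var), (lane_mask), (width))
#else
#define SGLANG_SHFL_XOR_SYNC(mask, var, lane_mask) __shfl_xor((var), (lane_mask))
#define SGLANG_SHFL_XOR_SYNC_WIDTH(mask, var, lane_mask, width) __shfl_xor((var), (lane_mask), (width))
#endif

The mask argument is dropped on the ROCm path because HIP's __shfl_xor takes no participation mask; that mismatch is why the macro existed in the first place.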
const int thread_row_offset = blockIdx.x * num_cols;
cub::Sum sum;
hipCUB is an experimental one; we can try it, but it introduces new dependencies. Could we just use a simple reduction kernel?
Definitely. We can change to customized reductions for both max and sum. I'll do it together with the macro change in a follow-up PR. How about the following? I can test correctness on CUDA and may need your help with testing on an AMD machine.
// Warp-level sum via XOR butterfly shuffles (assumes a 32-lane warp).
__device__ __forceinline__ float warpReduceSum(float sum_value) {
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 16);
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 8);
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 4);
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 2);
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 1);
  return sum_value;
}

// Block-level sum: each warp reduces, lane 0 stores its partial in shared
// memory, then the first warp reduces the per-warp partials.
// Assumes blockDim.x is a multiple of WARP_SIZE.
__device__ __forceinline__ float blockReduceSum(float sum_value) {
  static __shared__ float warpLevelSums[WARP_SIZE];
  const int laneId = threadIdx.x % WARP_SIZE;
  const int warpId = threadIdx.x / WARP_SIZE;
  sum_value = warpReduceSum(sum_value);
  if (laneId == 0) warpLevelSums[warpId] = sum_value;
  __syncthreads();
  // 0 is the identity for sum, so lanes beyond the warp count are harmless.
  sum_value = (threadIdx.x < blockDim.x / WARP_SIZE) ? warpLevelSums[laneId] : 0.0f;
  if (warpId == 0) sum_value = warpReduceSum(sum_value);
  return sum_value;
}
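The max counterpart mentioned in the Modifications below would follow the same pattern. A minimal sketch, assuming WARP_SIZE is already defined and using -FLT_MAX as the identity so inactive lanes cannot win the comparison:

#include <cfloat>

__device__ __forceinline__ float warpReduceMax(float max_value) {
  // Butterfly reduction across the warp using XOR shuffles.
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 16));
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 8));
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 4));
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 2));
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 1));
  return max_value;
}

__device__ __forceinline__ float blockReduceMax(float max_value) {
  static __shared__ float warpLevelMaxs[WARP_SIZE];
  const int laneId = threadIdx.x % WARP_SIZE;
  const int warpId = threadIdx.x / WARP_SIZE;
  max_value = warpReduceMax(max_value);
  if (laneId == 0) warpLevelMaxs[warpId] = max_value;
  __syncthreads();
  max_value = (threadIdx.x < blockDim.x / WARP_SIZE) ? warpLevelMaxs[laneId] : -FLT_MAX;
  if (warpId == 0) max_value = warpReduceMax(max_value);
  return max_value;
}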
Also, I recommend using a shfl_xor-based implementation. The old solution from FasterTransformer (later incorporated into TRT-LLM) uses shared memory heavily for reduction:

const float maxElem = BlockReduce(tmpStorage).Reduce(threadData, cub::Max());

With a shfl_xor-based implementation, you can get a better result.
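For comparison, a minimal self-contained sketch of the cub::BlockReduce pattern being replaced; rowMaxKernel and BLOCK_SIZE are hypothetical names for illustration, not from this PR:

#include <cfloat>
#include <cub/cub.cuh>

template <int BLOCK_SIZE>
__global__ void rowMaxKernel(const float* input, float* row_max, int num_cols) {
  using BlockReduce = cub::BlockReduce<float, BLOCK_SIZE>;
  // BlockReduce stages per-thread partials through shared memory,
  // which is the overhead the shfl_xor-based version avoids.
  __shared__ typename BlockReduce::TempStorage tmpStorage;
  const int thread_row_offset = blockIdx.x * num_cols;
  float threadData = -FLT_MAX;
  for (int col = threadIdx.x; col < num_cols; col += BLOCK_SIZE) {
    threadData = fmaxf(threadData, input[thread_row_offset + col]);
  }
  // The reduced value is only valid in thread 0.
  const float maxElem = BlockReduce(tmpStorage).Reduce(threadData, cub::Max());
  if (threadIdx.x == 0) row_max[blockIdx.x] = maxElem;
}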
Sounds great! I'll remove the macro definition and change back to using __shfl_xor_sync directly.
Motivation
#2965
Modifications
- Add the moe topk softmax kernel templated from vllm#4302 (outputs include token_expert_indices).
- Use warpReduceMax/blockReduceMax to handle the AMD use case as well.

Tests
Unit tests and benchmarking aligned with the vllm counterpart.
Checklist