[Refactor] Optimize select_experts#28069

Merged
mgoin merged 13 commits into main from wentao-optimize-select-experts on Nov 19, 2025
Conversation

yewentao256 (Member) commented on Nov 4, 2025:

Purpose

Several small optimizations including:

  • avoid the multiply when routed_scaling_factor == 1.0
  • avoid a duplicate dtype cast when EPLB is enabled
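The first optimization can be sketched in plain Python (lists stand in for the real torch tensors; the function name is illustrative, not the actual vLLM code):

```python
def apply_routed_scaling(topk_weights, routed_scaling_factor):
    # Skip the elementwise multiply entirely when the factor is 1.0:
    # in the real CUDA path this saves a kernel launch for a no-op result.
    if routed_scaling_factor != 1.0:
        topk_weights = [w * routed_scaling_factor for w in topk_weights]
    return topk_weights
```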

Test

I don't see much e2e perf improvement, since these are small optimizations.

Doesn't hurt accuracy:

```
lm_eval --model local-completions \
  --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-R1,num_concurrent=1024" \
  --tasks gsm8k
```

| Tasks | Version | Filter           | n-shot | Metric      |  Value | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.9507 | ±0.006 |
|       |         | strict-match     |      5 | exact_match | 0.9500 | ±0.006 |

Signed-off-by: yewentao256 <zhyanwentao@126.com>
yewentao256 added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Nov 4, 2025

chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +2237 to 2238:

```python
if routed_scaling_factor != 1.0:
    topk_weights *= routed_scaling_factor
```


P1: Guard against None routed_scaling_factor

The new check multiplies `topk_weights` whenever `routed_scaling_factor != 1.0`. Several models set `routed_scaling_factor=None` (e.g. longcat flash) while still using an `e_score_correction_bias`. In that case `None != 1.0` evaluates to True and this branch executes, causing `topk_weights *= None` to raise a TypeError during routing. The previous code only multiplied when the scaling factor was not None. Consider keeping the None guard (e.g. `if routed_scaling_factor and routed_scaling_factor != 1.0:`) so configurations that disable routing scaling continue to run.
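The pitfall Codex describes can be reproduced in isolation (plain Python stand-ins, not the actual vLLM code):

```python
def scale_unguarded(weights, factor):
    # The new check: None != 1.0 is True, so a None factor falls into
    # the multiply branch and `w * None` raises TypeError.
    if factor != 1.0:
        return [w * factor for w in weights]
    return weights

def scale_guarded(weights, factor):
    # Suggested guard: None is falsy, so it skips the multiply.
    if factor and factor != 1.0:
        return [w * factor for w in weights]
    return weights

try:
    scale_unguarded([0.5], None)
    unguarded_raises = False
except TypeError:
    unguarded_raises = True
```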


gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request introduces several small but effective optimizations to the select_experts function. It avoids an unnecessary multiplication when routed_scaling_factor is 1.0, expands the usage of the faster fused_grouped_topk kernel to cases where e_score_correction_bias is not present, and refactors the code to eliminate a duplicate data type cast when expert parallelism load balancing (EPLB) is enabled. These changes are well-implemented and contribute to improving performance and code clarity. The logic appears sound, and I don't see any issues with the proposed changes.

yewentao256 and others added 2 commits November 4, 2025 14:36
Comment on lines +1332 to +1341:

```python
# allow None bias by substituting a zero vector
bias = (
    e_score_correction_bias.to(gating_output.dtype)
    if e_score_correction_bias is not None
    else torch.zeros(
        gating_output.shape[-1],
        device=gating_output.device,
        dtype=gating_output.dtype,
    )
)
```
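The same None-substitution pattern in a minimal, runnable form (Python lists in place of torch tensors; `num_experts` plays the role of `gating_output.shape[-1]`):

```python
def resolve_bias(e_score_correction_bias, num_experts):
    # The fused kernel expects a dense bias vector, so substitute zeros
    # (matching the gating output's last dimension) when no bias is given.
    if e_score_correction_bias is not None:
        return list(e_score_correction_bias)
    return [0.0] * num_experts
```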
mgoin (Member) commented on Nov 10, 2025:

Why don't we just update the kernel to allow for bias to be None, rather than allocating?

yewentao256 (Member, Author) replied on Nov 10, 2025:

Tried that, but it involves a lot of CUDA kernel changes and I don't see much benefit, so I just rolled back the update for e_score_correction_bias. @mgoin

```diff
     renormalize=renormalize,
 )
-if routed_scaling_factor is not None:
+if routed_scaling_factor != 1.0:
```
mgoin (Member) commented:

What do you gain by changing this? It seems there are still other places that set routed_scaling_factor to None, and you could just as well update the models that don't to pass in None instead of 1.0
https://github.com/search?q=repo%3Avllm-project%2Fvllm%20routed_scaling_factor%3DNone&type=code

yewentao256 (Member, Author) replied:

Thanks! Updated this default from None to 1.0.
This is beneficial since we already use 1.0 in a lot of places: https://github.com/search?q=repo%3Avllm-project%2Fvllm+routed_scaling_factor%3D1.0&type=code
And we avoid a useless mul operator by doing this.

Comment on lines +2270 to +2271:

```python
if (indices_type is not None) and topk_ids.dtype != indices_type:
    topk_ids = topk_ids.to(dtype=indices_type)
```
mgoin (Member) commented:

I don't know if this is valid, since other conditionals above use `indices_type`.

yewentao256 (Member, Author) replied:

I removed the parameter from `eplb_map_to_physical_and_record` since it was only used for a `topk_ids.to(dtype=indices_type)` cast, and now we can see the flow more clearly.

The reason for updating here is to avoid a duplicate `topk_ids.to(dtype=indices_type)`. E.g. when using `use_grouped_topk`, we already cast `topk_ids.to(dtype=indices_type)`; then, if EPLB is enabled, we pass `indices_type` into `eplb_map_to_physical_and_record` and cast again.
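The refactor described above can be sketched with a counting stub in place of `Tensor.to(dtype=...)`; the function names follow the thread, everything else is illustrative:

```python
cast_count = 0

def cast(ids, dtype):
    # Stub for topk_ids.to(dtype=indices_type): counts invocations.
    global cast_count
    cast_count += 1
    return [dtype(x) for x in ids]

# Before: the EPLB helper cast internally, then the caller cast again.
def select_experts_before(ids, indices_type, enable_eplb):
    if enable_eplb:
        ids = cast(ids, indices_type)  # inside eplb_map_to_physical_and_record
    return cast(ids, indices_type)     # and once more at the end

# After: the helper no longer takes indices_type; one cast site remains.
def select_experts_after(ids, indices_type, enable_eplb):
    if enable_eplb:
        pass  # eplb_map_to_physical_and_record only maps experts, no cast
    if indices_type is not None:
        ids = cast(ids, indices_type)
    return ids

select_experts_before([1, 2], int, enable_eplb=True)
before_casts = cast_count
cast_count = 0
select_experts_after([1, 2], int, enable_eplb=True)
after_casts = cast_count
```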

update

mgoin (Member) commented:

@yewentao256 It looks like `_eplb_map_to_physical_and_record` needs to be updated as well in this file for the CPU backend

yewentao256 (Member, Author) replied:

fixed, thanks!

mgoin merged commit 5031cd5 into main on Nov 19, 2025 (62 checks passed).
mgoin deleted the wentao-optimize-select-experts branch on November 19, 2025.
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

3 participants