[Refactor] Optimize select_experts#28069

Merged
mgoin merged 13 commits into main from wentao-optimize-select-experts on Nov 19, 2025
Conversation

yewentao256 (Member) commented on Nov 4, 2025:

Purpose

Several small optimizations including:

  • avoid the multiply when routed_scaling_factor == 1.0
  • avoid a duplicate dtype cast when EPLB is enabled
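The first optimization can be sketched in plain Python (lists stand in for the real torch tensors; the function name is illustrative, not the actual vLLM code):

```python
def apply_routed_scaling(topk_weights, routed_scaling_factor):
    # Skip the elementwise multiply entirely when the factor is 1.0:
    # in the real CUDA path this saves a kernel launch for a no-op result.
    if routed_scaling_factor != 1.0:
        topk_weights = [w * routed_scaling_factor for w in topk_weights]
    return topk_weights
```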

Test

I don't see much e2e perf improvement, since these are small optimizations.

Doesn't hurt accuracy:

```
lm_eval --model local-completions \
  --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-R1,num_concurrent=1024" \
  --tasks gsm8k
```

| Tasks | Version | Filter           | n-shot | Metric      |  Value | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.9507 | ±0.006 |
|       |         | strict-match     |      5 | exact_match | 0.9500 | ±0.006 |

Signed-off-by: yewentao256 <zhyanwentao@126.com>
yewentao256 added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Nov 4, 2025

chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +2237 to 2238:

```python
if routed_scaling_factor != 1.0:
    topk_weights *= routed_scaling_factor
```


P1: Guard against None routed_scaling_factor

The new check multiplies `topk_weights` whenever `routed_scaling_factor != 1.0`. Several models set `routed_scaling_factor=None` (e.g. longcat flash) while still using an `e_score_correction_bias`. In that case `None != 1.0` evaluates to True and this branch executes, causing `topk_weights *= None` to raise a TypeError during routing. The previous code only multiplied when the scaling factor was not None. Consider keeping the None guard (e.g. `if routed_scaling_factor and routed_scaling_factor != 1.0:`) so configurations that disable routing scaling continue to run.
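The pitfall Codex describes can be reproduced in isolation (plain Python stand-ins, not the actual vLLM code):

```python
def scale_unguarded(weights, factor):
    # The new check: None != 1.0 is True, so a None factor falls into
    # the multiply branch and `w * None` raises TypeError.
    if factor != 1.0:
        return [w * factor for w in weights]
    return weights

def scale_guarded(weights, factor):
    # Suggested guard: None is falsy, so it skips the multiply.
    if factor and factor != 1.0:
        return [w * factor for w in weights]
    return weights

try:
    scale_unguarded([0.5], None)
    unguarded_raises = False
except TypeError:
    unguarded_raises = True
```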


gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request introduces several small but effective optimizations to the select_experts function. It avoids an unnecessary multiplication when routed_scaling_factor is 1.0, expands the usage of the faster fused_grouped_topk kernel to cases where e_score_correction_bias is not present, and refactors the code to eliminate a duplicate data type cast when expert parallelism load balancing (EPLB) is enabled. These changes are well-implemented and contribute to improving performance and code clarity. The logic appears sound, and I don't see any issues with the proposed changes.

yewentao256 and others added 2 commits November 4, 2025 14:36
Comment on lines +1332 to +1341:

```python
# allow None bias by substituting a zero vector
bias = (
    e_score_correction_bias.to(gating_output.dtype)
    if e_score_correction_bias is not None
    else torch.zeros(
        gating_output.shape[-1],
        device=gating_output.device,
        dtype=gating_output.dtype,
    )
)
```
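The same None-substitution pattern in a minimal, runnable form (Python lists in place of torch tensors; `num_experts` plays the role of `gating_output.shape[-1]`):

```python
def resolve_bias(e_score_correction_bias, num_experts):
    # The fused kernel expects a dense bias vector, so substitute zeros
    # (matching the gating output's last dimension) when no bias is given.
    if e_score_correction_bias is not None:
        return list(e_score_correction_bias)
    return [0.0] * num_experts
```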
mgoin (Member) commented on Nov 10, 2025:

Why don't we just update the kernel to allow for bias to be None, rather than allocating?

yewentao256 (Member, Author) replied on Nov 10, 2025:

Tried that, but it involves a lot of CUDA kernel changes and I don't see much benefit, so I just rolled back the update for e_score_correction_bias. @mgoin

```diff
     renormalize=renormalize,
 )
-if routed_scaling_factor is not None:
+if routed_scaling_factor != 1.0:
```
mgoin (Member) commented:

What do you gain by changing this? It seems there are still other places that set routed_scaling_factor to None, and you could just as well update the models that don't to pass in None instead of 1.0
https://github.com/search?q=repo%3Avllm-project%2Fvllm%20routed_scaling_factor%3DNone&type=code

yewentao256 (Member, Author) replied:

Thanks! Updated this default from None to 1.0.
This is beneficial since we already use 1.0 in a lot of places: https://github.com/search?q=repo%3Avllm-project%2Fvllm+routed_scaling_factor%3D1.0&type=code
And we avoid a useless mul operator by doing this.

Comment on lines +2270 to +2271:

```python
if (indices_type is not None) and topk_ids.dtype != indices_type:
    topk_ids = topk_ids.to(dtype=indices_type)
```
mgoin (Member) commented:

I don't know if this is valid, since other conditionals above use `indices_type`.

yewentao256 (Member, Author) replied:

I removed the parameter from `eplb_map_to_physical_and_record` since it was only used for a `topk_ids.to(dtype=indices_type)` cast, and now we can see the flow more clearly.

The reason for updating here is to avoid a duplicate `topk_ids.to(dtype=indices_type)`. E.g. when using `use_grouped_topk`, we already cast `topk_ids.to(dtype=indices_type)`; then, if EPLB is enabled, we pass `indices_type` into `eplb_map_to_physical_and_record` and cast again.
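The refactor described above can be sketched with a counting stub in place of `Tensor.to(dtype=...)`; the function names follow the thread, everything else is illustrative:

```python
cast_count = 0

def cast(ids, dtype):
    # Stub for topk_ids.to(dtype=indices_type): counts invocations.
    global cast_count
    cast_count += 1
    return [dtype(x) for x in ids]

# Before: the EPLB helper cast internally, then the caller cast again.
def select_experts_before(ids, indices_type, enable_eplb):
    if enable_eplb:
        ids = cast(ids, indices_type)  # inside eplb_map_to_physical_and_record
    return cast(ids, indices_type)     # and once more at the end

# After: the helper no longer takes indices_type; one cast site remains.
def select_experts_after(ids, indices_type, enable_eplb):
    if enable_eplb:
        pass  # eplb_map_to_physical_and_record only maps experts, no cast
    if indices_type is not None:
        ids = cast(ids, indices_type)
    return ids

select_experts_before([1, 2], int, enable_eplb=True)
before_casts = cast_count
cast_count = 0
select_experts_after([1, 2], int, enable_eplb=True)
after_casts = cast_count
```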

update

mgoin (Member) commented:

@yewentao256 It looks like `_eplb_map_to_physical_and_record` needs to be updated as well in this file for the CPU backend

yewentao256 (Member, Author) replied:

fixed, thanks!

mgoin merged commit 5031cd5 into main on Nov 19, 2025 (62 checks passed).
mgoin deleted the wentao-optimize-select-experts branch on November 19, 2025.
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

3 participants