Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if routed_scaling_factor != 1.0:
    topk_weights *= routed_scaling_factor
```
Guard against None routed_scaling_factor
The new check multiplies topk_weights whenever routed_scaling_factor != 1.0. Several models set routed_scaling_factor=None (e.g. longcat flash) while still using an e_score_correction_bias. In that case None != 1.0 evaluates to True and this branch executes, causing topk_weights *= None to raise a TypeError during routing. The previous code only multiplied when the scaling factor was not None. Consider keeping the None guard (e.g. if routed_scaling_factor and routed_scaling_factor != 1.0:) so configurations that disable routing scaling continue to run.
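The suggested guard can be sketched in plain Python (`apply_routing_scale` is a hypothetical helper name; the real code scales a `torch.Tensor` in place):

```python
def apply_routing_scale(topk_weights, routed_scaling_factor):
    """Scale routing weights, treating both None and 1.0 as no-ops.

    `topk_weights` is a plain list here purely for illustration; in the
    real code it is a torch.Tensor.
    """
    # Check None first: `None != 1.0` evaluates to True, so a bare
    # `!= 1.0` check would attempt `weight * None` and raise TypeError.
    if routed_scaling_factor is not None and routed_scaling_factor != 1.0:
        topk_weights = [w * routed_scaling_factor for w in topk_weights]
    return topk_weights

print(apply_routing_scale([0.5, 0.25], 2.0))   # scaled: [1.0, 0.5]
print(apply_routing_scale([0.5, 0.25], None))  # unchanged, no TypeError
print(apply_routing_scale([0.5, 0.25], 1.0))   # unchanged, mul skipped
```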
Code Review
This pull request introduces several small but effective optimizations to the select_experts function. It avoids an unnecessary multiplication when routed_scaling_factor is 1.0, expands the usage of the faster fused_grouped_topk kernel to cases where e_score_correction_bias is not present, and refactors the code to eliminate a duplicate data type cast when expert parallelism load balancing (EPLB) is enabled. These changes are well-implemented and contribute to improving performance and code clarity. The logic appears sound, and I don't see any issues with the proposed changes.
Signed-off-by: yewentao256 <zhyanwentao@126.com>
```python
# allow None bias by substituting a zero vector
bias = (
    e_score_correction_bias.to(gating_output.dtype)
    if e_score_correction_bias is not None
    else torch.zeros(
        gating_output.shape[-1],
        device=gating_output.device,
        dtype=gating_output.dtype,
    )
)
```
Why don't we just update the kernel to allow for bias to be None, rather than allocating?
Tried that, but it would involve a lot of CUDA kernel changes and I don't see much benefit, so I just rolled back the update for e_score_correction_bias @mgoin
```diff
     renormalize=renormalize,
 )
-if routed_scaling_factor is not None:
+if routed_scaling_factor != 1.0:
```
What do you gain by changing this? It seems there are still other places that set routed_scaling_factor to None, and you could just as well update the models that don't to pass in None instead of 1.0
https://github.com/search?q=repo%3Avllm-project%2Fvllm%20routed_scaling_factor%3DNone&type=code
Thanks! Updated the default from None to 1.0.
This is beneficial since 1.0 is already used in many other places: https://github.com/search?q=repo%3Avllm-project%2Fvllm+routed_scaling_factor%3D1.0&type=code
And it lets us avoid a useless mul operation.
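The resulting pattern — defaulting to `1.0` instead of `Optional[float] = None` — could look like this (hypothetical simplified signature, plain lists instead of tensors):

```python
def scale_topk_weights(topk_weights, routed_scaling_factor: float = 1.0):
    # With 1.0 as the default, a single comparison both replaces the
    # old None check and skips the useless elementwise multiply in the
    # common unscaled case.
    if routed_scaling_factor != 1.0:
        topk_weights = [w * routed_scaling_factor for w in topk_weights]
    return topk_weights

print(scale_topk_weights([0.5, 0.25]))       # default 1.0: no multiply
print(scale_topk_weights([0.5, 0.25], 4.0))  # [2.0, 1.0]
```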
```python
if (indices_type is not None) and topk_ids.dtype != indices_type:
    topk_ids = topk_ids.to(dtype=indices_type)
```
I don't know if this is valid, since other conditionals above also use indices_type...
I removed the param from eplb_map_to_physical_and_record since it was only used for a topk_ids.to(dtype=indices_type) cast, and now we can see the flow more clearly.
The reason for the update here is to avoid a duplicate topk_ids.to(dtype=indices_type). E.g. when using use_grouped_topk we already cast topk_ids to indices_type; if EPLB is enabled and we then pass indices_type to eplb_map_to_physical_and_record, we would cast again.
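The deduplication can be sketched with a tiny stand-in tensor class that counts casts (`FakeTensor` and `route_ids` are illustrative names, not vLLM APIs):

```python
class FakeTensor:
    """Minimal stand-in for torch.Tensor that records dtype casts."""

    def __init__(self, dtype, casts=0):
        self.dtype = dtype
        self.casts = casts

    def to(self, dtype):
        return FakeTensor(dtype, self.casts + 1)


def route_ids(topk_ids, indices_type):
    # Cast once, at the end, and only when a target dtype was requested
    # and the ids are not already in it -- instead of casting inside the
    # topk helper and again inside the EPLB mapping path.
    if indices_type is not None and topk_ids.dtype != indices_type:
        topk_ids = topk_ids.to(dtype=indices_type)
    return topk_ids


ids = FakeTensor("int64")
print(route_ids(ids, "int32").casts)  # 1 cast
print(route_ids(ids, "int64").casts)  # 0 casts: already the right dtype
print(route_ids(ids, None).casts)     # 0 casts: no target requested
```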
Signed-off-by: yewentao256 <zhyanwentao@126.com>
update Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 It looks like _eplb_map_to_physical_and_record needs to be updated as well in this file for the cpu backend
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Several small optimizations, including:
- Skip the scaling multiply when `routed_scaling_factor == 1`

Test
I don't see much e2e perf improvement, since these are small optimizations.
Doesn't hurt accuracy: