[Refactor] Move FusedMoE hidden_size roundup to quant_method#34285
vllm-bot merged 8 commits into vllm-project:main
Conversation
/gemini review
Code Review
This pull request refactors the logic for rounding up the hidden_size in FusedMoE layers, moving the responsibility from the generic FusedMoE layer to the specific quantization methods. This is a good architectural improvement. My main feedback is about code duplication and a potential bug in QuarkOCP_MX_MoEMethod where the roundup logic is applied unconditionally for gpt_oss models, even for non-MXFP4 quantization types.
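The concern above can be illustrated with a minimal sketch. All names here are hypothetical and do not match vLLM's actual internals; the point is only the pattern: the roundup should be gated on the quantization type rather than applied unconditionally for gpt_oss models.

```python
# Hypothetical sketch of gating the roundup on the quant type.
# Function names, dtype strings, and the alignment value are
# illustrative assumptions, not vLLM's actual code.
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple."""
    return ((x + multiple - 1) // multiple) * multiple

def maybe_round_hidden_size(
    hidden_size: int,
    quant_dtype: str,
    is_gpt_oss: bool,
    alignment: int = 256,  # platform-dependent; the thread leaves 128 vs 256 open
) -> int:
    # Only MXFP4 weights need the padded hidden size; other quant
    # types for gpt_oss models keep the original dimension.
    if is_gpt_oss and quant_dtype == "mxfp4":
        return round_up(hidden_size, alignment)
    return hidden_size
```

With this guard, a gpt_oss model quantized with a non-MXFP4 scheme is left untouched instead of being padded unconditionally.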
Code Review
This pull request refactors the logic for rounding up the hidden size in FusedMoE layers by moving it from the generic layer.py to the specific quant_method implementations. This is a good architectural improvement, as it places quantization-specific logic where it belongs. The changes in fused_moe_method_base.py and layer.py are correct. However, this refactoring has introduced code duplication in mxfp4.py and quark_moe.py for handling gpt_oss models. I've added comments with suggestions to address this.
/gemini review
Code Review
This pull request refactors the hidden_size roundup logic for FusedMoE layers by moving it into the quant_method. This is a good architectural improvement as it localizes quantization-specific logic. The changes are well-structured. I've found one issue where a function is called with incorrect arguments, which I've detailed in a specific comment.
This pull request has merge conflicts that must be resolved before it can be merged.
FYI #32307 might be relevant; I'm not sure what the pad size for gpt-oss on MI300 should be, i.e. 128 or 256. This needs further investigation; I haven't had time to run proper perf unfortunately.
8b8fcbd to 6e4c34c
/gemini review
Code Review
This pull request refactors the hidden_size and intermediate_size rounding logic in the FusedMoE layer by moving it into the quant_method. This is a significant improvement in maintainability as it centralizes quantization-specific alignment requirements (especially for MXFP4 backends) within the quantization methods themselves, rather than having brittle model-type checks in the core layer logic. The changes ensure that both the moe_config and the actual weight tensors are created with consistent, correctly padded dimensions across different hardware platforms and quantization schemes.
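As a rough illustration of the refactor's shape (class and method names below are hypothetical, not vLLM's actual interface): the base quant method exposes a hook that the layer calls before sizing its config and weights, so each quantization method owns its own alignment requirement instead of the layer carrying model-type checks.

```python
# Hypothetical sketch of moving the roundup hook into the quant method;
# all identifiers are illustrative, not vLLM's real API.
class FusedMoEMethodBase:
    def maybe_roundup_hidden_size(self, hidden_size: int) -> int:
        # Default: no padding required.
        return hidden_size

class Mxfp4MoEMethod(FusedMoEMethodBase):
    # Backend-specific alignment; the thread notes 128 vs 256 on MI300
    # is still under investigation, so 128 here is just an assumption.
    ALIGNMENT = 128

    def maybe_roundup_hidden_size(self, hidden_size: int) -> int:
        a = self.ALIGNMENT
        return ((hidden_size + a - 1) // a) * a

class FusedMoELayer:
    def __init__(self, hidden_size: int, quant_method: FusedMoEMethodBase):
        # The layer defers the alignment decision to the quant method,
        # so moe_config and weight tensors see one consistent padded size.
        self.hidden_size = quant_method.maybe_roundup_hidden_size(hidden_size)
```

The design benefit is that adding a new quantization backend with its own alignment needs only an override of the hook, with no changes to the core layer.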
This is a nice simplification. I wonder if we can go even further by making the layer just unaware of the hidden size / intermediate size? WDYT?
I think it should be doable; see #34285 (comment), unless there are other use cases of
Hi @BowenBao, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
daa5f22 to 7a71c20
Hi @BowenBao, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Bowen Bao <bowenbao@amd.com>
On latest commit, MiniMax TP4 GSM8K results:
gpt-oss-120b TP4 GPQA: 65.72%

If there are no more concerns, @tjtanaa could you approve the PR?
Kernels (B200) is a known test failure as of 3/25 nightly: https://buildkite.com/vllm/ci/builds/58103/steps/canvas?sid=019d26cd-7ecd-40ff-b3d4-b07dadd7578b&tab=output
@gshtras done
Update ROCm padding logic to improve performance on MI300X. Thanks for the suggestion and evaluation from @Rohan138. Moved to separate PR: [ROCm] gpt-oss fusion/padding fixes #38043