[MoE] Migrate W4A8 CT to Oracle Structure #39197
Move kernel selection and weight conversion logic for W4A8 FP8 MoE into a dedicated oracle module, matching the pattern established by the FP8 and NvFP4 oracles. This centralizes backend selection, weight format conversion, quant config creation, and modular kernel construction in oracle/w4a8.py.
- Create vllm/model_executor/layers/fused_moe/oracle/w4a8.py with select_w4a8_moe_backend, convert_to_w4a8_moe_kernel_format, make_w4a8_moe_quant_config, and make_w4a8_moe_kernel
- Update CutlassExpertsW4A8Fp8 to compute strides from moe_config and implement _supports_* methods for oracle compatibility
- Refactor CompressedTensorsW4A8Fp8MoEMethod to use oracle functions and delegate to moe_kernel.apply()
Co-authored-by: Claude
https://claude.ai/code/session_017178oZ2UoCasfwjjB3zmdR
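The four-step flow named in this commit message (select a backend, convert weights, build a quant config, construct a kernel) can be sketched roughly as follows. This is a toy illustration only: every class, function, field, and backend name below is hypothetical, and none of the signatures are taken from vllm's oracle/w4a8.py.

```python
from dataclasses import dataclass
from enum import Enum


class ToyBackend(Enum):
    # Hypothetical backend names; the PR text does not list the real options.
    CUTLASS = "cutlass"
    UNSUPPORTED = "unsupported"


@dataclass(frozen=True)
class ToyMoEConfig:
    # Hypothetical config fields, chosen only for illustration.
    num_experts: int
    hidden_dim: int
    intermediate_size: int
    cutlass_available: bool = True


def select_backend(cfg: ToyMoEConfig) -> ToyBackend:
    """Step 1: choose a backend from the config alone."""
    if cfg.cutlass_available:
        return ToyBackend.CUTLASS
    return ToyBackend.UNSUPPORTED


def convert_weights(backend: ToyBackend, weights: dict) -> dict:
    """Step 2: convert loaded weights to the backend's layout (here, just a tag)."""
    if backend is not ToyBackend.CUTLASS:
        raise NotImplementedError(f"no conversion for {backend}")
    return {name: ("cutlass_layout", w) for name, w in weights.items()}


def make_quant_config(converted: dict) -> dict:
    """Step 3: describe the quantization scheme for the converted tensors."""
    return {
        "weight_bits": 4,
        "activation_dtype": "fp8",
        "tensors": sorted(converted),
    }


def make_kernel(backend: ToyBackend, quant_config: dict):
    """Step 4: build a callable that stands in for the modular kernel."""

    def apply(x: str) -> str:
        return f"{backend.value}:{quant_config['weight_bits']}bit:{x}"

    return apply
```

With this shape, a quantization method class only needs to call the four helpers in order and then delegate to the returned kernel, which is the division of labor the commit message describes.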
Hi @robertgshaw2-redhat, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Code Review
This pull request refactors the W4A8 MoE implementation to use a modular kernel architecture. It introduces a new oracle module for W4A8 to centralize weight conversion, reordering for CUTLASS, and kernel configuration. The CutlassExpertsW4A8Fp8 class now computes strides internally and implements various support check methods. CompressedTensorsW4A8Fp8MoEMethod has been updated to leverage these modular components, simplifying its weight processing and application logic. Feedback suggests removing the maybe_make_prepare_finalize method instead of raising a ValueError to maintain a cleaner interface.
raise ValueError(
    f"{self.__class__.__name__} uses the new modular kernel initialization "
    "logic. This function should not be called."
)
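The review summary notes that CutlassExpertsW4A8Fp8 now computes strides internally rather than receiving them as arguments. A minimal sketch of that idea, assuming plain row-major contiguous layouts, is shown here. The class names, the field names, and the (E, 2*I, H) and (E, H, I) weight shapes are all assumptions made for illustration; the stride layout the real CUTLASS kernel expects is not shown in this PR text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMoEDims:
    # Hypothetical stand-in for the dimensions carried by moe_config.
    num_experts: int
    hidden_dim: int
    intermediate_size: int


def contiguous_strides(shape: tuple) -> tuple:
    """Row-major strides, in elements, for a densely packed tensor of this shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


class ToyExperts:
    """Derives strides from config dimensions instead of taking them as parameters."""

    def __init__(self, dims: ToyMoEDims):
        # Assumption: w13 stacks two projections along the output dimension.
        self.w13_shape = (dims.num_experts, 2 * dims.intermediate_size, dims.hidden_dim)
        self.w2_shape = (dims.num_experts, dims.hidden_dim, dims.intermediate_size)
        self.w13_strides = contiguous_strides(self.w13_shape)
        self.w2_strides = contiguous_strides(self.w2_shape)
```

Deriving strides inside the constructor means callers only pass the config, so the strides cannot drift out of sync with the dimensions they describe.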
Pre-register w13_weight_chan_scale and w2_weight_chan_scale on the layer in create_weights so all parameter registrations live on the layer. The oracle's convert_to_w4a8_moe_kernel_format now uses replace_parameter to update them after load-time computation.
Co-authored-by: Claude
https://claude.ai/code/session_017178oZ2UoCasfwjjB3zmdR
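The register-then-replace pattern described in this commit can be sketched with a toy layer. Everything here is a hypothetical stand-in written in plain Python: ToyLayer is not torch.nn.Module, and replace_parameter_toy only mimics the role the commit message attributes to replace_parameter, not its actual signature.

```python
class ToyLayer:
    """Minimal stand-in for a layer that tracks named parameters."""

    def __init__(self):
        self._params = {}

    def register_parameter(self, name, value):
        if name in self._params:
            raise KeyError(f"{name} is already registered")
        self._params[name] = value

    def get(self, name):
        return self._params[name]

    def names(self):
        return list(self._params)


def replace_parameter_toy(layer, name, new_value):
    """Swap the value of an existing parameter; unknown names are an error."""
    if name not in layer.names():
        raise KeyError(f"{name} was never registered")
    layer._params[name] = new_value


def create_weights(layer, num_experts):
    # Register placeholders up front so every parameter name exists on the layer.
    layer.register_parameter("w13_weight_chan_scale", [0.0] * num_experts)
    layer.register_parameter("w2_weight_chan_scale", [0.0] * num_experts)


def convert_after_load(layer, computed_scales):
    # After loading, overwrite the placeholders with the computed values.
    for name, value in computed_scales.items():
        replace_parameter_toy(layer, name, value)
```

Because the conversion step can only replace names that already exist, the set of parameters on the layer is fixed at creation time and the later step cannot silently add new ones.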
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
This PR refactors the W4A8 MoE quantization method to use the new modular kernel oracle pattern, improving code organization and maintainability. The changes extract W4A8-specific logic into a dedicated oracle module (w4a8.py) and simplify the main quantization method class by delegating kernel construction and weight processing to reusable helper functions.
Key improvements:
- Adds the oracle module vllm/model_executor/layers/fused_moe/oracle/w4a8.py
- Simplifies CompressedTensorsW4A8Fp8MoEMethod by removing low-level kernel construction details
- Updates CutlassExpertsW4A8Fp8 to compute strides internally from config dimensions instead of accepting them as parameters
Test Plan
Existing unit tests for W4A8 MoE quantization should pass. The refactoring maintains functional equivalence while reorganizing code structure. CI tests will verify:
Test Result
N/A - This is a refactoring that maintains functional equivalence. Existing test coverage validates the changes.
https://claude.ai/code/session_017178oZ2UoCasfwjjB3zmdR