[MoE] Move triton experts to fused_moe/experts/#41976
Conversation
Extract TritonExperts and TritonWNA16Experts from fused_moe.py into a new experts/triton_moe.py module. Update all references across the codebase (source, tests, C++ comment, docs). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Jackmin801 <ongjackm@gmail.com>
…experts Signed-off-by: Jackmin801 <ongjackm@gmail.com> # Conflicts: # vllm/model_executor/layers/fused_moe/__init__.py
Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
…experts Signed-off-by: Jackmin801 <ongjackm@gmail.com> # Conflicts: # vllm/lora/layers/fused_moe.py # vllm/model_executor/layers/fused_moe/fused_moe.py
…perts Signed-off-by: Bill Nell <bnell@redhat.com>
|
Documentation preview: https://vllm--41976.org.readthedocs.build/en/41976/ |
There was a problem hiding this comment.
Code Review
This pull request refactors the MoE implementation by moving the Triton-based expert classes, TritonExperts and TritonWNA16Experts, from fused_moe.py to a new dedicated module, experts/triton_moe.py. All associated imports, tests, and documentation have been updated to reflect this change. Feedback highlights a critical circular dependency introduced in the new module; it is recommended to move shared utility functions to a common file and consolidate Triton-specific kernels within triton_moe.py to ensure a clean dependency graph.
| from vllm.model_executor.layers.fused_moe.fused_moe import ( | ||
| _prepare_expert_assignment, | ||
| invoke_fused_moe_triton_kernel, | ||
| invoke_fused_moe_wna16_triton_kernel, | ||
| try_get_optimal_moe_config, | ||
| ) |
There was a problem hiding this comment.
This import from vllm.model_executor.layers.fused_moe.fused_moe creates a circular dependency at the package level. The vllm.model_executor.layers.fused_moe package's __init__.py imports this file (triton_moe.py), which in turn imports fused_moe.py from the same package. While this might not break immediately due to Python's import caching, it is fragile and can lead to ImportError in the future if dependencies change.
To resolve this, I recommend a more complete refactoring to break the cycle:
-
Move generic helpers: Functions like
_prepare_expert_assignmentandtry_get_optimal_moe_configare used by bothfused_moe.pyandtriton_moe.py. They could be moved to a shared utility file (e.g.,vllm/model_executor/layers/fused_moe/utils.py). -
Centralize Triton code: Move the Triton-specific kernels (
fused_moe_kernel,fused_moe_kernel_gptq_awq) and their invoker functions (invoke_fused_moe_triton_kernel,invoke_fused_moe_wna16_triton_kernel) fromfused_moe.pyinto this file (triton_moe.py). This would consolidate all Triton-related MoE code in one place. -
Update imports: The
fused_experts_implfunction infused_moe.py(which appears to be a legacy entry point) can then import the necessary Triton kernel invokers from this file.
This will result in a cleaner dependency graph where fused_moe.py depends on triton_moe.py, but not vice-versa, thus breaking the circular dependency.
|
Combined into one PR #41979 |
Purpose
Extract TritonExperts and TritonWNA16Experts from fused_moe.py into a new experts/triton_moe.py module. Update all references across the codebase (source, tests, C++ comment, docs).
Forked from #40570
cc: @Jackmin801 , @robertgshaw2-redhat
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.