
[MoE] Move triton experts to fused_moe/experts/ #41976

Closed
bnellnm wants to merge 6 commits into vllm-project:main from neuralmagic:move-triton-moe-to-experts




@bnellnm bnellnm commented May 7, 2026

Collaborator

Purpose

Extract TritonExperts and TritonWNA16Experts from fused_moe.py into a new experts/triton_moe.py module. Update all references across the codebase (source, tests, C++ comment, docs).

Forked from #40570

cc: @Jackmin801 , @robertgshaw2-redhat

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Jackmin801 and others added 6 commits April 22, 2026 02:34
Extract TritonExperts and TritonWNA16Experts from fused_moe.py into a
new experts/triton_moe.py module. Update all references across the
codebase (source, tests, C++ comment, docs).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
…experts

Signed-off-by: Jackmin801 <ongjackm@gmail.com>

# Conflicts:
#	vllm/model_executor/layers/fused_moe/__init__.py
Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
…experts

Signed-off-by: Jackmin801 <ongjackm@gmail.com>

# Conflicts:
#	vllm/lora/layers/fused_moe.py
#	vllm/model_executor/layers/fused_moe/fused_moe.py
…perts

Signed-off-by: Bill Nell <bnell@redhat.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify

mergify Bot commented May 7, 2026

Contributor

Documentation preview: https://vllm--41976.org.readthedocs.build/en/41976/

@mergify mergify Bot added the documentation and nvidia labels May 7, 2026
@bnellnm bnellnm changed the title from "Move triton moe to experts" to "[MoE] Move triton experts to fused_moe/experts/" May 7, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request refactors the MoE implementation by moving the Triton-based expert classes, TritonExperts and TritonWNA16Experts, from fused_moe.py to a new dedicated module, experts/triton_moe.py. All associated imports, tests, and documentation have been updated to reflect this change. Feedback highlights a critical circular dependency introduced in the new module; it is recommended to move shared utility functions to a common file and consolidate Triton-specific kernels within triton_moe.py to ensure a clean dependency graph.

Comment on lines +15 to +20
from vllm.model_executor.layers.fused_moe.fused_moe import (
    _prepare_expert_assignment,
    invoke_fused_moe_triton_kernel,
    invoke_fused_moe_wna16_triton_kernel,
    try_get_optimal_moe_config,
)


critical

This import from vllm.model_executor.layers.fused_moe.fused_moe creates a circular dependency at the package level. The vllm.model_executor.layers.fused_moe package's __init__.py imports this file (triton_moe.py), which in turn imports fused_moe.py from the same package. While this might not break immediately due to Python's import caching, it is fragile and can lead to ImportError in the future if dependencies change.

To resolve this, I recommend a more complete refactoring to break the cycle:

  1. Move generic helpers: Functions like _prepare_expert_assignment and try_get_optimal_moe_config are used by both fused_moe.py and triton_moe.py. They could be moved to a shared utility file (e.g., vllm/model_executor/layers/fused_moe/utils.py).

  2. Centralize Triton code: Move the Triton-specific kernels (fused_moe_kernel, fused_moe_kernel_gptq_awq) and their invoker functions (invoke_fused_moe_triton_kernel, invoke_fused_moe_wna16_triton_kernel) from fused_moe.py into this file (triton_moe.py). This would consolidate all Triton-related MoE code in one place.

  3. Update imports: The fused_experts_impl function in fused_moe.py (which appears to be a legacy entry point) can then import the necessary Triton kernel invokers from this file.

This will result in a cleaner dependency graph where fused_moe.py depends on triton_moe.py, but not vice-versa, thus breaking the circular dependency.
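The failure mode described in this comment can be sketched with a toy package. This is an illustration only, not vLLM code: `toy_moe`, `ToyExperts`, `helper`, `DEFAULT_BLOCK`, and `try_import` are hypothetical names invented for the example. The package `__init__.py` imports `experts.py`, which imports `core.py`, mirroring the `__init__.py` -> `triton_moe.py` -> `fused_moe.py` chain. The import works as long as `core.py` never reaches back into the package, and fails once it does.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def build_toy_package(root: Path, core_source: str) -> None:
    """Write a three-file toy package (hypothetical names) under root."""
    pkg = root / "toy_moe"
    pkg.mkdir()
    # The package __init__ imports the experts module BEFORE defining its own names.
    (pkg / "__init__.py").write_text(
        "from toy_moe.experts import ToyExperts\n"
        "DEFAULT_BLOCK = 64\n"
    )
    # experts.py depends on core.py, like triton_moe.py depends on fused_moe.py.
    (pkg / "experts.py").write_text(
        "from toy_moe.core import helper\n"
        "class ToyExperts:\n"
        "    def run(self):\n"
        "        return helper()\n"
    )
    (pkg / "core.py").write_text(textwrap.dedent(core_source))


def try_import(core_source: str) -> bool:
    """Import toy_moe from a fresh temp dir; True on success, False on ImportError."""
    with tempfile.TemporaryDirectory() as tmp:
        build_toy_package(Path(tmp), core_source)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module("toy_moe")
            return True
        except ImportError:
            return False
        finally:
            # Undo all global state so each call starts from a clean slate.
            sys.path.remove(tmp)
            for name in list(sys.modules):
                if name == "toy_moe" or name.startswith("toy_moe."):
                    del sys.modules[name]


# core.py is self-contained: the chain __init__ -> experts -> core resolves.
SAFE_CORE = """
def helper():
    return 42
"""

# core.py reaches back into the half-initialised package: the cycle bites.
CYCLIC_CORE = """
from toy_moe import DEFAULT_BLOCK
def helper():
    return DEFAULT_BLOCK
"""

safe_result = try_import(SAFE_CORE)
cyclic_result = try_import(CYCLIC_CORE)
print("safe layout imports:", safe_result)
print("cyclic layout imports:", cyclic_result)
```

In the cyclic variant, `core.py` asks the package for a name that `__init__.py` has not defined yet, because `__init__.py` is still waiting on the `experts` import. Keeping shared helpers in a module that imports nothing from the package, as the review suggests, avoids this ordering dependency.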


bnellnm commented May 8, 2026

Collaborator Author

Combined into one PR #41979

@bnellnm bnellnm closed this May 8, 2026
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA May 8, 2026

Labels

documentation, nvidia

Projects

Status: Done


3 participants