
MoE Refactor: Refactor fp8.py -> flashinfer_trtllm.py #15151

Merged
ch-wan merged 4 commits into sgl-project:main from bzhng-development:brayden/refactor-fp8-trtllm on Jan 7, 2026

Conversation

@b8zhong (Collaborator) commented Dec 15, 2025

Part of #8715. There have been many recent bugs related to FlashInfer MoE, partly because this file and modelopt_quant.py are getting too complex.


@b8zhong (Collaborator, Author) commented Dec 15, 2025

/tag-and-rerun-ci again

Copilot AI (Contributor) left a comment


Pull request overview

This PR refactors FlashInfer TRT-LLM MoE quantization logic by extracting it from fp8.py into a new dedicated file flashinfer_trtllm.py. The refactoring aims to reduce complexity in the quantization layer files, making the codebase more maintainable and addressing recent bugs related to FlashInfer MoE.

Key changes:

  • Created new flashinfer_trtllm.py module with dedicated functions and dataclasses for FlashInfer TRT-LLM FP8 MoE operations
  • Refactored fp8.py to use the new module instead of inline implementations
  • Updated runner.py to properly handle FlashInfer TRT-LLM backend with validation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Files reviewed:

  • python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py: New file containing the extracted FlashInfer TRT-LLM MoE logic, including weight alignment, a quantization-info dataclass, and the fused kernel function
  • python/sglang/srt/layers/quantization/fp8.py: Refactored to delegate FlashInfer TRT-LLM operations to the new module, replacing ~150 lines of inline code with cleaner function calls
  • python/sglang/srt/layers/moe/moe_runner/runner.py: Added FlashInfer TRT-LLM backend support with proper initialization and validation logic


Comment thread python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py Outdated
Comment thread python/sglang/srt/layers/quantization/fp8.py
@b8zhong b8zhong force-pushed the brayden/refactor-fp8-trtllm branch 3 times, most recently from 1d66121 to c45d53e Compare December 19, 2025 04:24
@ch-wan ch-wan self-assigned this Dec 19, 2025
):
    if moe_runner_backend.is_flashinfer_trtllm():
        # Import to register the fused function
        from sglang.srt.layers.moe.moe_runner.flashinfer_trtllm import (  # noqa: F401
Collaborator


How about moving this import to the beginning of this file? Our current strategy for avoiding circular imports is to delay the import of quant kernels in the MoE runner. Example:

from sglang.srt.layers import deep_gemm_wrapper
from sglang.srt.layers.moe.ep_moe.kernels import (
    silu_and_mul_masked_post_quant_fwd,
)
from sglang.srt.layers.quantization.fp8_kernel import (
    sglang_per_token_group_quant_8bit,
)

Collaborator


As this function is much simpler, we can merge it with the original apply function.

Collaborator Author


Merged this into the apply logic 👍

@b8zhong b8zhong force-pushed the brayden/refactor-fp8-trtllm branch from 6f6fdd5 to 19ec46c Compare January 5, 2026 02:54
@b8zhong b8zhong force-pushed the brayden/refactor-fp8-trtllm branch from 19ec46c to a156750 Compare January 6, 2026 02:33
@ch-wan ch-wan merged commit 24b30f7 into sgl-project:main Jan 7, 2026
304 of 318 checks passed
@b8zhong b8zhong deleted the brayden/refactor-fp8-trtllm branch January 7, 2026 23:41


3 participants