MoE Refactor: Refactor fp8.py -> flashinfer_trtllm.py #15151

ch-wan merged 4 commits into sgl-project:main

Conversation
Pull request overview
This PR refactors FlashInfer TRT-LLM MoE quantization logic by extracting it from fp8.py into a new dedicated file flashinfer_trtllm.py. The refactoring aims to reduce complexity in the quantization layer files, making the codebase more maintainable and addressing recent bugs related to FlashInfer MoE.
Key changes:
- Created a new `flashinfer_trtllm.py` module with dedicated functions and dataclasses for FlashInfer TRT-LLM FP8 MoE operations
- Refactored `fp8.py` to use the new module instead of inline implementations
- Updated `runner.py` to properly handle the FlashInfer TRT-LLM backend with validation
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py` | New file containing the extracted FlashInfer TRT-LLM MoE logic, including weight alignment, a quantization info dataclass, and the fused kernel function |
| `python/sglang/srt/layers/quantization/fp8.py` | Refactored to delegate FlashInfer TRT-LLM operations to the new module, replacing ~150 lines of inline code with cleaner function calls |
| `python/sglang/srt/layers/moe/moe_runner/runner.py` | Added FlashInfer TRT-LLM backend support with proper initialization and validation logic |
```python
):
    if moe_runner_backend.is_flashinfer_trtllm():
        # Import to register the fused function
        from sglang.srt.layers.moe.moe_runner.flashinfer_trtllm import (  # noqa: F401
```
How about moving this import to the beginning of this file? Our current strategy for avoiding circular imports is to delay imports of quant kernels inside the moe runner. Example: `sglang/python/sglang/srt/layers/moe/moe_runner/deep_gemm.py`, lines 215 to 221 at 5e1a495.
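The delayed-import strategy referenced here can be sketched generically: importing the kernel module inside the method body, rather than at the top of the file, both breaks the import cycle between the runner and the quantization layer and triggers the registration side effect only when that backend is used. This is a minimal stand-in, not the actual `deep_gemm.py` code; the stdlib module `json` stands in for the real kernel module so the sketch runs anywhere.

```python
import importlib


class MoeRunnerSketch:
    """Illustrative runner showing the delayed-import pattern."""

    def __init__(self, backend: str):
        self.backend = backend

    def dispatch(self):
        if self.backend == "flashinfer_trtllm":
            # Delayed import: performed at call time, not module load time,
            # so the quant-kernel module can itself import the runner without
            # creating a circular import. Importing it also runs its
            # module-level registration code as a side effect.
            mod = importlib.import_module("json")  # stand-in module name
            return mod.__name__
        return None


print(MoeRunnerSketch("flashinfer_trtllm").dispatch())  # -> json
```

The trade-off is that import errors surface at first dispatch rather than at startup, which is why the `# noqa: F401` marker is needed on the real import.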
As this function is much simpler, we can merge it with the original apply function.
Merged this into apply logic 👍
Part of #8715. There have been many bugs recently related to FlashInfer MoE, partly because this file and `modelopt_quant.py` are getting too complex.