MoE Refactor: Refactor fp8.py -> flashinfer_trtllm.py #15151

ch-wan merged 4 commits into sgl-project:main

Conversation
Pull request overview
This PR refactors FlashInfer TRT-LLM MoE quantization logic by extracting it from fp8.py into a new dedicated file flashinfer_trtllm.py. The refactoring aims to reduce complexity in the quantization layer files, making the codebase more maintainable and addressing recent bugs related to FlashInfer MoE.
Key changes:
- Created a new `flashinfer_trtllm.py` module with dedicated functions and dataclasses for FlashInfer TRT-LLM FP8 MoE operations
- Refactored `fp8.py` to use the new module instead of inline implementations
- Updated `runner.py` to properly handle the FlashInfer TRT-LLM backend with validation
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py` | New file containing the extracted FlashInfer TRT-LLM MoE logic, including weight alignment, a quantization info dataclass, and the fused kernel function |
| `python/sglang/srt/layers/quantization/fp8.py` | Refactored to delegate FlashInfer TRT-LLM operations to the new module, replacing ~150 lines of inline code with cleaner function calls |
| `python/sglang/srt/layers/moe/moe_runner/runner.py` | Added FlashInfer TRT-LLM backend support with proper initialization and validation logic |
```python
):
    if moe_runner_backend.is_flashinfer_trtllm():
        # Import to register the fused function
        from sglang.srt.layers.moe.moe_runner.flashinfer_trtllm import (  # noqa: F401
```
How about moving this import to the beginning of this file? Our current strategy for avoiding circular imports is to delay imports of quant kernels inside the moe runner. Example: `sglang/python/sglang/srt/layers/moe/moe_runner/deep_gemm.py`, lines 215 to 221 at 5e1a495.
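The delayed-import strategy referenced here can be sketched generically: importing the kernel module inside the method body, rather than at the top of the file, both breaks the import cycle between the runner and the quantization layer and triggers the registration side effect only when that backend is used. This is a minimal stand-in, not the actual `deep_gemm.py` code; the stdlib module `json` stands in for the real kernel module so the sketch runs anywhere.

```python
import importlib


class MoeRunnerSketch:
    """Illustrative runner showing the delayed-import pattern."""

    def __init__(self, backend: str):
        self.backend = backend

    def dispatch(self):
        if self.backend == "flashinfer_trtllm":
            # Delayed import: performed at call time, not module load time,
            # so the quant-kernel module can itself import the runner without
            # creating a circular import. Importing it also runs its
            # module-level registration code as a side effect.
            mod = importlib.import_module("json")  # stand-in module name
            return mod.__name__
        return None


print(MoeRunnerSketch("flashinfer_trtllm").dispatch())  # -> json
```

The trade-off is that import errors surface at first dispatch rather than at startup, which is why the `# noqa: F401` marker is needed on the real import.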
As this function is much simpler, we can merge it with the original apply function.
Merged this into apply logic 👍
Part of #8715. There have been many bugs recently related to FlashInfer MoE, partly because this file and `modelopt_quant.py` are getting too complex.