refactor(moe): de-duplicate triton MoE runner path into shared helpers #23019
Merged
Phase 1 of the triton MoE de-dup. Makes `pre_permute_standard_to_triton
-> TritonRunnerCore.run -> post_permute_triton_to_standard` compute
bit-identically to `fused_experts_impl` so Phase 2 can extract the
shared core without semantic drift.
- Drop the chunk loop in fused_experts_impl (single-shot over all
tokens); max_block_m is no longer needed.
- Bring the runner path up to parity: filter_expert, down_moe_use_tma
(TMA), enable_fused_moe_sum_all_reduce, non-gated silu/gelu/relu2,
sgl_kernel moe_sum_reduce on CUDA, torch.compile small-token branch
on HIP non-aiter, PyTorch fallbacks when vllm_ops is missing.
- Unify platform flags/imports (get_bool_env_var, get_moe_padding_size,
_has_vllm_ops) with fused_moe.py.
- pre_permute now calls try_get_optimal_moe_config with
return_down_config=True and stashes down_config/down_moe_use_tma in
running_state.
- Preserve LoRA hooks (after_gate_up/after_down) and the
`or hooks` widening of _use_intermediate as the only runner-only
additions.
- intermediate_cache1/2/3 allocated at their original logical sites
(cache2 right before activation, cache3 right before the second
kernel); each `del`'d once fully consumed so the caching allocator
can reuse memory.
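As an aside, a minimal sketch of that allocate-at-use / `del`-after-use lifetime (plain torch matmuls stand in for the Triton kernels; all names here are illustrative, not the runner's real code):

```python
import torch
import torch.nn.functional as F

def two_stage_moe_cache_sketch(hidden_states, w_gate_up, w_down):
    # cache1: gate_up output, written by the first matmul "kernel"
    intermediate_cache1 = hidden_states @ w_gate_up                  # (M, 2 * inter_dim)
    # cache2: allocated at its logical site, right before the activation that consumes cache1
    m, two_inter = intermediate_cache1.shape
    intermediate_cache2 = torch.empty(
        m, two_inter // 2, dtype=hidden_states.dtype, device=hidden_states.device
    )
    gate, up = intermediate_cache1.chunk(2, dim=-1)
    torch.mul(F.silu(gate), up, out=intermediate_cache2)
    del intermediate_cache1          # fully consumed; the caching allocator may reuse it
    # cache3: allocated right before the second matmul "kernel"
    intermediate_cache3 = intermediate_cache2 @ w_down               # (M, hidden_dim)
    del intermediate_cache2
    return intermediate_cache3
```

The point is only the buffer lifetime: each cache is created where it is first written and released as soon as nothing reads it again, so the caching allocator can hand the same memory to the next allocation.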
Verified bit-identical outputs vs fused_experts across silu/gelu,
topk 1/2/4, routed_scaling 1.0/2.5/0.5, {regular, no_combine, inplace},
and at M=70000 (previously spanned 2 chunks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of the triton MoE de-dup. Both MoE entry points now dispatch
through two helpers living in fused_moe.py:
- _prepare_fused_moe_run: resolves padded_size, config_dtype, optimal
config (with down_config + TMA flag), and moe_align_block_size. Used
by fused_experts_impl and pre_permute_standard_to_triton.
- _fused_moe_kernel_sequence: runs the kernel/activation/kernel/combine
sequence over already-aligned inputs. Takes an optional LoRA hooks
object; the second-kernel output selection and the CUDA
`topk==1, routed==1.0` shortcut both use a unified
`_use_intermediate = not no_combine and (topk != 1 or hooks)` guard,
which is a no-op for hooks=None and preserves the runner's widening
when hooks are present.
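As a tiny illustration of that guard (truth table only; the surrounding kernel code is not shown):

```python
def use_intermediate(topk: int, no_combine: bool, hooks=None) -> bool:
    # Unified guard: both the second-kernel write target and the CUDA
    # topk==1, routed_scaling==1.0 shortcut key off this flag. With hooks=None it
    # is exactly the original fused_experts_impl condition; a truthy LoRA hooks
    # object widens it (the runner-only behaviour the sync commit preserved).
    return bool(not no_combine and (topk != 1 or hooks))

# Spot-check hooks=None vs hooks-present behaviour:
assert use_intermediate(topk=1, no_combine=False) is False
assert use_intermediate(topk=1, no_combine=False, hooks=object()) is True
assert use_intermediate(topk=2, no_combine=False) is True
assert use_intermediate(topk=2, no_combine=True) is False
```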
Callers collapse to thin adapters:
- fused_experts_impl: assertions -> _prepare_fused_moe_run ->
_fused_moe_kernel_sequence(hooks=None).
- TritonRunnerCore.run: derive filter_expert ->
_fused_moe_kernel_sequence(hooks=hooks) on pre-aligned runner inputs.
- pre_permute_standard_to_triton: _prepare_fused_moe_run -> stash
config/down_config/down_moe_use_tma in running_state.
triton.py loses all its platform-dispatch plumbing; the entire
CUDA/HIP/XPU/vllm-fallback ladder now lives in one place.
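Sketched with stub helpers, the resulting call structure looks roughly like this (signatures are heavily simplified and the runner-input keys are illustrative, not the real API):

```python
# Stub helpers standing in for the real functions in fused_moe.py.
def _prepare_fused_moe_run(hidden_states, w1, w2, topk_ids):
    # real version: resolve padded_size, config_dtype, optimal config
    # (+ down_config / TMA flag) and run moe_align_block_size
    return {"config": {}, "down_config": {}, "down_moe_use_tma": False}

def _fused_moe_kernel_sequence(hidden_states, w1, w2, prep, *, hooks=None, filter_expert=None):
    # real version: kernel -> activation -> kernel -> combine over aligned inputs
    return hidden_states

def fused_experts_impl_sketch(hidden_states, w1, w2, topk_ids):
    prep = _prepare_fused_moe_run(hidden_states, w1, w2, topk_ids)
    return _fused_moe_kernel_sequence(hidden_states, w1, w2, prep, hooks=None)

class TritonRunnerCoreSketch:
    def run(self, runner_input, hooks):
        # runner inputs arrive pre-aligned: pre_permute_standard_to_triton already
        # called _prepare_fused_moe_run and stashed the configs in running_state
        prep = runner_input["running_state"]
        filter_expert = runner_input.get("filter_expert")   # derived here in the real code
        return _fused_moe_kernel_sequence(
            runner_input["hidden_states"], runner_input["w1"], runner_input["w2"],
            prep, hooks=hooks, filter_expert=filter_expert,
        )
```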
Verified bit-identical outputs vs fused_experts across silu/gelu,
topk 1/2/4, routed_scaling 1.0/2.5/0.5, {regular, no_combine, inplace},
M=70000. GLM-4.5-Air-FP8 GSM8K with SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1
(forces pre_permute->run->post_permute path) scored 0.91 (threshold 0.80).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate the pure-Triton MoE utilities out of `layers/moe/fused_moe_triton/` and into
`layers/moe/moe_runner/triton_utils/`, co-locating them with the runner that actually
drives them. The old package keeps `FusedMoE` / `FusedMoeWeightScaleSupported` /
`fused_marlin_moe` / `layer.py` / `triton_kernels_moe.py` since those are higher-level
or unrelated.
Files moved (git-rename preserved):
- fused_moe.py
- fused_moe_triton_config.py
- fused_moe_triton_kernels.py
- moe_align_block_size.py
`fused_moe_triton/__init__.py` now re-exports the public surface (`fused_experts`,
`moe_align_block_size`, `try_get_optimal_moe_config`, `get_config_file_name`,
`override_config`, `get_config`) from the new location, so existing callers using the
package-level API are unchanged. 24 direct submodule importers (tests, benchmarks,
models, quantization, lora, topk, runner/triton.py, 3rdparty tuning) are updated to
the new path.
`_config`/`override_config`/`get_config` (the context-manager-based config override)
move into the new `triton_utils/__init__.py`; the `fused_moe_triton_config.py`
late-import of `get_config` is repointed accordingly.
Verified: 11-case parity check (bit-identical vs fused_experts) and GLM-4.5-Air-FP8
GSM8K with SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1 (forces pre_permute->run->post_permute
path) scored 0.92.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
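For illustration, the shape of such a re-export shim (the dotted module path below is inferred from the directory layout, not copied from the repo, and the file's real contents may differ):

```python
# layers/moe/fused_moe_triton/__init__.py  (illustrative shim)
from sglang.srt.layers.moe.moe_runner.triton_utils import (
    fused_experts,
    get_config,
    get_config_file_name,
    moe_align_block_size,
    override_config,
    try_get_optimal_moe_config,
)

__all__ = [
    "fused_experts",
    "get_config",
    "get_config_file_name",
    "moe_align_block_size",
    "override_config",
    "try_get_optimal_moe_config",
]
```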
/tag-and-rerun-ci
The previous move left `fused_moe_triton/configs/` behind while `fused_moe_triton_config.py` moved into `moe_runner/triton_utils/`. `get_moe_configs` resolves `config_dir = os.path.dirname(os.path.realpath(__file__))`, so it started looking in the new module's directory and found no tuned kernel configs. Every MoE kernel launch silently fell back to the default block-size config, regressing serving throughput by ~27% on Mixtral-8x7B TP=2 (2980 -> 2185 tok/s in the stage-b-test-2-gpu-large test_moe_offline_throughput_default benchmark). `git mv` the 290-file, 1.3 MB `configs/` tree so it sits next to the resolver that reads it. `SGLANG_MOE_CONFIG_DIR` override still works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
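For context, the lookup pattern at play is roughly the following (an illustrative sketch, not the exact sglang code; the env-var precedence shown here is assumed):

```python
import json
import os

def load_moe_config_sketch(config_file_name: str):
    # The base directory is derived from the loader module's own __file__, so it
    # follows the module when the file moves; configs/ has to move with it.
    config_dir = os.environ.get("SGLANG_MOE_CONFIG_DIR") or os.path.join(
        os.path.dirname(os.path.realpath(__file__)), "configs"
    )
    path = os.path.join(config_dir, config_file_name)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None  # caller falls back to the default block-size config
```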
The runner used to allocate `intermediate_cache1` as 3D `(M, topk, gate_up_dim)` and hand it to the LoRA `after_gate_up` hook, which unpacks `M, top_k, gate_up_dim = intermediate_cache.shape`. The earlier sync commit unified the buffer layout with `fused_experts_impl` and made it 2D `(total_tokens, gate_up_dim)` so that the TMA-padded region can live contiguously at the tail. That flattening wasn't propagated to the hook call, so LoRA-on-MoE runs crashed at the hook with `ValueError: not enough values to unpack (expected 3, got 2)` (surfaced in stage-b-test-2-gpu-large test_moe_lora_tp_logprob_diff on Qwen2-MoE TP=2). Slice off any TMA padding and reshape to the hook's expected 3D shape right at the call site. The view shares storage, so the hook's in-place delta writes still propagate into the 2D backing buffer that the activation kernel reads. Parity unchanged (11 cases bit-identical vs fused_experts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
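A minimal sketch of the reshape at the hook call site (illustrative names and a simplified hook signature; the real hook receives more context):

```python
import torch

def call_after_gate_up_hook(
    intermediate_cache1: torch.Tensor, num_tokens: int, top_k: int, after_gate_up
) -> None:
    # intermediate_cache1 is the 2D (total_rows, gate_up_dim) backing buffer,
    # where total_rows = num_tokens * top_k plus any TMA tail padding.
    gate_up_dim = intermediate_cache1.shape[-1]
    # Drop the padded tail rows, then restore the hook's expected 3D layout.
    # .view() shares storage, so in-place LoRA deltas land in the same buffer
    # the activation kernel reads next.
    hook_view = intermediate_cache1[: num_tokens * top_k].view(num_tokens, top_k, gate_up_dim)
    after_gate_up(hook_view)
```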
Wen-xuan-Xu added a commit to Wen-xuan-Xu/sglang that referenced this pull request on Apr 29, 2026:
After sgl-project#23019 moved the MoE config loader and the configs/ tree from `fused_moe_triton/` to `moe_runner/triton_utils/`, two later PRs unknowingly added 33 tuned-config JSONs to the OLD path:
- sgl-project#22791 (LFM2): 24 files (E=32/64, H100/B200/MI325X)
- sgl-project#23533 (Hy3 preview): 9 files (E=192, N=192 incl. _down, H20/H20-3e/B200)
The runtime loader anchors its search via os.path.dirname(os.path.realpath(__file__)) of the loader file (now in moe_runner/triton_utils/), so configs in the old directory were never read; runtime fell back to get_default_config(). The configs themselves were properly tuned and benchmarked at submission time via the in-process override_config() path used by the tuning script, which is why the PR authors observed real speedup. The bug is purely a wrong filesystem location. Root cause: the tuning README still pointed contributors to the old path.
This PR moves the misplaced configs into the runtime-loaded location and fixes the README. Changes:
* R100 git-mv 33 JSONs into moe_runner/triton_utils/configs/{triton_3_5_1,triton_3_6_0}/
* Update benchmark/kernels/fused_moe_triton/README.md path
No content changes. No code changes.
References: sgl-project#23019 sgl-project#22791 sgl-project#23533
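For context, an in-process override of that kind looks roughly like this (illustrative sketch, not the real sglang implementation):

```python
from contextlib import contextmanager

_config_override = None

@contextmanager
def override_config(config: dict):
    # Process-local override: visible only to code running inside the `with`
    # block, which is how a tuning script can benchmark a config that was never
    # written to the directory the runtime loader scans.
    global _config_override
    prev, _config_override = _config_override, config
    try:
        yield
    finally:
        _config_override = prev

def get_config():
    # Runtime falls back to the file lookup when no override is active.
    return _config_override
```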
Motivation
`TritonRunnerCore.run` (in `moe_runner/triton.py`) and `fused_experts_impl` (in `fused_moe_triton/fused_moe.py`) had grown to ~95% identical logic (same two-kernel + activation + combine pipeline, same platform dispatch ladders, same activation variants), but with subtle divergences: the runner missed `filter_expert`, TMA, `enable_fused_moe_sum_all_reduce`, non-gated activations, the sgl-kernel `moe_sum_reduce` path, and the HIP small-token `torch.compile` branch. Every future kernel change had to be duplicated, with a constant risk of drift, and the LoRA path, the sole live consumer of the runner, was the one getting the stale copy.

This PR reconciles both paths and merges them onto a single shared implementation while preserving every current behavior (LoRA hooks, fused-func fast path, end-to-end model accuracy).
Modifications
Three commits, each independently reviewable:
1. refactor(moe): sync triton runner pipeline with fused_experts_impl (9463977)
   - Drop the chunk loop in `fused_experts_impl` (single-shot over all tokens); `max_block_m` no longer needed.
   - Bring the runner path up to parity with `fused_experts_impl`: `filter_expert`, `down_moe_use_tma` (TMA), `enable_fused_moe_sum_all_reduce`, non-gated `silu`/`gelu`/`relu2`, `sgl_kernel.moe_sum_reduce` on CUDA, HIP small-token `torch.compile` branch, PyTorch fallbacks when `vllm_ops` is missing.
   - Unify platform flags/imports (`get_bool_env_var`, `get_moe_padding_size`, `_has_vllm_ops`) across the two files.
   - `pre_permute_standard_to_triton` now calls `try_get_optimal_moe_config(..., return_down_config=True)` and stashes `down_config`/`down_moe_use_tma` in `running_state`.
   - `intermediate_cache1/2/3` allocated at their original logical sites (cache2 right before activation, cache3 right before the second kernel); each `del`'d once fully consumed so the caching allocator can reuse memory.
   - LoRA hooks (`after_gate_up`/`after_down`) and the `or hooks` widening of `_use_intermediate` are preserved as the only runner-only additions.
2. refactor(moe): extract shared _fused_moe_kernel_sequence helper (3e4b8c5)
   - Both entry points now dispatch through two helpers in `fused_moe.py`:
     - `_prepare_fused_moe_run`: resolves `padded_size`, `config_dtype`, the optimal config (with `down_config` + TMA flag), and `moe_align_block_size`. Used by `fused_experts_impl` and `pre_permute_standard_to_triton`.
     - `_fused_moe_kernel_sequence`: runs the kernel/activation/kernel/combine sequence over already-aligned inputs. Takes an optional LoRA `hooks` object; the second-kernel write target and the CUDA `topk==1, routed==1.0` shortcut both use a unified `_use_intermediate = not no_combine and (topk != 1 or hooks)` guard, which is a no-op for `hooks=None` and preserves the runner's widening when hooks are present.
   - Callers collapse to thin adapters:
     - `fused_experts_impl`: assertions → `_prepare_fused_moe_run` → `_fused_moe_kernel_sequence(hooks=None)`.
     - `TritonRunnerCore.run`: derive `filter_expert` → `_fused_moe_kernel_sequence(hooks=hooks)` on pre-aligned runner inputs.
     - `pre_permute_standard_to_triton`: `_prepare_fused_moe_run` → stash `config`/`down_config`/`down_moe_use_tma` in `running_state`.
   - `moe_runner/triton.py` loses all its platform-dispatch plumbing; the entire CUDA/HIP/XPU/vllm-fallback ladder lives in one place.
3. refactor(moe): move triton util modules under moe_runner/triton_utils (bae9029)
   - Move the pure-Triton utilities out of `layers/moe/fused_moe_triton/` into `layers/moe/moe_runner/triton_utils/`, co-locating them with the runner that drives them. `FusedMoE`/`FusedMoeWeightScaleSupported`/`fused_marlin_moe`/`layer.py`/`triton_kernels_moe.py` stay (higher-level or unrelated).
   - Files moved (git-rename preserved): `fused_moe.py`, `fused_moe_triton_config.py`, `fused_moe_triton_kernels.py`, `moe_align_block_size.py`.
   - `fused_moe_triton/__init__.py` now re-exports the public surface (`fused_experts`, `moe_align_block_size`, `try_get_optimal_moe_config`, `get_config_file_name`, `override_config`, `get_config`) from the new location, so existing callers using the package-level API are unchanged. 24 direct submodule importers (tests, benchmarks, models, quantization, lora, topk, runner/triton.py, 3rdparty tuning) are updated to the new path.
   - `_config`/`override_config`/`get_config` (the context-manager-based config override) move into the new `triton_utils/__init__.py`; the `fused_moe_triton_config.py` late-import of `get_config` is repointed accordingly.

Net effect: `moe_runner/triton.py` went from ~500 LOC of duplicated pipeline to ~240 LOC of pure adapter code.

Accuracy Tests
Unit parity (bit-identical vs `fused_experts`), re-run after each commit:
- silu/gelu, topk 1/2/4, routed_scaling 1.0/2.5/0.5, {regular, no_combine, inplace}.
- `M=70000` exercises what used to be the chunked path (>64K tokens).

GLM-4.5-Air-FP8 GSM8K with `SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1` (forces the `pre_permute → run → post_permute` path that LoRA depends on, 100 examples, TP=2), re-run after each commit (9463977, 3e4b8c5, bae9029). The `test/registered/moe/test_glm4_moe_models.py` threshold is 0.80; prior baseline ~0.85.
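For reference, a bit-identical parity sweep of this shape can be written as below (an illustrative harness; the real check picks 11 specific cases and drives full MoE inputs, and the helper names here are assumptions):

```python
from itertools import product

import torch

def check_parity(ref_fn, new_fn, make_inputs):
    # Bit-identical comparison (torch.equal, not allclose) across the swept cases.
    for act, topk, scale, mode in product(
        ["silu", "gelu"], [1, 2, 4], [1.0, 2.5, 0.5], ["regular", "no_combine", "inplace"]
    ):
        inputs = make_inputs(activation=act, topk=topk, routed_scaling=scale, mode=mode)
        ref, out = ref_fn(**inputs), new_fn(**inputs)
        assert torch.equal(ref, out), f"mismatch: {act=} {topk=} {scale=} {mode=}"
```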
Speed Tests and Profiling

No measurable perf impact expected: the runner path was missing features (TMA, `enable_fused_moe_sum_all_reduce`, sgl-kernel `moe_sum_reduce`, small-token `torch.compile` branch) that are now enabled; the fused-impl path is unchanged except the dormant chunk loop is gone. No dedicated benchmarks run.

Checklist

(`test/registered/moe/test_glm4_moe_models.py` covers the live path; parity check ran locally.)