refactor(moe): de-duplicate triton MoE runner path into shared helpers #23019
Merged
Phase 1 of the triton MoE de-dup. Makes `pre_permute_standard_to_triton
-> TritonRunnerCore.run -> post_permute_triton_to_standard` compute
bit-identically to `fused_experts_impl` so Phase 2 can extract the
shared core without semantic drift.
- Drop the chunk loop in fused_experts_impl (single-shot over all
tokens); max_block_m is no longer needed.
- Bring the runner path up to parity: filter_expert, down_moe_use_tma
(TMA), enable_fused_moe_sum_all_reduce, non-gated silu/gelu/relu2,
sgl_kernel moe_sum_reduce on CUDA, torch.compile small-token branch
on HIP non-aiter, PyTorch fallbacks when vllm_ops is missing.
- Unify platform flags/imports (get_bool_env_var, get_moe_padding_size,
_has_vllm_ops) with fused_moe.py.
- pre_permute now calls try_get_optimal_moe_config with
return_down_config=True and stashes down_config/down_moe_use_tma in
running_state.
- Preserve LoRA hooks (after_gate_up/after_down) and the
`or hooks` widening of _use_intermediate as the only runner-only
additions.
- intermediate_cache1/2/3 allocated at their original logical sites
(cache2 right before activation, cache3 right before the second
kernel); each `del`'d once fully consumed so the caching allocator
can reuse memory.
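As an aside, a minimal sketch of that allocate-at-use / `del`-after-use lifetime (plain torch matmuls stand in for the Triton kernels; all names here are illustrative, not the runner's real code):

```python
import torch
import torch.nn.functional as F

def two_stage_moe_cache_sketch(hidden_states, w_gate_up, w_down):
    # cache1: gate_up output, written by the first matmul "kernel"
    intermediate_cache1 = hidden_states @ w_gate_up                  # (M, 2 * inter_dim)
    # cache2: allocated at its logical site, right before the activation that consumes cache1
    m, two_inter = intermediate_cache1.shape
    intermediate_cache2 = torch.empty(
        m, two_inter // 2, dtype=hidden_states.dtype, device=hidden_states.device
    )
    gate, up = intermediate_cache1.chunk(2, dim=-1)
    torch.mul(F.silu(gate), up, out=intermediate_cache2)
    del intermediate_cache1          # fully consumed; the caching allocator may reuse it
    # cache3: allocated right before the second matmul "kernel"
    intermediate_cache3 = intermediate_cache2 @ w_down               # (M, hidden_dim)
    del intermediate_cache2
    return intermediate_cache3
```

The point is only the buffer lifetime: each cache is created where it is first written and released as soon as nothing reads it again, so the caching allocator can hand the same memory to the next allocation.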
Verified bit-identical outputs vs fused_experts across silu/gelu,
topk 1/2/4, routed_scaling 1.0/2.5/0.5, {regular, no_combine, inplace},
and at M=70000 (previously spanned 2 chunks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of the triton MoE de-dup. Both MoE entry points now dispatch
through two helpers living in fused_moe.py:
- _prepare_fused_moe_run: resolves padded_size, config_dtype, optimal
config (with down_config + TMA flag), and moe_align_block_size. Used
by fused_experts_impl and pre_permute_standard_to_triton.
- _fused_moe_kernel_sequence: runs the kernel/activation/kernel/combine
sequence over already-aligned inputs. Takes an optional LoRA hooks
object; the second-kernel output selection and the CUDA
`topk==1, routed==1.0` shortcut both use a unified
`_use_intermediate = not no_combine and (topk != 1 or hooks)` guard,
which is a no-op for hooks=None and preserves the runner's widening
when hooks are present.
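As a tiny illustration of that guard (truth table only; the surrounding kernel code is not shown):

```python
def use_intermediate(topk: int, no_combine: bool, hooks=None) -> bool:
    # Unified guard: both the second-kernel write target and the CUDA
    # topk==1, routed_scaling==1.0 shortcut key off this flag. With hooks=None it
    # is exactly the original fused_experts_impl condition; a truthy LoRA hooks
    # object widens it (the runner-only behaviour the sync commit preserved).
    return bool(not no_combine and (topk != 1 or hooks))

# Spot-check hooks=None vs hooks-present behaviour:
assert use_intermediate(topk=1, no_combine=False) is False
assert use_intermediate(topk=1, no_combine=False, hooks=object()) is True
assert use_intermediate(topk=2, no_combine=False) is True
assert use_intermediate(topk=2, no_combine=True) is False
```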
Callers collapse to thin adapters:
- fused_experts_impl: assertions -> _prepare_fused_moe_run ->
_fused_moe_kernel_sequence(hooks=None).
- TritonRunnerCore.run: derive filter_expert ->
_fused_moe_kernel_sequence(hooks=hooks) on pre-aligned runner inputs.
- pre_permute_standard_to_triton: _prepare_fused_moe_run -> stash
config/down_config/down_moe_use_tma in running_state.
triton.py loses all its platform-dispatch plumbing; the entire
CUDA/HIP/XPU/vllm-fallback ladder now lives in one place.
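Sketched with stub helpers, the resulting call structure looks roughly like this (signatures are heavily simplified and the runner-input keys are illustrative, not the real API):

```python
# Stub helpers standing in for the real functions in fused_moe.py.
def _prepare_fused_moe_run(hidden_states, w1, w2, topk_ids):
    # real version: resolve padded_size, config_dtype, optimal config
    # (+ down_config / TMA flag) and run moe_align_block_size
    return {"config": {}, "down_config": {}, "down_moe_use_tma": False}

def _fused_moe_kernel_sequence(hidden_states, w1, w2, prep, *, hooks=None, filter_expert=None):
    # real version: kernel -> activation -> kernel -> combine over aligned inputs
    return hidden_states

def fused_experts_impl_sketch(hidden_states, w1, w2, topk_ids):
    prep = _prepare_fused_moe_run(hidden_states, w1, w2, topk_ids)
    return _fused_moe_kernel_sequence(hidden_states, w1, w2, prep, hooks=None)

class TritonRunnerCoreSketch:
    def run(self, runner_input, hooks):
        # runner inputs arrive pre-aligned: pre_permute_standard_to_triton already
        # called _prepare_fused_moe_run and stashed the configs in running_state
        prep = runner_input["running_state"]
        filter_expert = runner_input.get("filter_expert")   # derived here in the real code
        return _fused_moe_kernel_sequence(
            runner_input["hidden_states"], runner_input["w1"], runner_input["w2"],
            prep, hooks=hooks, filter_expert=filter_expert,
        )
```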
Verified bit-identical outputs vs fused_experts across silu/gelu,
topk 1/2/4, routed_scaling 1.0/2.5/0.5, {regular, no_combine, inplace},
M=70000. GLM-4.5-Air-FP8 GSM8K with SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1
(forces pre_permute->run->post_permute path) scored 0.91 (threshold 0.80).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate the pure-Triton MoE utilities out of `layers/moe/fused_moe_triton/` and into
`layers/moe/moe_runner/triton_utils/`, co-locating them with the runner that actually
drives them. The old package keeps `FusedMoE` / `FusedMoeWeightScaleSupported` /
`fused_marlin_moe` / `layer.py` / `triton_kernels_moe.py` since those are higher-level
or unrelated.
Files moved (git-rename preserved):
- fused_moe.py
- fused_moe_triton_config.py
- fused_moe_triton_kernels.py
- moe_align_block_size.py
`fused_moe_triton/__init__.py` now re-exports the public surface (`fused_experts`,
`moe_align_block_size`, `try_get_optimal_moe_config`, `get_config_file_name`,
`override_config`, `get_config`) from the new location, so existing callers using the
package-level API are unchanged. 24 direct submodule importers (tests, benchmarks,
models, quantization, lora, topk, runner/triton.py, 3rdparty tuning) are updated to
the new path.
`_config`/`override_config`/`get_config` (the context-manager-based config override)
move into the new `triton_utils/__init__.py`; the `fused_moe_triton_config.py`
late-import of `get_config` is repointed accordingly.
Verified: 11-case parity check (bit-identical vs fused_experts) and GLM-4.5-Air-FP8
GSM8K with SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1 (forces pre_permute->run->post_permute
path) scored 0.92.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
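For illustration, the shape of such a re-export shim (the dotted module path below is inferred from the directory layout, not copied from the repo, and the file's real contents may differ):

```python
# layers/moe/fused_moe_triton/__init__.py  (illustrative shim)
from sglang.srt.layers.moe.moe_runner.triton_utils import (
    fused_experts,
    get_config,
    get_config_file_name,
    moe_align_block_size,
    override_config,
    try_get_optimal_moe_config,
)

__all__ = [
    "fused_experts",
    "get_config",
    "get_config_file_name",
    "moe_align_block_size",
    "override_config",
    "try_get_optimal_moe_config",
]
```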
/tag-and-rerun-ci
The previous move left `fused_moe_triton/configs/` behind while `fused_moe_triton_config.py` moved into `moe_runner/triton_utils/`. `get_moe_configs` resolves `config_dir = os.path.dirname(os.path.realpath(__file__))`, so it started looking in the new module's directory and found no tuned kernel configs. Every MoE kernel launch silently fell back to the default block-size config, regressing serving throughput by ~27% on Mixtral-8x7B TP=2 (2980 -> 2185 tok/s in the stage-b-test-2-gpu-large test_moe_offline_throughput_default benchmark). `git mv` the 290-file, 1.3 MB `configs/` tree so it sits next to the resolver that reads it. `SGLANG_MOE_CONFIG_DIR` override still works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
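For context, the lookup pattern at play is roughly the following (an illustrative sketch, not the exact sglang code; the env-var precedence shown here is assumed):

```python
import json
import os

def load_moe_config_sketch(config_file_name: str):
    # The base directory is derived from the loader module's own __file__, so it
    # follows the module when the file moves; configs/ has to move with it.
    config_dir = os.environ.get("SGLANG_MOE_CONFIG_DIR") or os.path.join(
        os.path.dirname(os.path.realpath(__file__)), "configs"
    )
    path = os.path.join(config_dir, config_file_name)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None  # caller falls back to the default block-size config
```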
The runner used to allocate `intermediate_cache1` as 3D `(M, topk, gate_up_dim)` and hand it to the LoRA `after_gate_up` hook, which unpacks `M, top_k, gate_up_dim = intermediate_cache.shape`. The earlier sync commit unified the buffer layout with `fused_experts_impl` and made it 2D `(total_tokens, gate_up_dim)` so that the TMA-padded region can live contiguously at the tail. That flattening wasn't propagated to the hook call, so LoRA-on-MoE runs crashed at the hook with `ValueError: not enough values to unpack (expected 3, got 2)` (surfaced in stage-b-test-2-gpu-large test_moe_lora_tp_logprob_diff on Qwen2-MoE TP=2). Slice off any TMA padding and reshape to the hook's expected 3D shape right at the call site. The view shares storage, so the hook's in-place delta writes still propagate into the 2D backing buffer that the activation kernel reads. Parity unchanged (11 cases bit-identical vs fused_experts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
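A minimal sketch of the reshape at the hook call site (illustrative names and a simplified hook signature; the real hook receives more context):

```python
import torch

def call_after_gate_up_hook(
    intermediate_cache1: torch.Tensor, num_tokens: int, top_k: int, after_gate_up
) -> None:
    # intermediate_cache1 is the 2D (total_rows, gate_up_dim) backing buffer,
    # where total_rows = num_tokens * top_k plus any TMA tail padding.
    gate_up_dim = intermediate_cache1.shape[-1]
    # Drop the padded tail rows, then restore the hook's expected 3D layout.
    # .view() shares storage, so in-place LoRA deltas land in the same buffer
    # the activation kernel reads next.
    hook_view = intermediate_cache1[: num_tokens * top_k].view(num_tokens, top_k, gate_up_dim)
    after_gate_up(hook_view)
```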
Wen-xuan-Xu added a commit to Wen-xuan-Xu/sglang that referenced this pull request on Apr 29, 2026:
After sgl-project#23019 moved the MoE config loader and the configs/ tree from `fused_moe_triton/` to `moe_runner/triton_utils/`, two later PRs unknowingly added 33 tuned-config JSONs to the OLD path:
- sgl-project#22791 (LFM2): 24 files (E=32/64, H100/B200/MI325X)
- sgl-project#23533 (Hy3 preview): 9 files (E=192, N=192 incl. _down, H20/H20-3e/B200)
The runtime loader anchors its search via os.path.dirname(os.path.realpath(__file__)) of the loader file (now in moe_runner/triton_utils/), so configs in the old directory were never read; runtime fell back to get_default_config(). The configs themselves were properly tuned and benchmarked at submission time via the in-process override_config() path used by the tuning script, which is why the PR authors observed real speedup. The bug is purely a wrong filesystem location. Root cause: the tuning README still pointed contributors to the old path.
This PR moves the misplaced configs into the runtime-loaded location and fixes the README. Changes:
* R100 git-mv 33 JSONs into moe_runner/triton_utils/configs/{triton_3_5_1,triton_3_6_0}/
* Update benchmark/kernels/fused_moe_triton/README.md path
No content changes. No code changes.
References: sgl-project#23019 sgl-project#22791 sgl-project#23533
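For context, an in-process override of that kind looks roughly like this (illustrative sketch, not the real sglang implementation):

```python
from contextlib import contextmanager

_config_override = None

@contextmanager
def override_config(config: dict):
    # Process-local override: visible only to code running inside the `with`
    # block, which is how a tuning script can benchmark a config that was never
    # written to the directory the runtime loader scans.
    global _config_override
    prev, _config_override = _config_override, config
    try:
        yield
    finally:
        _config_override = prev

def get_config():
    # Runtime falls back to the file lookup when no override is active.
    return _config_override
```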
Motivation
`TritonRunnerCore.run` (in `moe_runner/triton.py`) and `fused_experts_impl` (in `fused_moe_triton/fused_moe.py`) had grown to ~95% identical logic (same two-kernel + activation + combine pipeline, same platform dispatch ladders, same activation variants), but with subtle divergences: the runner missed `filter_expert`, TMA, `enable_fused_moe_sum_all_reduce`, non-gated activations, the sgl-kernel `moe_sum_reduce` path, and the HIP small-token `torch.compile` branch. Every future kernel change had to be duplicated, with a constant risk of drift, and the LoRA path, the sole live consumer of the runner, was the one getting the stale copy.

This PR reconciles both paths and merges them onto a single shared implementation while preserving every current behavior (LoRA hooks, fused-func fast path, end-to-end model accuracy).
Modifications
Three commits, each independently reviewable:
1. refactor(moe): sync triton runner pipeline with fused_experts_impl (9463977)
   - Drop the chunk loop in `fused_experts_impl` (single-shot over all tokens); `max_block_m` no longer needed.
   - Bring the runner path up to parity with `fused_experts_impl`: `filter_expert`, `down_moe_use_tma` (TMA), `enable_fused_moe_sum_all_reduce`, non-gated `silu`/`gelu`/`relu2`, `sgl_kernel.moe_sum_reduce` on CUDA, HIP small-token `torch.compile` branch, PyTorch fallbacks when `vllm_ops` is missing.
   - Unify platform flags/imports (`get_bool_env_var`, `get_moe_padding_size`, `_has_vllm_ops`) across the two files.
   - `pre_permute_standard_to_triton` now calls `try_get_optimal_moe_config(..., return_down_config=True)` and stashes `down_config`/`down_moe_use_tma` in `running_state`.
   - `intermediate_cache1/2/3` allocated at their original logical sites (cache2 right before activation, cache3 right before the second kernel); each `del`'d once fully consumed so the caching allocator can reuse memory.
   - LoRA hooks (`after_gate_up`/`after_down`) and the `or hooks` widening of `_use_intermediate` are preserved as the only runner-only additions.
2. refactor(moe): extract shared _fused_moe_kernel_sequence helper (3e4b8c5)
   - Both entry points now dispatch through two helpers in `fused_moe.py`:
     - `_prepare_fused_moe_run`: resolves `padded_size`, `config_dtype`, the optimal config (with `down_config` + TMA flag), and `moe_align_block_size`. Used by `fused_experts_impl` and `pre_permute_standard_to_triton`.
     - `_fused_moe_kernel_sequence`: runs the kernel/activation/kernel/combine sequence over already-aligned inputs. Takes an optional LoRA `hooks` object; the second-kernel write target and the CUDA `topk==1, routed==1.0` shortcut both use a unified `_use_intermediate = not no_combine and (topk != 1 or hooks)` guard, which is a no-op for `hooks=None` and preserves the runner's widening when hooks are present.
   - Callers collapse to thin adapters:
     - `fused_experts_impl`: assertions → `_prepare_fused_moe_run` → `_fused_moe_kernel_sequence(hooks=None)`.
     - `TritonRunnerCore.run`: derive `filter_expert` → `_fused_moe_kernel_sequence(hooks=hooks)` on pre-aligned runner inputs.
     - `pre_permute_standard_to_triton`: `_prepare_fused_moe_run` → stash `config`/`down_config`/`down_moe_use_tma` in `running_state`.
   - `moe_runner/triton.py` loses all its platform-dispatch plumbing; the entire CUDA/HIP/XPU/vllm-fallback ladder lives in one place.
3. refactor(moe): move triton util modules under moe_runner/triton_utils (bae9029)
   - Move the pure-Triton utilities out of `layers/moe/fused_moe_triton/` into `layers/moe/moe_runner/triton_utils/`, co-locating them with the runner that drives them. `FusedMoE`/`FusedMoeWeightScaleSupported`/`fused_marlin_moe`/`layer.py`/`triton_kernels_moe.py` stay (higher-level or unrelated).
   - Files moved (git-rename preserved): `fused_moe.py`, `fused_moe_triton_config.py`, `fused_moe_triton_kernels.py`, `moe_align_block_size.py`.
   - `fused_moe_triton/__init__.py` now re-exports the public surface (`fused_experts`, `moe_align_block_size`, `try_get_optimal_moe_config`, `get_config_file_name`, `override_config`, `get_config`) from the new location, so existing callers using the package-level API are unchanged. 24 direct submodule importers (tests, benchmarks, models, quantization, lora, topk, runner/triton.py, 3rdparty tuning) are updated to the new path.
   - `_config`/`override_config`/`get_config` (the context-manager-based config override) move into the new `triton_utils/__init__.py`; the `fused_moe_triton_config.py` late-import of `get_config` is repointed accordingly.

Net effect: `moe_runner/triton.py` went from ~500 LOC of duplicated pipeline to ~240 LOC of pure adapter code.

Accuracy Tests
Unit parity (bit-identical vs `fused_experts`), re-run after each commit:
- silu/gelu, topk 1/2/4, routed_scaling 1.0/2.5/0.5, {regular, no_combine, inplace}.
- `M=70000` exercises what used to be the chunked path (>64K tokens).

GLM-4.5-Air-FP8 GSM8K with `SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1` (forces the `pre_permute → run → post_permute` path that LoRA depends on, 100 examples, TP=2), re-run after each commit (9463977, 3e4b8c5, bae9029). The `test/registered/moe/test_glm4_moe_models.py` threshold is 0.80; prior baseline ~0.85.
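For reference, a bit-identical parity sweep of this shape can be written as below (an illustrative harness; the real check picks 11 specific cases and drives full MoE inputs, and the helper names here are assumptions):

```python
from itertools import product

import torch

def check_parity(ref_fn, new_fn, make_inputs):
    # Bit-identical comparison (torch.equal, not allclose) across the swept cases.
    for act, topk, scale, mode in product(
        ["silu", "gelu"], [1, 2, 4], [1.0, 2.5, 0.5], ["regular", "no_combine", "inplace"]
    ):
        inputs = make_inputs(activation=act, topk=topk, routed_scaling=scale, mode=mode)
        ref, out = ref_fn(**inputs), new_fn(**inputs)
        assert torch.equal(ref, out), f"mismatch: {act=} {topk=} {scale=} {mode=}"
```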
Speed Tests and Profiling

No measurable perf impact expected: the runner path was missing features (TMA, `enable_fused_moe_sum_all_reduce`, sgl-kernel `moe_sum_reduce`, small-token `torch.compile` branch) that are now enabled; the fused-impl path is unchanged except the dormant chunk loop is gone. No dedicated benchmarks run.

Checklist

(`test/registered/moe/test_glm4_moe_models.py` covers the live path; parity check ran locally.)