
refactor(moe): de-duplicate triton MoE runner path into shared helpers#23019

Merged
ch-wan merged 5 commits into main from cheng/refactor/moe-triton
Apr 18, 2026

Conversation

ch-wan (Collaborator) commented Apr 17, 2026

Motivation

TritonRunnerCore.run (in moe_runner/triton.py) and fused_experts_impl (in fused_moe_triton/fused_moe.py) had grown to ~95% identical logic — same two-kernel + activation + combine pipeline, same platform dispatch ladders, same activation variants — but with subtle divergences (runner missed filter_expert, TMA, enable_fused_moe_sum_all_reduce, non-gated activations, the sgl-kernel moe_sum_reduce path, and the HIP small-token torch.compile branch). Every future kernel change had to be duplicated, with a constant risk of drift — and the LoRA path, which is the sole live consumer of the runner, was the one getting the stale copy.

This PR reconciles both paths and merges them onto a single shared implementation while preserving every current behavior (LoRA hooks, fused-func fast path, end-to-end model accuracy).

Modifications

Three commits, each independently reviewable:

  1. refactor(moe): sync triton runner pipeline with fused_experts_impl (9463977)

    • Drop the 64K-token chunk loop in fused_experts_impl (single-shot over all tokens); max_block_m is no longer needed.
    • Bring the runner path up to parity with fused_experts_impl: filter_expert, down_moe_use_tma (TMA), enable_fused_moe_sum_all_reduce, non-gated silu/gelu/relu2, sgl_kernel.moe_sum_reduce on CUDA, HIP small-token torch.compile branch, PyTorch fallbacks when vllm_ops is missing.
    • Unify platform flags/imports (get_bool_env_var, get_moe_padding_size, _has_vllm_ops) across the two files.
    • pre_permute_standard_to_triton now calls try_get_optimal_moe_config(..., return_down_config=True) and stashes down_config/down_moe_use_tma in running_state.
    • intermediate_cache1/2/3 allocated at their original logical sites (cache2 right before activation, cache3 right before the second kernel); each del'd once fully consumed so the caching allocator can reuse memory.
    • LoRA hooks (after_gate_up/after_down) and the "or hooks" widening of _use_intermediate are preserved as the only runner-only additions.
  2. refactor(moe): extract shared _fused_moe_kernel_sequence helper (3e4b8c5)

    • Extract two helpers into fused_moe.py:
      • _prepare_fused_moe_run: resolves padded_size, config_dtype, optimal config (with down_config + TMA flag), and moe_align_block_size. Used by fused_experts_impl and pre_permute_standard_to_triton.
      • _fused_moe_kernel_sequence: runs the kernel/activation/kernel/combine sequence over already-aligned inputs. Takes an optional LoRA hooks object; the second-kernel write target and the CUDA topk==1, routed==1.0 shortcut both use a unified _use_intermediate = not no_combine and (topk != 1 or hooks) guard, which is a no-op for hooks=None and preserves the runner's widening when hooks are present.
    • Callers collapse to thin adapters:
      • fused_experts_impl: assertions → _prepare_fused_moe_run → _fused_moe_kernel_sequence(hooks=None).
      • TritonRunnerCore.run: derive filter_expert → _fused_moe_kernel_sequence(hooks=hooks) on pre-aligned runner inputs (a sketch of this adapter structure follows below).
      • pre_permute_standard_to_triton: _prepare_fused_moe_run → stash config/down_config/down_moe_use_tma in running_state.
    • moe_runner/triton.py loses all its platform-dispatch plumbing; the entire CUDA/HIP/XPU/vllm-fallback ladder lives in one place.
  3. refactor(moe): move triton util modules under moe_runner/triton_utils (bae9029)

    • Relocate the pure-Triton MoE utilities out of layers/moe/fused_moe_triton/ into layers/moe/moe_runner/triton_utils/, co-locating them with the runner that drives them. FusedMoE / FusedMoeWeightScaleSupported / fused_marlin_moe / layer.py / triton_kernels_moe.py stay (higher-level or unrelated).
    • Files moved (git-rename preserved): fused_moe.py, fused_moe_triton_config.py, fused_moe_triton_kernels.py, moe_align_block_size.py.
    • fused_moe_triton/__init__.py now re-exports the public surface (fused_experts, moe_align_block_size, try_get_optimal_moe_config, get_config_file_name, override_config, get_config) from the new location, so existing callers using the package-level API are unchanged. 24 direct submodule importers (tests, benchmarks, models, quantization, lora, topk, runner/triton.py, 3rdparty tuning) are updated to the new path.
    • _config / override_config / get_config (the context-manager-based config override) move into the new triton_utils/__init__.py; the fused_moe_triton_config.py late-import of get_config is repointed accordingly.

Net effect: moe_runner/triton.py went from ~500 LOC of duplicated pipeline to ~240 LOC of pure adapter code.
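
For orientation, here is a minimal sketch of the adapter structure described in commit 2. Function names follow the PR; the signatures, argument handling, and `...` bodies are simplified placeholders, not the actual sglang code:

```python
def _prepare_fused_moe_run(hidden_states, w1, w2, topk_ids):
    """Resolve padded_size, config_dtype, the optimal kernel config
    (including down_config and the TMA flag) and run moe_align_block_size;
    return everything the kernel sequence needs."""
    ...


def _fused_moe_kernel_sequence(prepared, *, no_combine, topk, hooks=None):
    """gate_up kernel -> activation -> down kernel -> combine, over
    already-aligned inputs; `hooks` carries the optional LoRA callbacks."""
    # Unified guard: with hooks=None this reduces to the original fused-path
    # condition; with LoRA hooks present it widens to keep the intermediate.
    _use_intermediate = not no_combine and (topk != 1 or hooks is not None)
    ...


def fused_experts_impl(hidden_states, w1, w2, topk_weights, topk_ids, **kwargs):
    # Thin adapter: assertions -> prepare -> kernel sequence (no hooks).
    prepared = _prepare_fused_moe_run(hidden_states, w1, w2, topk_ids)
    return _fused_moe_kernel_sequence(
        prepared,
        no_combine=kwargs.get("no_combine", False),
        topk=topk_ids.shape[1],
        hooks=None,
    )
```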

Accuracy Tests

Unit parity (bit-identical vs fused_experts), re-run after each commit:

[OK] M=32 topk=2 act=silu                diff=0.00e+00
[OK] M=32 topk=1 act=silu                diff=0.00e+00
[OK] M=32 topk=4 act=silu                diff=0.00e+00
[OK] M=64 topk=2 act=silu routed=2.5     diff=0.00e+00
[OK] M=8  topk=2 act=silu                diff=0.00e+00
[OK] M=32 topk=2 act=gelu                diff=0.00e+00
[OK] M=32 topk=2 act=silu no_combine     diff=0.00e+00
[OK] M=32 topk=2 act=silu inplace        diff=0.00e+00
[OK] M=128 topk=4 act=silu               diff=0.00e+00
[OK] M=128 topk=4 act=silu routed=0.5    diff=0.00e+00
[OK] M=70000 topk=2 act=silu             diff=0.00e+00

M=70000 exercises what used to be the chunked path (>64K tokens).
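
The parity lines above reflect a bit-identity comparison of the two paths on identical inputs. A minimal sketch of that style of harness (the two callables and the input dict are placeholders, not the actual test):

```python
import torch

def check_parity(reference_fn, candidate_fn, inputs: dict, label: str) -> None:
    """Run both MoE implementations on the same inputs and require
    bit-identical outputs (max abs diff must be exactly 0)."""
    ref = reference_fn(**inputs)
    out = candidate_fn(**inputs)
    diff = (ref.float() - out.float()).abs().max().item()
    status = "OK" if torch.equal(ref, out) else "FAIL"
    print(f"[{status}] {label:<36} diff={diff:.2e}")
```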

GLM-4.5-Air-FP8 GSM8K with SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1 (forces the pre_permute → run → post_permute path that LoRA depends on, 100 examples, TP=2):

| Commit | GSM8K score |
| --- | --- |
| Sync (9463977) | 0.92 |
| De-dup (3e4b8c5) | 0.91 |
| Move (bae9029) | 0.92 |

test/registered/moe/test_glm4_moe_models.py threshold is 0.80; prior baseline ~0.85.

Speed Tests and Profiling

No measurable perf regression is expected: the runner path gains features it previously lacked (TMA, enable_fused_moe_sum_all_reduce, sgl-kernel moe_sum_reduce, the HIP small-token torch.compile branch), and the fused-impl path is unchanged except that the dormant chunk loop is gone. No dedicated benchmarks were run.

Checklist

ch-wan and others added 3 commits April 17, 2026 01:02
Phase 1 of the triton MoE de-dup. Makes `pre_permute_standard_to_triton
-> TritonRunnerCore.run -> post_permute_triton_to_standard` compute
bit-identically to `fused_experts_impl` so Phase 2 can extract the
shared core without semantic drift.

- Drop the chunk loop in fused_experts_impl (single-shot over all
  tokens); max_block_m is no longer needed.
- Bring the runner path up to parity: filter_expert, down_moe_use_tma
  (TMA), enable_fused_moe_sum_all_reduce, non-gated silu/gelu/relu2,
  sgl_kernel moe_sum_reduce on CUDA, torch.compile small-token branch
  on HIP non-aiter, PyTorch fallbacks when vllm_ops is missing.
- Unify platform flags/imports (get_bool_env_var, get_moe_padding_size,
  _has_vllm_ops) with fused_moe.py.
- pre_permute now calls try_get_optimal_moe_config with
  return_down_config=True and stashes down_config/down_moe_use_tma in
  running_state.
- Preserve LoRA hooks (after_gate_up/after_down) and the
  `or hooks` widening of _use_intermediate as the only runner-only
  additions.
- intermediate_cache1/2/3 allocated at their original logical sites
  (cache2 right before activation, cache3 right before the second
  kernel); each `del`'d once fully consumed so the caching allocator
  can reuse memory.

Verified bit-identical outputs vs fused_experts across silu/gelu,
topk 1/2/4, routed_scaling 1.0/2.5/0.5, {regular, no_combine, inplace},
and at M=70000 (previously spanned 2 chunks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
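
The buffer-lifetime discipline described above follows roughly this pattern (a sketch; the kernel/activation/combine callables are placeholders for the real Triton launches):

```python
import torch

def moe_cache_lifetime_sketch(x, gate_up_dim, hidden_dim,
                              gate_up_kernel, activation, down_kernel, combine):
    """Allocate each intermediate cache at its logical site and del it once
    fully consumed, so the caching allocator can reuse the memory."""
    n = x.shape[0]

    # cache1: filled by the first (gate_up) kernel
    intermediate_cache1 = torch.empty((n, gate_up_dim), device=x.device, dtype=x.dtype)
    gate_up_kernel(x, intermediate_cache1)

    # cache2: allocated right before the activation; cache1 is consumed here
    intermediate_cache2 = torch.empty((n, gate_up_dim // 2), device=x.device, dtype=x.dtype)
    activation(intermediate_cache1, intermediate_cache2)
    del intermediate_cache1

    # cache3: allocated right before the second (down) kernel
    intermediate_cache3 = torch.empty((n, hidden_dim), device=x.device, dtype=x.dtype)
    down_kernel(intermediate_cache2, intermediate_cache3)
    del intermediate_cache2

    return combine(intermediate_cache3)
```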
Phase 2 of the triton MoE de-dup. Both MoE entry points now dispatch
through two helpers living in fused_moe.py:

- _prepare_fused_moe_run: resolves padded_size, config_dtype, optimal
  config (with down_config + TMA flag), and moe_align_block_size. Used
  by fused_experts_impl and pre_permute_standard_to_triton.
- _fused_moe_kernel_sequence: runs the kernel/activation/kernel/combine
  sequence over already-aligned inputs. Takes an optional LoRA hooks
  object; the second-kernel output selection and the CUDA
  `topk==1, routed==1.0` shortcut both use a unified
  `_use_intermediate = not no_combine and (topk != 1 or hooks)` guard,
  which is a no-op for hooks=None and preserves the runner's widening
  when hooks are present.

Callers collapse to thin adapters:

- fused_experts_impl: assertions -> _prepare_fused_moe_run ->
  _fused_moe_kernel_sequence(hooks=None).
- TritonRunnerCore.run: derive filter_expert ->
  _fused_moe_kernel_sequence(hooks=hooks) on pre-aligned runner inputs.
- pre_permute_standard_to_triton: _prepare_fused_moe_run -> stash
  config/down_config/down_moe_use_tma in running_state.

triton.py loses all its platform-dispatch plumbing; the entire
CUDA/HIP/XPU/vllm-fallback ladder now lives in one place.

Verified bit-identical outputs vs fused_experts across silu/gelu,
topk 1/2/4, routed_scaling 1.0/2.5/0.5, {regular, no_combine, inplace},
M=70000. GLM-4.5-Air-FP8 GSM8K with SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1
(forces pre_permute->run->post_permute path) scored 0.91 (threshold 0.80).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate the pure-Triton MoE utilities out of `layers/moe/fused_moe_triton/`
and into `layers/moe/moe_runner/triton_utils/`, co-locating them with the
runner that actually drives them. The old package keeps `FusedMoE` /
`FusedMoeWeightScaleSupported` / `fused_marlin_moe` / `layer.py` /
`triton_kernels_moe.py` since those are higher-level or unrelated.

Files moved (git-rename preserved):

- fused_moe.py
- fused_moe_triton_config.py
- fused_moe_triton_kernels.py
- moe_align_block_size.py

`fused_moe_triton/__init__.py` now re-exports the public surface
(`fused_experts`, `moe_align_block_size`, `try_get_optimal_moe_config`,
`get_config_file_name`, `override_config`, `get_config`) from the new
location, so existing callers using the package-level API are unchanged.
24 direct submodule importers (tests, benchmarks, models, quantization,
lora, topk, runner/triton.py, 3rdparty tuning) are updated to the new
path.

`_config`/`override_config`/`get_config` (the context-manager-based
config override) moves into the new `triton_utils/__init__.py`; the
`fused_moe_triton_config.py` late-import of `get_config` is repointed
accordingly.

Verified: 11-case parity check (bit-identical vs fused_experts) and
GLM-4.5-Air-FP8 GSM8K with SGLANG_CI_DISABLE_MOE_FUSED_FUNC=1 (forces
pre_permute->run->post_permute path) scored 0.92.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
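
The re-export shim in the old package amounts to something like the following; the sglang module path is illustrative, not copied from the tree:

```python
# layers/moe/fused_moe_triton/__init__.py -- sketch of the compatibility shim.
# The old package-level API keeps working while the implementations now live
# under moe_runner/triton_utils.
from sglang.srt.layers.moe.moe_runner.triton_utils import (  # illustrative path
    fused_experts,
    get_config,
    get_config_file_name,
    moe_align_block_size,
    override_config,
    try_get_optimal_moe_config,
)

__all__ = [
    "fused_experts",
    "moe_align_block_size",
    "try_get_optimal_moe_config",
    "get_config_file_name",
    "override_config",
    "get_config",
]
```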


ch-wan commented Apr 17, 2026

/tag-and-rerun-ci

The previous move left `fused_moe_triton/configs/` behind while
`fused_moe_triton_config.py` moved into `moe_runner/triton_utils/`.
`get_moe_configs` resolves `config_dir = os.path.dirname(os.path.realpath(__file__))`,
so it started looking in the new module's directory and found no tuned
kernel configs. Every MoE kernel launch silently fell back to the
default block-size config, regressing serving throughput by ~27% on
Mixtral-8x7B TP=2 (2980 -> 2185 tok/s in the
stage-b-test-2-gpu-large test_moe_offline_throughput_default benchmark).

`git mv` the 290-file, 1.3 MB `configs/` tree so it sits next to the
resolver that reads it. `SGLANG_MOE_CONFIG_DIR` override still works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
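
The regression hinges on how the loader anchors its search path. A sketch of that resolution logic, assuming the SGLANG_MOE_CONFIG_DIR override takes precedence (the function names here are illustrative):

```python
import json
import os

def _moe_config_dir() -> str:
    # Env override wins; otherwise configs are looked up next to the loader
    # module itself, which is why configs/ must travel with the file that
    # defines get_moe_configs.
    override = os.environ.get("SGLANG_MOE_CONFIG_DIR")
    if override:
        return override
    return os.path.join(os.path.dirname(os.path.realpath(__file__)), "configs")

def load_tuned_config(config_file_name: str):
    path = os.path.join(_moe_config_dir(), config_file_name)
    if not os.path.isfile(path):
        return None  # caller falls back to get_default_config()
    with open(path) as f:
        return json.load(f)
```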
github-actions bot added the documentation and quant labels Apr 17, 2026
The runner used to allocate `intermediate_cache1` as 3D
`(M, topk, gate_up_dim)` and hand it to the LoRA `after_gate_up` hook,
which unpacks `M, top_k, gate_up_dim = intermediate_cache.shape`. The
earlier sync commit unified the buffer layout with `fused_experts_impl`
and made it 2D `(total_tokens, gate_up_dim)` so that the TMA-padded
region can live contiguously at the tail. That flattening wasn't
propagated to the hook call, so LoRA-on-MoE runs crashed at the hook
with `ValueError: not enough values to unpack (expected 3, got 2)`
(surfaced in stage-b-test-2-gpu-large test_moe_lora_tp_logprob_diff on
Qwen2-MoE TP=2).

Slice off any TMA padding and reshape to the hook's expected 3D shape
right at the call site. The view shares storage, so the hook's
in-place delta writes still propagate into the 2D backing buffer that
the activation kernel reads.

Parity unchanged (11 cases bit-identical vs fused_experts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
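
The fix reduces to a storage-sharing view at the call site; a sketch (names follow the commit, the surrounding plumbing is simplified):

```python
import torch

def run_after_gate_up_hook(hooks, intermediate_cache1: torch.Tensor,
                           num_tokens: int, top_k: int, gate_up_dim: int) -> None:
    # The backing buffer is 2D (total_padded_tokens, gate_up_dim); drop any
    # TMA-padding rows at the tail and hand the hook the 3D
    # (M, top_k, gate_up_dim) layout it unpacks. view() shares storage, so the
    # hook's in-place delta writes land in the buffer the activation reads next.
    hook_view = intermediate_cache1[: num_tokens * top_k].view(num_tokens, top_k, gate_up_dim)
    hooks.after_gate_up(hook_view)
```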
ch-wan merged commit 5f7aee7 into main Apr 18, 2026
236 of 328 checks passed
ch-wan deleted the cheng/refactor/moe-triton branch April 18, 2026 00:05
bingxche added a commit that referenced this pull request Apr 18, 2026
jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
Wen-xuan-Xu added a commit to Wen-xuan-Xu/sglang that referenced this pull request Apr 29, 2026
After sgl-project#23019 moved the MoE config loader and the configs/ tree from
`fused_moe_triton/` to `moe_runner/triton_utils/`, two later PRs
unknowingly added 33 tuned-config JSONs to the OLD path:

- sgl-project#22791 (LFM2)        — 24 files (E=32/64, H100/B200/MI325X)
- sgl-project#23533 (Hy3 preview) —  9 files (E=192,N=192 incl. _down,
                                    H20/H20-3e/B200)

The runtime loader anchors its search via
os.path.dirname(os.path.realpath(__file__)) of the loader file
(now in moe_runner/triton_utils/), so configs in the old
directory were never read — runtime fell back to
get_default_config().

The configs themselves were properly tuned and benchmarked at
submission time via the in-process override_config() path used
by the tuning script — that is why the PR authors observed real
speedup. The bug is purely a wrong filesystem location.

Root cause: the tuning README still pointed contributors to the
old path. This PR moves the misplaced configs into the
runtime-loaded location and fixes the README.

Changes:
  * R100 git-mv 33 JSONs into moe_runner/triton_utils/configs/{triton_3_5_1,triton_3_6_0}/
  * Update benchmark/kernels/fused_moe_triton/README.md path

No content changes. No code changes.

References: sgl-project#23019 sgl-project#22791 sgl-project#23533
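
The crux is the difference between the in-process tuning path and the runtime file loader; a sketch of the context-manager override described above (simplified, not the actual sglang code):

```python
import contextlib

_override = None  # consulted before the on-disk configs/ tree

@contextlib.contextmanager
def override_config(config):
    """Tuning scripts benchmark candidate configs in-process through this
    override, so they never read configs/ from disk -- which is why misplaced
    JSONs still showed a speedup at tuning time yet were invisible at runtime."""
    global _override
    previous, _override = _override, config
    try:
        yield
    finally:
        _override = previous

def get_config():
    return _override  # None means: fall back to the file-based loader
```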
Wen-xuan-Xu added a commit to Wen-xuan-Xu/sglang that referenced this pull request Apr 29, 2026
hnyls2002 mentioned this pull request Apr 29, 2026
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026

Labels

amd, deepseek, documentation, lora, quant, run-ci
