[refactor] remove unnecessary padding in MoE by tianyu-l · Pull Request #2774 · pytorch/torchtitan

tianyu-l · 2026-04-01T04:24:18Z

This PR includes the following changes:

- Remove the Triton kernel for permutation + padding, as it is not required by the default bf16 grouped mm.
- Replace the kernel with a torch-native `generate_permutation_indices` implementation, which doesn't do padding.
- Add `TorchAOExpertParallel` to do permutation + padding during EP, using the same Triton kernel, which has moved to torchao.
- Add a `llama4_debugmodel_fp8` entry in the llama4 `config_registry` to demonstrate this.
- Refactor HybridEP to take `pad_multiple` as input so that it works with MXFP8.
- Flatten the `common/moe/` folder by moving `moe.py` and `moe_deepep.py` to `common/` and removing `utils.py`.
- Remove `dual_pipe_v.py`, as it doesn't compose with SAC and is not maintained.
- Disable `compile.enable` in the llama4 and gpt-oss CI, as they break the CI even on main; tracked in "PP + Compile breaking CI" #2771 and "gpt-oss + compile breaking CI" #2776, respectively.

Numerics of before vs. after on llama4 debugmodel [dp8, ep4] are bitwise identical.
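
The torch-native, padding-free index generation mentioned in the description can be sketched roughly as follows. This is not the code from the PR: the function name, the argument names, and the assumed input layout (token counts ordered by source EP rank first, then by local expert) are assumptions made for illustration.

```python
import torch


def generate_permute_indices_sketch(
    tokens_per_expert_group: torch.Tensor,  # shape: (ep_degree * num_local_experts,)
    num_local_experts: int,
    ep_degree: int,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Reorder tokens from rank-major to expert-major layout, without padding."""
    counts = tokens_per_expert_group.view(ep_degree, num_local_experts)

    # Start offset of every (rank, expert) chunk in the incoming rank-major layout.
    flat = counts.reshape(-1)
    starts = (torch.cumsum(flat, dim=0) - flat).view(ep_degree, num_local_experts)

    # Visit the chunks in expert-major order: expert 0 from all ranks, then expert 1, ...
    counts_em = counts.t().reshape(-1)
    starts_em = starts.t().reshape(-1)

    # seg_ids[i] is the chunk that output position i belongs to.
    seg_ids = torch.repeat_interleave(torch.arange(counts_em.numel()), counts_em)

    # Offset of each output position inside its chunk.
    out_starts = torch.cumsum(counts_em, dim=0) - counts_em
    pos_in_chunk = torch.arange(seg_ids.shape[0]) - out_starts[seg_ids]

    permuted_indices = starts_em[seg_ids] + pos_in_chunk
    num_tokens_per_expert = counts.sum(dim=0)
    return permuted_indices, num_tokens_per_expert


# Two EP ranks, two local experts: rank 0 sends (2, 1) tokens, rank 1 sends (1, 2).
indices, per_expert = generate_permute_indices_sketch(
    torch.tensor([2, 1, 1, 2]), num_local_experts=2, ep_degree=2
)
print(indices.tolist())     # [0, 1, 3, 2, 4, 5]
print(per_expert.tolist())  # [3, 3]
```

Because no padding rows are inserted, the result is a true permutation of `range(total_tokens)`, so `tokens[indices]` has exactly as many rows as `tokens`.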

fegin · 2026-04-01T16:38:27Z

+PAD_MULTIPLE_MAP: dict[str, int] = {
+    "float8": 16,
+    "mxfp8": 32,
+}


Suggested change

-PAD_MULTIPLE_MAP: dict[str, int] = {
-    "float8": 16,
-    "mxfp8": 32,
-}
+class PAD_MULTIPLE_MAP(IntEnum):
+    float8 = 16
+    mxfp8 = 32

tianyu-l · 2026-04-02

I'm trying to

- be consistent with https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/quantization/float8.py#L38
- be consistent with https://github.com/pytorch/torchtitan/blob/main/torchtitan/config/__init__.py#L9

In general, torchtitan uses str / Literal instead of Enum.
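
As a rough illustration of how a str-keyed map like this might be consumed, the sketch below rounds a per-expert token count up to the recipe's pad multiple. The `QuantRecipe` alias and the `padded_group_size` helper are hypothetical names, and plain round-up semantics are an assumption; the PR does not show how the map is used.

```python
from typing import Literal

# Hypothetical alias, following the str / Literal convention described above.
QuantRecipe = Literal["float8", "mxfp8"]

PAD_MULTIPLE_MAP: dict[str, int] = {
    "float8": 16,
    "mxfp8": 32,
}


def padded_group_size(num_tokens: int, recipe: QuantRecipe) -> int:
    """Round num_tokens up to the next multiple required by the recipe (assumed semantics)."""
    multiple = PAD_MULTIPLE_MAP[recipe]
    return (num_tokens + multiple - 1) // multiple * multiple


print(padded_group_size(17, "float8"))  # 32
print(padded_group_size(33, "mxfp8"))   # 64
```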

danielvegamyhre · 2026-04-01T20:26:32Z

+
+def _unpermute(out, input_shape, permuted_indices):
+    out_unpermuted = out.new_empty(input_shape)
+    out_unpermuted[permuted_indices, :] = out


Did you test mxfp8, or only bf16? In my impl I found that for the standard EP case + mxfp8, the extra padding row added at index -1 by the EP per-group padding logic needs to be removed as part of unpermute. (`generate_permute_indices` generates -1 indices, then `permute()` implements padding by selecting this row in `permuted = tokens[permuted_indices, :]`.) However, for the non-EP case, padding and unpadding will have been done by torchao CUDA kernels, and no such row removal is needed. Does your implementation handle this differently?

    def _token_combine(
        self, mod: nn.Module, routed_output: Tensor, device_mesh: DeviceMesh
    ) -> Tensor:
        # If per-group padding was done to prepare for MXFP8 grouped mm, there is an
        # extra 'padding' row that `permuted_indices` selects from to add padding
        # into the routed_input in dispatch.
        routed_output = _unpermute(
            routed_output,
            self.input_shape,
            self.permuted_indices,
            remove_padding_row=True,
        )

tianyu-l · 2026-04-02

@danielvegamyhre
This is the bf16 path. For mxfp8, please take a look at the `_unpermute` in `TorchAOExpertParallel`.

No, I haven't tested mxfp8 because I don't have a Blackwell dev machine. But I probably should try to get one.

danielvegamyhre · 2026-04-02

I see now - lgtm
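
The two cases discussed in this thread can be sketched as below. This is an illustrative reconstruction, not the code from torchtitan or torchao: `unpermute_sketch` is a hypothetical name, and the `remove_padding_row` flag is borrowed from the snippet quoted in the review comment.

```python
import torch


def unpermute_sketch(
    out: torch.Tensor,
    input_shape: torch.Size,
    permuted_indices: torch.Tensor,
    remove_padding_row: bool = False,
) -> torch.Tensor:
    """Scatter rows back to their pre-permutation positions."""
    out_unpermuted = out.new_empty(input_shape)
    out_unpermuted[permuted_indices, :] = out
    if remove_padding_row:
        # Index -1 addressed an extra row appended for padding; drop it again.
        out_unpermuted = out_unpermuted[:-1]
    return out_unpermuted


tokens = torch.arange(8, dtype=torch.float32).view(4, 2)

# bf16-style path: indices form a true permutation, nothing to remove.
idx = torch.tensor([2, 0, 3, 1])
restored = unpermute_sketch(tokens[idx], tokens.shape, idx)
assert torch.equal(restored, tokens)

# Padded path: a zero row is appended and selected through index -1.
padded = torch.cat([tokens, tokens.new_zeros(1, 2)])
idx_padded = torch.tensor([2, 0, -1, -1, 3, 1])
restored = unpermute_sketch(
    padded[idx_padded], padded.shape, idx_padded, remove_padding_row=True
)
assert torch.equal(restored, tokens)
```

In the padded case several output rows are written to the same padding slot; that is harmless here because the slot is discarded before returning.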

fegin

Should we add a UT for `_generate_permute_indices`? Now that it is a pure torch function rather than a Triton kernel, we should have a unit test for it.
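
One possible shape for such a test is to compare the function's output against a deliberately naive loop-based reference. The sketch below shows only the reference and the comparison helper; both names are hypothetical, and the rank-major input layout is an assumption rather than something stated in this thread.

```python
import torch


def reference_permute_indices(counts: list[list[int]]) -> list[int]:
    """Naive reference. counts[r][e] = tokens sent by EP rank r to local expert e."""
    ep_degree, num_local_experts = len(counts), len(counts[0])

    # Start offset of each (rank, expert) chunk in the rank-major input.
    starts, offset = [], 0
    for r in range(ep_degree):
        row = []
        for e in range(num_local_experts):
            row.append(offset)
            offset += counts[r][e]
        starts.append(row)

    # Emit chunks in expert-major order.
    expected: list[int] = []
    for e in range(num_local_experts):
        for r in range(ep_degree):
            expected.extend(range(starts[r][e], starts[r][e] + counts[r][e]))
    return expected


def assert_matches_reference(indices: torch.Tensor, counts: list[list[int]]) -> None:
    expected = reference_permute_indices(counts)
    # Without padding the result must be a permutation of all token positions.
    assert sorted(expected) == list(range(len(expected)))
    assert indices.tolist() == expected


# Usage: pass the indices produced by the function under test.
assert_matches_reference(torch.tensor([0, 1, 3, 2, 4, 5]), [[2, 1], [1, 2]])
```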

Consolidate `apply_compile_dense` and `apply_compile_sparse` into a single `apply_compile` function. The only difference was `capture_scalar_outputs`, which is harmless for dense models.

- Remove the separate compile boundary and EP wrapper for `_run_experts_grouped_mm`; they are no longer needed after fixing the symbolic shape issue in `_generate_permute_indices`.
- Fix `_generate_permute_indices` to use `torch.arange(seg_ids.shape[0])` instead of `torch.arange(total)`. This reuses the unbacked symint from `repeat_interleave` rather than creating a redundant one that produces an `Eq(u1, u2)` constraint inductor cannot lower.
- Remove the `x[:total_tokens]` slice in `_run_experts_for_loop`. Padding was removed in pytorch#2774, so `sum(num_tokens_per_expert) == x.shape[0]` and the slice is a no-op. This also eliminates the need for `torch._check` guards on the unbacked symint.

Consolidate `apply_compile_dense` and `apply_compile_sparse` into a single `apply_compile` function. The only difference was `capture_scalar_outputs`, which is harmless for dense models.

- Drop the `ep_enabled` parameter. `mark_dynamic` is now applied unconditionally (harmless without EP), and the idempotency check uses a simple `_tt_compiled` attribute instead of qualname matching.
- Remove the `x[:total_tokens]` slice in `_run_experts_for_loop`. Padding was removed in pytorch#2774, so `sum(num_tokens_per_expert) == x.shape[0]` and the slice is a no-op.

Consolidate `apply_compile_dense` and `apply_compile_sparse` into a single `apply_compile` function. The only difference was `capture_scalar_outputs`, which is harmless for dense models.

- Remove the separate compile boundary and EP wrapper for `_run_experts_grouped_mm`; they are no longer needed now that the CI uses a cu130+ nightly, which handles the unbacked-symint `Eq(u1, u2)` constraints in inductor.
- Remove the `x[:total_tokens]` slice in `_run_experts_for_loop`. Padding was removed in pytorch#2774, so `sum(num_tokens_per_expert) == x.shape[0]` and the slice is a no-op.
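
To see why the `x[:total_tokens]` slice becomes a no-op once padding is gone, consider this minimal sketch of a for-loop expert dispatch. It is not the torchtitan implementation: the function name is hypothetical, and each expert is reduced to a single matmul as a simplifying assumption.

```python
import torch


def run_experts_for_loop_sketch(
    x: torch.Tensor,                      # (total_tokens, dim), grouped by expert
    w: torch.Tensor,                      # (num_experts, dim, out_dim)
    num_tokens_per_expert: torch.Tensor,  # (num_experts,)
) -> torch.Tensor:
    splits = num_tokens_per_expert.tolist()
    # Without padding rows, the counts cover every row of x,
    # so x[: sum(splits)] would return x unchanged.
    assert sum(splits) == x.shape[0]
    chunks = torch.split(x, splits, dim=0)
    outs = [chunk @ w[e] for e, chunk in enumerate(chunks)]
    return torch.cat(outs, dim=0)


x = torch.randn(6, 4)
w = torch.randn(2, 4, 3)
out = run_experts_for_loop_sketch(x, w, torch.tensor([2, 4]))
print(out.shape)  # torch.Size([6, 3])
```

With padded inputs, `x` would carry extra rows beyond `sum(num_tokens_per_expert)`, which is the case the removed slice used to handle.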

tianyu-l requested a review from danielvegamyhre April 1, 2026 04:24

tianyu-l requested review from fegin, wconstab and wwwjn as code owners April 1, 2026 04:24

pytorch-bot Bot added the ciflow/8gpu label Apr 1, 2026

meta-cla Bot added the CLA Signed label Apr 1, 2026

tianyu-l mentioned this pull request Apr 1, 2026

Only apply grouped GEMM padding for MXFP8 and FP8 non-HybridEP cases #2620

Closed

tianyu-l requested a review from acisseJZhong April 1, 2026 04:25

tianyu-l force-pushed the padding branch 3 times, most recently from 8ab5a1e to 7566b6d Compare April 1, 2026 06:34

fegin reviewed Apr 1, 2026

danielvegamyhre reviewed Apr 1, 2026

tianyu-l mentioned this pull request Apr 1, 2026

[MoE] change torch.bmm back to scatter add #2775

Merged

fegin approved these changes Apr 2, 2026

[refactor] remove unnecessary padding in MoE

b1d08e5

tianyu-l force-pushed the padding branch from 7566b6d to b1d08e5 Compare April 2, 2026 07:18

tianyu-l merged commit fe80b63 into main Apr 2, 2026
22 of 34 checks passed

tianyu-l deleted the padding branch April 2, 2026 07:55

tianyu-l mentioned this pull request Apr 3, 2026

Enable per-layer compile with or without MoE #2741

Merged

aditvenk mentioned this pull request Apr 8, 2026

torch.compile fails with DeepSeekV3 + SimpleFSDP #2312

Closed
