
[refactor] remove unnecessary padding in MoE#2774

Merged
tianyu-l merged 1 commit into main from padding on Apr 2, 2026

@tianyu-l tianyu-l commented Apr 1, 2026

Contributor

This PR includes the following changes:

  • Remove the Triton kernel for permutation + padding, since the default bf16 grouped mm does not require padding.
  • Replace the kernel with a torch-native generate_permutation_indices implementation, which does no padding.
  • Add TorchAOExpertParallel, which does permutation + padding during EP using the same Triton kernel, now moved to torchao.
  • Add llama4_debugmodel_fp8 to the llama4 config_registry to demonstrate it.
  • Refactor HybridEP to take pad_multiple as input so that it works with MXFP8.
  • Flatten the common/moe/ folder by moving moe.py and moe_deepep.py to common/ and removing utils.py.
  • Remove dual_pipe_v.py, as it doesn't compose with SAC and is not maintained.
  • Disable compile.enable in the llama4 and gpt-oss CI, as both break the CI even on main; tracked in "PP + Compile breaking CI" #2771 and "gpt-oss + compile breaking CI" #2776, respectively.

Numerics of before vs. after on llama4 debugmodel [dp8, ep4] are bitwise identical.
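To make the "permutation without padding" idea concrete, here is a minimal plain-Python sketch. It is not the code from this PR: the function name `permute_indices_no_padding`, the rank-major input layout, and the use of lists instead of torch tensors are all assumptions made for illustration.

```python
from itertools import accumulate


def permute_indices_no_padding(tokens_per_expert_group, experts_per_rank, ep_size):
    """Hypothetical sketch: reorder rank-major tokens into expert-major order.

    tokens_per_expert_group[r * experts_per_rank + e] is assumed to hold the
    number of tokens received from rank r for local expert e.
    No padding rows are inserted, so the result is a true permutation.
    """
    counts = list(tokens_per_expert_group)
    assert len(counts) == ep_size * experts_per_rank
    # Start offset of every (rank, expert) chunk in the rank-major input.
    starts = [0, *accumulate(counts)][:-1]

    permuted_indices = []
    tokens_per_expert = []
    for e in range(experts_per_rank):
        total = 0
        for r in range(ep_size):
            chunk = r * experts_per_rank + e
            begin = starts[chunk]
            permuted_indices.extend(range(begin, begin + counts[chunk]))
            total += counts[chunk]
        tokens_per_expert.append(total)
    return permuted_indices, tokens_per_expert


indices, per_expert = permute_indices_no_padding([2, 1, 1, 3], experts_per_rank=2, ep_size=2)
```

Because nothing is padded, the indices contain every input row exactly once, and the per-expert counts sum to the number of input rows.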

@tianyu-l tianyu-l requested a review from danielvegamyhre April 1, 2026 04:24
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 1, 2026
@tianyu-l tianyu-l requested a review from acisseJZhong April 1, 2026 04:25
@tianyu-l tianyu-l force-pushed the padding branch 3 times, most recently from 8ab5a1e to 7566b6d Compare April 1, 2026 06:34
Comment on lines +37 to +40
PAD_MULTIPLE_MAP: dict[str, int] = {
    "float8": 16,
    "mxfp8": 32,
}

Contributor

Suggested change
- PAD_MULTIPLE_MAP: dict[str, int] = {
-     "float8": 16,
-     "mxfp8": 32,
- }
+ class PAD_MULTIPLE_MAP(IntEnum):
+     float8 = 16
+     mxfp8 = 32
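As a rough illustration of how such a pad multiple might be used, the sketch below rounds each expert's token count up to the multiple for a given recipe. The names `PadMultiple`, `round_up`, and `padded_group_sizes` are hypothetical, and the thread does not say how empty groups are treated; this sketch simply leaves them at zero.

```python
from enum import IntEnum


class PadMultiple(IntEnum):
    """Hypothetical enum mirroring the values in the suggestion above."""
    float8 = 16
    mxfp8 = 32


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n (for n >= 0)."""
    return -(-n // multiple) * multiple


def padded_group_sizes(tokens_per_expert, recipe):
    """Round each group size up to the pad multiple of the named recipe."""
    multiple = PadMultiple[recipe]  # lookup by member name, e.g. "mxfp8"
    return [round_up(n, multiple) for n in tokens_per_expert]


fp8_sizes = padded_group_sizes([3, 16, 0, 33], "float8")
mx_sizes = padded_group_sizes([3, 16, 0, 33], "mxfp8")
```

With an `IntEnum`, members still behave as integers in arithmetic, while a lookup with an unknown recipe name raises `KeyError`, much like the dict version.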

Contributor Author

I'm trying to

Comment thread torchtitan/experiments/autoparallel/deepseek_v3/parallelize_deepseekv3.py Outdated
Comment thread torchtitan/distributed/expert_parallel.py Outdated

def _unpermute(out, input_shape, permuted_indices):
    out_unpermuted = out.new_empty(input_shape)
    out_unpermuted[permuted_indices, :] = out

@danielvegamyhre danielvegamyhre Apr 1, 2026

Contributor

Did you test mxfp8, or only bf16? In my impl I found that for the standard EP case + mxfp8, the extra padding row added at index -1 by the EP per-group padding logic needs to be removed as part of unpermute (generate_permute_indices generates -1 indices, and permute() then implements padding by selecting this row in permuted = tokens[permuted_indices, :]). However, for the non-EP case, padding and unpadding will have been done by torchao CUDA kernels, and no such row removal is needed. Does your implementation handle this differently?

    def _token_combine(
        self, mod: nn.Module, routed_output: Tensor, device_mesh: DeviceMesh
    ) -> Tensor:
        # If per group padding was done to prepare for MXFP8 grouped mm, there is an extra 'padding' row that
        # `permuted_indices` selects from to add padding into the routed_input in dispatch.
        routed_output = _unpermute(
            routed_output,
            self.input_shape,
            self.permuted_indices,
            remove_padding_row=True,
        )
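The padding-row mechanism described in this comment can be sketched in plain Python as follows. This is an assumption-laden illustration, not the torchao or torchtitan code: rows are lists of floats, and `permute_with_padding` and `unpermute_drop_padding` are hypothetical helper names.

```python
def permute_with_padding(tokens, permuted_indices):
    """Gather rows; index -1 selects an extra all-zero padding row."""
    width = len(tokens[0])
    padded = tokens + [[0.0] * width]  # the padding row sits at index -1
    return [padded[i] for i in permuted_indices]


def unpermute_drop_padding(out, num_tokens, permuted_indices):
    """Scatter rows back to their original positions, then drop the padding row."""
    width = len(out[0])
    # One extra slot so that writes through index -1 land in a scratch row.
    buf = [[0.0] * width for _ in range(num_tokens + 1)]
    for row, i in zip(out, permuted_indices):
        buf[i] = row
    return buf[:-1]  # remove the padding row


tokens = [[1.0], [2.0], [3.0]]
permuted_indices = [2, -1, 0, 1, -1, -1]  # -1 marks padding positions
routed = permute_with_padding(tokens, permuted_indices)
restored = unpermute_drop_padding(routed, len(tokens), permuted_indices)
```

The round trip returns the original rows only because the scratch row is dropped at the end, which is the row removal the comment asks about.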

Contributor Author

@danielvegamyhre
This is the bf16 path. For mxfp8 please take a look at the _unpermute in TorchAOExpertParallel.

No, I haven't tested mxfp8 because I don't have a Blackwell dev machine. But I probably should try to get one.

Contributor

i see now - lgtm

@fegin fegin left a comment

Contributor

Should we add a UT for _generate_permute_indices? Now that it is a pure torch function rather than a Triton kernel, we should have a unit test for it.

@tianyu-l tianyu-l merged commit fe80b63 into main Apr 2, 2026
22 of 34 checks passed
@tianyu-l tianyu-l deleted the padding branch April 2, 2026 07:55
weifengpy added a commit to weifengpy/torchtitan that referenced this pull request Apr 3, 2026
Consolidate apply_compile_dense and apply_compile_sparse into a single
apply_compile function. The only difference was capture_scalar_outputs
which is harmless for dense models.

Remove the _run_experts_grouped_mm separate compile boundary and EP
wrapper — no longer needed after fixing the symbolic shape issue in
_generate_permute_indices.

Fix _generate_permute_indices to use torch.arange(seg_ids.shape[0])
instead of torch.arange(total), reusing the unbacked symint from
repeat_interleave rather than creating a redundant one that produces
an Eq(u1, u2) constraint inductor cannot lower.

Remove the x[:total_tokens] slice in _run_experts_for_loop — padding
was removed in pytorch#2774, so sum(num_tokens_per_expert) == x.shape[0] and
the slice is a no-op. This also eliminates the need for torch._check
guards on the unbacked symint.
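The two observations in this commit message can be mirrored in a plain-Python analogue. It cannot reproduce the tracing behavior (unbacked symints only exist under torch.compile), and the helper names below are hypothetical; it only shows that the length taken from `seg_ids` equals the recomputed total, and that the slice is a no-op once padding rows are gone.

```python
from itertools import accumulate


def repeat_interleave(values, counts):
    """List analogue of torch.repeat_interleave(values, counts)."""
    out = []
    for value, count in zip(values, counts):
        out.extend([value] * count)
    return out


def offsets_within_segments(num_tokens_per_expert):
    """For each token, its position inside its own expert's segment."""
    experts = range(len(num_tokens_per_expert))
    seg_ids = repeat_interleave(experts, num_tokens_per_expert)
    # Take the length from seg_ids itself rather than recomputing
    # sum(num_tokens_per_expert); the values are equal, but only one
    # size needs to be named.
    positions = range(len(seg_ids))
    starts = [0, *accumulate(num_tokens_per_expert)][:-1]
    return [p - starts[s] for p, s in zip(positions, seg_ids)]


num_tokens_per_expert = [2, 0, 3]
x = ["t0", "t1", "t2", "t3", "t4"]  # one entry per routed token, no padding rows
total_tokens = sum(num_tokens_per_expert)
offsets = offsets_within_segments(num_tokens_per_expert)
```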
TXacs pushed a commit to McmillanTAC/torchtitan that referenced this pull request Apr 13, 2026
weifengpy added a commit to weifengpy/torchtitan that referenced this pull request Apr 14, 2026
Consolidate apply_compile_dense and apply_compile_sparse into a single
apply_compile function. The only difference was capture_scalar_outputs
which is harmless for dense models.

Drop the ep_enabled parameter — mark_dynamic is now applied
unconditionally (harmless without EP) and the idempotency check uses a
simple _tt_compiled attribute instead of qualname matching.

Remove the x[:total_tokens] slice in _run_experts_for_loop — padding
was removed in pytorch#2774, so sum(num_tokens_per_expert) == x.shape[0] and
the slice is a no-op.
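An attribute-based idempotency check of the kind mentioned here might look like the sketch below. Only the attribute name `_tt_compiled` comes from the commit message; the `Block` class, the `apply_compile_once` function, and the call counter standing in for the real compile step are hypothetical.

```python
class Block:
    """Hypothetical stand-in for a model block (not an nn.Module)."""

    def __init__(self):
        self.compile_calls = 0


def apply_compile_once(block):
    """Run the (stand-in) compile step at most once per block."""
    if getattr(block, "_tt_compiled", False):
        return block  # already processed; do nothing
    block.compile_calls += 1  # placeholder for the real compile call
    block._tt_compiled = True  # mark the block so later calls are skipped
    return block


block = Block()
apply_compile_once(block)
apply_compile_once(block)
```

Compared with matching on a qualified name, a marker attribute stays attached to the object itself, so the check does not depend on how the object was wrapped or renamed.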
weifengpy added a commit to weifengpy/torchtitan that referenced this pull request Apr 14, 2026
Consolidate apply_compile_dense and apply_compile_sparse into a single
apply_compile function. The only difference was capture_scalar_outputs
which is harmless for dense models.

Remove the _run_experts_grouped_mm separate compile boundary and EP
wrapper — no longer needed now that the CI uses cu130+ nightly which
handles the unbacked-symint Eq(u1, u2) constraints in inductor.

Remove the x[:total_tokens] slice in _run_experts_for_loop — padding
was removed in pytorch#2774, so sum(num_tokens_per_expert) == x.shape[0] and
the slice is a no-op.
ACharacterInASimulation pushed a commit to ACharacterInASimulation/torchtitan that referenced this pull request Apr 21, 2026
including
- remove triton kernel for permutation+padding as it is not required by
default bf16 grouped mm.
- replace the kernel with a torch-native `generate_permutation_indices`
impl, which doesn't do padding.
- added `TorchAOExpertParallel` to do permutation+padding during EP
using the same triton kernel moved to torchao.
- added a `llama4_debugmodel_fp8` in llama4 `config_registry` to
demonstrate
- refactored HybridEP to take `pad_multiple` as input so that it works
with MXFP8.
- flatten `common/moe/` folder by moving `moe.py` and `moe_deepep.py` to
`common/` and remove `utils.py`
- remove `dual_pipe_v.py` as it doesn't compose with SAC and is not
maintained.
- disabled `compile.enable` in llama4 and gpt-oss CI as they break the
CI even on main, tracking in
pytorch#2771 and
pytorch#2776, respectively

Numerics of before vs. after on llama4 debugmodel [dp8, ep4] are bitwise
identical.
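
As a rough illustration of what a `pad_multiple` implies for the token layout, the sketch below rounds each expert's token count up to the next multiple and computes the resulting start offsets. The multiples are the ones in the `PAD_MULTIPLE_MAP` quoted in the review comment above; the helper names are hypothetical, and leaving empty experts at zero rows is an assumption of this sketch, not a statement about the torchao kernel.

```python
PAD_MULTIPLE_MAP = {"float8": 16, "mxfp8": 32}


def round_up(n, multiple):
    """Round n up to the nearest multiple (0 stays 0 in this sketch)."""
    return (n + multiple - 1) // multiple * multiple


def padded_layout(num_tokens_per_expert, pad_multiple):
    """Return padded per-expert counts and the start offset of each expert's block."""
    padded_counts = [round_up(n, pad_multiple) for n in num_tokens_per_expert]
    offsets = []
    running = 0
    for count in padded_counts:
        offsets.append(running)
        running += count
    return padded_counts, offsets


counts = [5, 0, 17]
fp8_counts, fp8_offsets = padded_layout(counts, PAD_MULTIPLE_MAP["float8"])
mx_counts, mx_offsets = padded_layout(counts, PAD_MULTIPLE_MAP["mxfp8"])
```

With the default bf16 grouped mm no such alignment is needed, which is why the PR drops padding from the default path.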
Labels

ciflow/8gpu, CLA Signed
