PAD_MULTIPLE_MAP: dict[str, int] = {
    "float8": 16,
    "mxfp8": 32,
}
Suggested change:
class PAD_MULTIPLE_MAP(IntEnum):
    float8 = 16
    mxfp8 = 32
I'm trying to
- be consistent with https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/quantization/float8.py#L38
- be consistent with https://github.com/pytorch/torchtitan/blob/main/torchtitan/config/__init__.py#L9 (in general, torchtitan uses str/Literal instead of Enum)
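To illustrate the str/Literal convention, here is a minimal sketch that keeps the plain dict while still letting a type checker reject unknown recipe names. `QuantRecipe` and `padded_group_size` are hypothetical names invented for this example, not torchtitan APIs, and rounding each group up to the pad multiple is an assumption about how the map is consumed. Only the two map entries come from the diff.

```python
from typing import Literal

# Hypothetical alias; only the keys and values below come from the diff.
QuantRecipe = Literal["float8", "mxfp8"]

PAD_MULTIPLE_MAP: dict[str, int] = {
    "float8": 16,
    "mxfp8": 32,
}


def padded_group_size(num_tokens: int, recipe: QuantRecipe) -> int:
    """Round a per-expert token count up to the recipe's pad multiple."""
    multiple = PAD_MULTIPLE_MAP[recipe]
    # Ceiling division, then scale back up to the nearest multiple.
    return -(-num_tokens // multiple) * multiple
```

With this shape, a call such as `padded_group_size(17, "float8")` type-checks, while a misspelled recipe string is flagged statically instead of failing at lookup time.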
def _unpermute(out, input_shape, permuted_indices):
    out_unpermuted = out.new_empty(input_shape)
    out_unpermuted[permuted_indices, :] = out
Did you test mxfp8, or only bf16? In my implementation I found that for the standard EP case with mxfp8, the extra padding row that the EP per-group padding logic adds at index -1 needs to be removed as part of unpermute (generate_permute_indices generates -1 indices, and permute() implements padding by selecting this row in permuted = tokens[permuted_indices, :]). For the non-EP case, however, padding and unpadding will already have been done by torchao CUDA kernels, so no such row removal is needed. Does your implementation handle this differently?
def _token_combine(
    self, mod: nn.Module, routed_output: Tensor, device_mesh: DeviceMesh
) -> Tensor:
    # If per-group padding was done to prepare for MXFP8 grouped mm, there is an extra 'padding' row that
    # `permuted_indices` selects from to add padding into the routed_input in dispatch.
    routed_output = _unpermute(
        routed_output,
        self.input_shape,
        self.permuted_indices,
        remove_padding_row=True,
    )
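The padding-row mechanism described in this comment can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the code from either implementation: `permute_with_padding` is a hypothetical helper, the padding row is assumed to be all zeros, and `input_shape` is assumed to include the extra row.

```python
import torch


def permute_with_padding(tokens: torch.Tensor, permuted_indices: torch.Tensor) -> torch.Tensor:
    # Append one all-zero row so that index -1 selects padding (assumption).
    padded = torch.vstack([tokens, tokens.new_zeros(1, tokens.shape[-1])])
    return padded[permuted_indices, :]


def unpermute(
    out: torch.Tensor,
    input_shape: tuple[int, int],
    permuted_indices: torch.Tensor,
    remove_padding_row: bool = False,
) -> torch.Tensor:
    # input_shape counts the extra padding row when one was appended.
    out_unpermuted = out.new_empty(input_shape)
    # Every -1 index writes into the last row, which is discarded below.
    out_unpermuted[permuted_indices, :] = out
    if remove_padding_row:
        out_unpermuted = out_unpermuted[:-1]
    return out_unpermuted


tokens = torch.arange(6, dtype=torch.float32).view(3, 2)
indices = torch.tensor([2, -1, 0, 1, -1, -1])  # -1 marks padding slots
permuted = permute_with_padding(tokens, indices)
restored = unpermute(permuted, (4, 2), indices, remove_padding_row=True)
```

When padding is done elsewhere (the non-EP case mentioned above), no row is appended, so `remove_padding_row` would stay `False` and `input_shape` would match the original token count.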
@danielvegamyhre
This is the bf16 path. For mxfp8 please take a look at the _unpermute in TorchAOExpertParallel.
No, I haven't tested mxfp8 because I don't have a Blackwell dev machine. I should probably try to get one.
fegin
left a comment
Should we add a UT for _generate_permute_indices? Now that it is a pure torch function rather than a Triton kernel, we should have a unit test for it.
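A unit test along these lines might look like the sketch below. The function is a hypothetical stand-in written for this example, not the PR's `_generate_permute_indices`: its name, signature, and return values are assumptions, and it assumes the incoming token counts are laid out rank-major (all local experts for rank 0, then rank 1, and so on) and must be regrouped expert-major without padding.

```python
import torch


def generate_permute_indices_sketch(
    tokens_per_expert_group: torch.Tensor,  # shape (num_ranks * experts_per_rank,)
    experts_per_rank: int,
    num_ranks: int,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical stand-in: reorder rank-major token chunks to expert-major."""
    counts = tokens_per_expert_group
    # Start offset of every (rank, expert) chunk in the incoming buffer.
    starts = torch.cumsum(counts, 0) - counts
    # Chunk visiting order: for each local expert, for each source rank.
    order = (
        torch.arange(num_ranks * experts_per_rank)
        .view(num_ranks, experts_per_rank)
        .t()
        .reshape(-1)
    )
    chunk_sizes, chunk_starts = counts[order], starts[order]
    # seg_ids[i] is the chunk that output row i belongs to.
    seg_ids = torch.repeat_interleave(torch.arange(order.numel()), chunk_sizes)
    out_starts = torch.cumsum(chunk_sizes, 0) - chunk_sizes
    # Size the arange from seg_ids itself rather than a separately computed total.
    within_chunk = torch.arange(seg_ids.shape[0]) - out_starts[seg_ids]
    permuted_indices = chunk_starts[seg_ids] + within_chunk
    num_tokens_per_expert = chunk_sizes.view(experts_per_rank, num_ranks).sum(dim=1)
    return permuted_indices, num_tokens_per_expert


def test_generate_permute_indices_sketch():
    # Rank 0 holds 2 tokens for expert 0 and 1 for expert 1; rank 1 holds 1 and 3.
    counts = torch.tensor([2, 1, 1, 3])
    indices, per_expert = generate_permute_indices_sketch(
        counts, experts_per_rank=2, num_ranks=2
    )
    assert indices.tolist() == [0, 1, 3, 2, 4, 5, 6]
    assert per_expert.tolist() == [3, 4]
```

Because the function is plain torch, the test runs on CPU with no Triton or GPU dependency, which is the main benefit of testing it at this level.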
Consolidate apply_compile_dense and apply_compile_sparse into a single apply_compile function. The only difference was capture_scalar_outputs, which is harmless for dense models.
Remove the separate _run_experts_grouped_mm compile boundary and EP wrapper, which are no longer needed after fixing the symbolic shape issue in _generate_permute_indices.
Fix _generate_permute_indices to use torch.arange(seg_ids.shape[0]) instead of torch.arange(total), reusing the unbacked symint from repeat_interleave rather than creating a redundant one that produces an Eq(u1, u2) constraint inductor cannot lower.
Remove the x[:total_tokens] slice in _run_experts_for_loop: padding was removed in pytorch#2774, so sum(num_tokens_per_expert) == x.shape[0] and the slice is a no-op. This also eliminates the need for torch._check guards on the unbacked symint.
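To make the no-op slice argument concrete, here is a minimal sketch of a per-expert loop. It is not the repository's `_run_experts_for_loop`: the function name, the single weight matrix per expert, and the tensor shapes are assumptions made for illustration.

```python
import torch


def run_experts_for_loop_sketch(
    x: torch.Tensor,                      # (total_tokens, dim), grouped by expert
    weights: torch.Tensor,                # (num_experts, dim, out_dim), hypothetical
    num_tokens_per_expert: torch.Tensor,  # (num_experts,)
) -> torch.Tensor:
    sizes = num_tokens_per_expert.tolist()
    # Without padding, sum(sizes) == x.shape[0], so x can be split directly;
    # torch.split raises if the sizes do not add up to x.shape[0].
    chunks = torch.split(x, sizes, dim=0)
    outs = [chunk @ w for chunk, w in zip(chunks, weights)]
    return torch.cat(outs, dim=0)


x = torch.ones(5, 2)
weights = torch.stack([torch.eye(2), 2 * torch.eye(2)])
out = run_experts_for_loop_sketch(x, weights, torch.tensor([2, 3]))
```

Since the split sizes already cover every row of `x`, slicing to `x[:sum(sizes)]` first would return the same tensor.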
including
- remove triton kernel for permutation+padding as it is not required by default bf16 grouped mm.
- replace the kernel with a torch-native `generate_permutation_indices` impl, which doesn't do padding.
- added `TorchAOExpertParallel` to do permutation+padding during EP using the same triton kernel moved to torchao.
- added a `llama4_debugmodel_fp8` in llama4 `config_registry` to demonstrate
- refactored HybridEP to take `pad_multiple` as input so that it works with MXFP8.
- flatten `common/moe/` folder by moving `moe.py` and `moe_deepep.py` to `common/` and remove `utils.py`
- remove `dual_pipe_v.py` as it doesn't compose with SAC and is not maintained.
- disabled `compile.enable` in llama4 and gpt-oss CI as they break the CI even on main, tracking in pytorch#2771 and pytorch#2776, respectively
Numerics of before vs. after on llama4 debugmodel [dp8, ep4] are bitwise identical.
Consolidate apply_compile_dense and apply_compile_sparse into a single apply_compile function. The only difference was capture_scalar_outputs, which is harmless for dense models.
Drop the ep_enabled parameter: mark_dynamic is now applied unconditionally (harmless without EP) and the idempotency check uses a simple _tt_compiled attribute instead of qualname matching.
Remove the x[:total_tokens] slice in _run_experts_for_loop: padding was removed in pytorch#2774, so sum(num_tokens_per_expert) == x.shape[0] and the slice is a no-op.
Consolidate apply_compile_dense and apply_compile_sparse into a single apply_compile function. The only difference was capture_scalar_outputs, which is harmless for dense models.
Remove the separate _run_experts_grouped_mm compile boundary and EP wrapper, which are no longer needed now that the CI uses the cu130+ nightly, which handles the unbacked-symint Eq(u1, u2) constraints in inductor.
Remove the x[:total_tokens] slice in _run_experts_for_loop: padding was removed in pytorch#2774, so sum(num_tokens_per_expert) == x.shape[0] and the slice is a no-op.