
[DeviceMesh] Simplify unflatten method#165556

Closed
lw wants to merge 10 commits into gh/lw/10/base from gh/lw/10/head

Conversation


@lw lw commented Oct 15, 2025

Stack from ghstack (oldest at bottom):

By adding a few small helpers (e.g., a splice method to _MeshLayout, and making _init_process_groups static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 15, 2025

pytorch-bot bot commented Oct 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165556

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5591123 with merge base 5d4da26:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

layout = complement(self, world_size)
return _MeshLayout(layout.shape, layout.stride)

def unflatten(self, dim: int, unflatten_sizes: tuple[int, ...]) -> "_MeshLayout":
@lw (Contributor Author):

I know that it was me who insisted on making this a method of _MeshLayout, but I hadn't realized that we then needed to re-extract the sub-layout in order to create the ProcessGroups. This made for some bulky code.

Thus my new suggestion is that we break this monolithic method into two: a composition one (which already exists) and a splice one. This allows the DeviceMesh to achieve the same result as this method in two lines of code, while easily getting access to the intermediate value it needs.
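A minimal sketch of the two-method decomposition described here (all names, fields, and signatures below are illustrative stand-ins, not PyTorch's actual `_MeshLayout` internals):

```python
# Hypothetical sketch: a layout is a tuple of (size, stride) dims; `composition`
# refines a 1-D layout into several factors, and `splice` replaces dims
# [start, end) with another layout's dims.
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshLayout:
    sizes: tuple[int, ...]
    strides: tuple[int, ...]

    def __getitem__(self, dim: int) -> "MeshLayout":
        # Extract a single dim as a 1-D sub-layout.
        return MeshLayout((self.sizes[dim],), (self.strides[dim],))

    def composition(self, inner: "MeshLayout") -> "MeshLayout":
        # Split this 1-D layout into inner.sizes factors (row-major order).
        assert len(self.sizes) == 1 and math.prod(inner.sizes) == self.sizes[0]
        stride = self.strides[0]
        strides = []
        remaining = self.sizes[0]
        for s in inner.sizes:
            remaining //= s
            strides.append(stride * remaining)
        return MeshLayout(tuple(inner.sizes), tuple(strides))

    def splice(self, start: int, end: int, layout: "MeshLayout") -> "MeshLayout":
        # Replace dims [start, end) with the dims of `layout`.
        sizes = list(self.sizes)
        strides = list(self.strides)
        sizes[start:end] = layout.sizes
        strides[start:end] = layout.strides
        return MeshLayout(tuple(sizes), tuple(strides))


# The "two lines" the caller needs for an unflatten: compute the sub-layout
# (the intermediate value needed, e.g., for ProcessGroup creation), then splice it in.
base = MeshLayout((8,), (1,))                           # one dim of 8 ranks
sub = base[0].composition(MeshLayout((2, 4), (4, 1)))   # intermediate sub-layout
unflattened = base.splice(0, 1, sub)
print(unflattened.sizes, unflattened.strides)           # (2, 4) (4, 1)
```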

return _get_default_group()

@staticmethod
def _init_process_groups(
@lw (Contributor Author):

This could also be extracted to become a global private helper function. I didn't do it as I wanted to keep the diff small.

Comment on lines -196 to -202
sizes = list(self.sizes) # type: ignore[arg-type]
strides = list(self.strides) # type: ignore[arg-type]
unflatten_layout = self[dim].composition(
_MeshLayout(tuple(unflatten_sizes), suffix_product(unflatten_sizes))
)
sizes[dim : dim + 1] = list(unflatten_layout.sizes) # type: ignore[arg-type]
strides[dim : dim + 1] = list(unflatten_layout.strides) # type: ignore[arg-type]
@lw (Contributor Author) commented Oct 15, 2025:

Note that the # type: ignore[arg-type] comments were hiding actual bugs, since .sizes and .strides could be bare integers, and we can't pass those to list(...)! The new code fixes this thanks to as_tuple.
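The pitfall can be sketched as follows (the helper name `as_tuple` comes from the diff; this implementation and the `IntTuple` alias are assumptions for illustration):

```python
# An "IntTuple" may be a bare int or a (possibly nested) tuple of IntTuples,
# so list(sizes) raises TypeError when sizes is a plain int -- the bug that
# the `# type: ignore[arg-type]` comments were hiding.
from typing import Union

IntTuple = Union[int, tuple["IntTuple", ...]]


def as_tuple(x: IntTuple) -> tuple["IntTuple", ...]:
    # Normalize a bare int into a 1-tuple so callers can always iterate.
    return x if isinstance(x, tuple) else (x,)


sizes: IntTuple = 4             # a 1-D layout may store a bare int
# list(sizes) would raise TypeError: 'int' object is not iterable
print(list(as_tuple(sizes)))    # [4]
print(list(as_tuple((2, 3))))   # [2, 3]
```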

Comment on lines +1144 to +1149
dim_group_names = self._dim_group_names.copy()
dim_group_names[dim : dim + 1] = self._init_process_groups(
partial_layout,
root_mesh._global_rank_permutation,
mesh_dim_names,
backend_override,
Contributor comment:

I like this one, since we can reuse it later if we only want to init the backend for some dims rather than all of them. Making _flatten generic, like what Tensor is doing, also needs this.

sizes[dim : dim + 1] = list(unflatten_layout.sizes) # type: ignore[arg-type]
strides[dim : dim + 1] = list(unflatten_layout.strides) # type: ignore[arg-type]
def splice(self, start: int, end: int, layout: "_MeshLayout") -> "_MeshLayout":
sizes = list(as_tuple(self.sizes))
Contributor comment:
With as_tuple and flatten, do we still want to pursue the proposal of limiting the sizes and strides to be at most 2-D?

@lw (Contributor Author):

The way I see it, as_tuple and flatten are the tools we need to make the code correct, but moving from IntTuple to tuple[tuple[int, ...], ...] is what will make the type checker help us detect those bugs in the first place.
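The type-checker point can be illustrated with a toy pair of signatures (both aliases and both functions here are hypothetical, not the actual PyTorch types):

```python
# With a recursive IntTuple union, the checker cannot rule out a bare int, so
# misuse needs runtime checks or `# type: ignore`; with an explicit
# nested-tuple type, the same misuse becomes a static error instead.
from typing import Union

IntTuple = Union[int, tuple["IntTuple", ...]]   # loose: an int can appear anywhere
NestedSizes = tuple[tuple[int, ...], ...]       # strict: always a tuple of tuples


def total_loose(sizes: IntTuple) -> int:
    # Must branch at runtime because `sizes` may be a bare int.
    if isinstance(sizes, int):
        return sizes
    return sum(total_loose(s) for s in sizes)


def total_strict(sizes: NestedSizes) -> int:
    # No ignores needed: every element is statically a tuple[int, ...].
    return sum(sum(group) for group in sizes)


print(total_loose(4))                # 4
print(total_strict(((2, 3), (4,))))  # 9
```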

@fduwjj (Contributor) left a comment:

This change looks reasonable to me as well. And we can build the shared_state on top of that, so that we can also cache the PG for unflatten.


fduwjj commented Oct 15, 2025

Shall we land this before #165555, so that I can continue working on unflatten and PG caching? I think we will need to have that PR out by the end of October.

@lw lw added the topic: not user facing topic category label Oct 16, 2025
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: be7f13b
Pull-Request: #165556
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: 56a0709
Pull-Request: #165556
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: c83ac69
Pull-Request: #165556
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: 3fe55d2
Pull-Request: #165556
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: 8af73fa
Pull-Request: #165556
@lw lw added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 16, 2025

fduwjj commented Oct 16, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):

@lw your PR has been reverted as part of the stack under #165554.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Oct 16, 2025
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 17, 2025
ghstack-source-id: 6736782
Pull-Request: #165556
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 17, 2025
ghstack-source-id: a5f80ee
Pull-Request: #165556
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 17, 2025
ghstack-source-id: 4f51a2b
Pull-Request: #165556
[ghstack-poisoned]
lw added a commit that referenced this pull request Oct 17, 2025
ghstack-source-id: 3ab51d7
Pull-Request: #165556

lw commented Oct 17, 2025

@pytorchbot merge -i

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged while ignoring the following 0 checks:

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.

Pull Request resolved: pytorch#165556
Approved by: https://github.com/fduwjj
ghstack dependencies: pytorch#165554, pytorch#165555
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
@github-actions github-actions bot deleted the gh/lw/10/head branch November 17, 2025 02:18

Labels

ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), Reverted, topic: not user facing (topic category)
