[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks by lw · Pull Request #165555 · pytorch/pytorch

lw · 2025-10-15T16:36:37Z

Stack from ghstack (oldest at bottom):

The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, a public API that contains some "legacy" concepts we'd love to get rid of, such as an explicit/materialized `mesh` Tensor.

In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.

This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`.

With such a constructor we can effectively simplify a lot of call sites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.
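As a rough illustration of how a private constructor overload can be approximated in Python, here is a minimal sketch. It is not the implementation from this PR: `ToyMesh`, the flat 1-D public path, and the `_rank_permutation` parameter are hypothetical names and simplifications chosen for the example. The idea shown is a single `__init__` where the public `mesh` argument becomes optional and underscore-prefixed keyword-only arguments act as the private entry point.

```python
import math
from typing import Optional, Sequence, Tuple


class ToyMesh:
    """Hypothetical stand-in used for illustration; not the real DeviceMesh."""

    def __init__(
        self,
        device_type: str,
        mesh: Optional[Sequence[int]] = None,  # public, "legacy" argument
        *,
        mesh_dim_names: Optional[Tuple[str, ...]] = None,
        _layout: Optional[Tuple[int, ...]] = None,  # private: shape only
        _rank_permutation: Optional[Sequence[int]] = None,  # private: rank order
    ) -> None:
        private_given = _layout is not None or _rank_permutation is not None
        if mesh is not None and private_given:
            raise TypeError("pass either mesh or the private arguments, not both")
        if mesh is not None:
            # Public path: derive a flat, 1-D layout from the given ranks.
            ranks = list(mesh)
            layout = (len(ranks),)
        elif _layout is not None and _rank_permutation is not None:
            # Private path: no materialized mesh is needed.
            ranks = list(_rank_permutation)
            layout = tuple(_layout)
            if math.prod(layout) != len(ranks):
                raise ValueError("layout does not match the number of ranks")
        else:
            raise TypeError("mesh is required unless both private arguments are given")
        self.device_type = device_type
        self.layout = layout
        self.ranks = ranks
        self.mesh_dim_names = mesh_dim_names
```

In this sketch, internal call sites would build an instance directly from a layout, for example `ToyMesh("cpu", _layout=(2, 2), _rank_permutation=[0, 1, 2, 3])`, while public callers keep passing `mesh` as before.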

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]

pytorch-bot · 2025-10-15T16:36:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165555

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d673d5b with merge base 5d4da26:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 3, 3, linux.2xlarge) (gh) (similar failure)
test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_optimizer_single_tensor_pattern

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/distributed/device_mesh.py

fduwjj

Aside from the init override part, the rest looks good to me.

[ghstack-poisoned]

fduwjj

LGTM

[ghstack-poisoned]

fegin · 2025-10-16T16:32:18Z

torch/distributed/device_mesh.py

            self,
            device_type: str,
-            mesh: Union[torch.Tensor, "ArrayLike"],
+            mesh: Optional[Union[torch.Tensor, "ArrayLike"]] = None,


Can we update the docstring to explain when `mesh` can be None? Even if we preserve it only for internal usage, we should mention it because this is a public API.
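One possible shape for such a docstring is sketched below. `DocumentedMesh` is a hypothetical class, and the statement that `None` is reserved for internal construction paths is an assumption made for this example; the thread does not show the wording that was finally adopted.

```python
from typing import Optional, Sequence


class DocumentedMesh:
    """Hypothetical class used only to illustrate the docstring wording."""

    def __init__(self, device_type: str, mesh: Optional[Sequence[int]] = None) -> None:
        """Create a mesh.

        Args:
            device_type: Device type of the mesh, for example "cpu" or "cuda".
            mesh: Ranks that make up the mesh. Public callers are expected
                to pass this argument. ``None`` is accepted only for
                internal construction paths (assumption for this sketch).
        """
        self.device_type = device_type
        # Keep None as-is so internal callers can fill in the ranks later.
        self.mesh = None if mesh is None else list(mesh)
```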

pytorchmergebot · 2025-10-16T18:30:24Z

Starting merge as part of PR stack under #165556

By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability. Pull Request resolved: #165556 Approved by: https://github.com/fduwjj ghstack dependencies: #165554, #165555

…_mesh_from_ranks (#165555)" This reverts commit 99097b6. Reverted #165555 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see https://hud.pytorch.org/hud/pytorch/pytorch/aba8c43594a83772281a62a7961c0b6ddcff321d/1?per_page=50&name_filter=distributed%2C%201&mergeEphemeralLF=true ([comment](#165554 (comment)))

pytorchmergebot · 2025-10-16T20:41:45Z

@lw your PR has been reverted as part of the stack under #165554.

[ghstack-poisoned]

pytorchmergebot · 2025-10-17T17:51:53Z

Starting merge as part of PR stack under #165556

By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability. Pull Request resolved: #165556 Approved by: https://github.com/fduwjj ghstack dependencies: #165554, #165555

…om_ranks (pytorch#165555) The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API which contains some "legacy" concepts which we'd love to get rid of, such as an explicit/materialized `mesh` Tensor. In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it. This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`. With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary. Pull Request resolved: pytorch#165555 Approved by: https://github.com/fduwjj, https://github.com/fegin ghstack dependencies: pytorch#165554


Update

9018b05

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 15, 2025

This was referenced Oct 15, 2025

[DeviceMesh] Fix layout calculation when flattening non-contiguous dims #165542

Closed

[DeviceMesh] Prefer using _layout over _mesh for all sorts of things #165554

Closed

[DeviceMesh] Simplify unflatten method #165556

Closed

lw commented Oct 15, 2025

torch/distributed/device_mesh.py (outdated)

fduwjj reviewed Oct 15, 2025


lw added the topic: not user facing topic category label Oct 16, 2025

lw added 4 commits October 16, 2025 13:32

Update

eeb7f23

[ghstack-poisoned]

Update

5ce6af6

[ghstack-poisoned]

Update

70c5b0a

[ghstack-poisoned]

Update

09a5dfc

[ghstack-poisoned]

fduwjj approved these changes Oct 16, 2025


Update

ce21ea1

[ghstack-poisoned]

lw added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 16, 2025

fegin approved these changes Oct 16, 2025


pytorchmergebot closed this in 99097b6 Oct 16, 2025

pytorchmergebot added the Merged label Oct 16, 2025

pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Oct 16, 2025

pytorchmergebot reopened this Oct 16, 2025

lw added 3 commits October 17, 2025 09:51

Update

c7acb44

[ghstack-poisoned]

Update

0152552

[ghstack-poisoned]

Update

33839b3

[ghstack-poisoned]

Update

d673d5b

[ghstack-poisoned]

pytorchmergebot closed this in d659bbd Oct 17, 2025

github-actions bot deleted the gh/lw/9/head branch November 17, 2025 02:18

lw wants to merge 10 commits into gh/lw/9/base from gh/lw/9/head


4 participants
