Skip to content

[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks#165555

Closed
lw wants to merge 10 commits intogh/lw/9/basefrom
gh/lw/9/head
Closed

[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks#165555
lw wants to merge 10 commits intogh/lw/9/basefrom
gh/lw/9/head

Conversation

@lw
Copy link
Contributor

@lw lw commented Oct 15, 2025

Stack from ghstack (oldest at bottom):

The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API which contains some "legacy" concepts which we'd love to get rid of, such as an explicit/materialized mesh Tensor.

In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.

This new private constructor basically only takes _layout, _global_rank_permutation, and mesh_dim_names.

With such a constructor we can effectively simplify a lot of callsites and get rid of the _create_mesh_from_ranks helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165555

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d673d5b with merge base 5d4da26 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside the init override part, the rest looks good to me

@lw lw added the topic: not user facing topic category label Oct 16, 2025
lw added 4 commits October 16, 2025 13:32
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

[ghstack-poisoned]
@lw lw added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 16, 2025
self,
device_type: str,
mesh: Union[torch.Tensor, "ArrayLike"],
mesh: Optional[Union[torch.Tensor, "ArrayLike"]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the docstring to explain when mesh can be None? Even if we just preserve it for the internal usage, we should mention because this is a public API.

@pytorchmergebot
Copy link
Collaborator

Starting merge as part of PR stack under #165556

pytorchmergebot pushed a commit that referenced this pull request Oct 16, 2025
By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.

Pull Request resolved: #165556
Approved by: https://github.com/fduwjj
ghstack dependencies: #165554, #165555
@pytorchmergebot
Copy link
Collaborator

@lw your PR has been reverted as part of the stack under #165554.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Oct 16, 2025
lw added 3 commits October 17, 2025 09:51
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@pytorchmergebot
Copy link
Collaborator

Starting merge as part of PR stack under #165556

pytorchmergebot pushed a commit that referenced this pull request Oct 17, 2025
By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.

Pull Request resolved: #165556
Approved by: https://github.com/fduwjj
ghstack dependencies: #165554, #165555
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…om_ranks (pytorch#165555)

The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API which contains some "legacy" concepts which we'd love to get rid of, such as an explicit/materialized `mesh` Tensor.

In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.

This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`.

With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.

Pull Request resolved: pytorch#165555
Approved by: https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: pytorch#165554
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.

Pull Request resolved: pytorch#165556
Approved by: https://github.com/fduwjj
ghstack dependencies: pytorch#165554, pytorch#165555
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…om_ranks (pytorch#165555)

The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API which contains some "legacy" concepts which we'd love to get rid of, such as an explicit/materialized `mesh` Tensor.

In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.

This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`.

With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.

Pull Request resolved: pytorch#165555
Approved by: https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: pytorch#165554
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.

Pull Request resolved: pytorch#165556
Approved by: https://github.com/fduwjj
ghstack dependencies: pytorch#165554, pytorch#165555
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
…om_ranks (pytorch#165555)

The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API which contains some "legacy" concepts which we'd love to get rid of, such as an explicit/materialized `mesh` Tensor.

In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.

This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`.

With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.

Pull Request resolved: pytorch#165555
Approved by: https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: pytorch#165554
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.

Pull Request resolved: pytorch#165556
Approved by: https://github.com/fduwjj
ghstack dependencies: pytorch#165554, pytorch#165555
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
…om_ranks (pytorch#165555)

The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API which contains some "legacy" concepts which we'd love to get rid of, such as an explicit/materialized `mesh` Tensor.

In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.

This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`.

With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.

Pull Request resolved: pytorch#165555
Approved by: https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: pytorch#165554
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.

Pull Request resolved: pytorch#165556
Approved by: https://github.com/fduwjj
ghstack dependencies: pytorch#165554, pytorch#165555
@github-actions github-actions bot deleted the gh/lw/9/head branch November 17, 2025 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-no-td Do not run TD on this PR ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue Reverted topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants