
[DTensor] fix redistribute cost crashing on non-participating ranks #172478

Closed
wconstab wants to merge 8 commits into gh/wconstab/500/base from gh/wconstab/500/head

Conversation

@wconstab
Contributor

@wconstab wconstab commented Jan 14, 2026

Stack from ghstack (oldest at bottom):

Previously, ranks not participating in a redistribution would hit an
assert in the redistribution planner checking that the rank was participating.

The assert in question was added recently in #169548 by @aorenste, and I'm not sure whether
patching an early exit in this PR is the best fix or whether the original
assert should be rethought. Also cc @pianpwk for discussion.
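
For context, here is a minimal sketch of the kind of early-exit guard described above, assuming the planner can see the target DeviceMesh on every rank; the function name and surrounding structure are hypothetical, not the actual planner code:

```python
# Hypothetical sketch only: the function name and structure below are
# illustrative, not DTensor's actual redistribution planner API.
from torch.distributed.device_mesh import DeviceMesh


def plan_redistribute(mesh: DeviceMesh, src_placements, dst_placements):
    # DeviceMesh.get_coordinate() returns None on ranks that are not part of
    # the mesh, so a non-participating rank can return early instead of
    # reaching an assert that expects a valid mesh coordinate.
    if mesh.get_coordinate() is None:
        return None  # this rank holds no shard; nothing to plan locally
    # ... cost/plan computation for participating ranks would go here ...
    ...
```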

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jan 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172478

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 29c7796 with merge base b731ffe:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…ing ranks"



Previously, ranks not participating in a redistribution would hit an
assert in the redistribution planner checking that the rank was participating.

The assert in question was added recently in #169548 by aorenste, and I'm not sure whether
patching an early exit in this PR is the best fix or whether the original
assert should be rethought. Also cc pianpwk for discussion.

This PR fixes an error that surfaces in this test:
```python
    @with_comms
    def test_from_local_sub_mesh(self):
        mesh = DeviceMesh(self.device_type, [0, 2])
        local_tensor = torch.ones(3, 4)

        dtensor = DTensor.from_local(local_tensor, mesh, [Shard(0)])
        self.assertEqual(dtensor.size(), torch.Size([6, 4]))

        self.sub_mesh_assert_equal(
            mesh.mesh,
            torch.ones(3, 4),
            torch.tensor([]),
            dtensor.to_local(),
        )

        # test dtensor created in submesh, the operation should only
        # be applied to the local shard inside the mesh, not the whole
        # world, so only 0/2 really run the computation
        dtensor = dtensor + 2

        self.sub_mesh_assert_equal(
            mesh.mesh,
            torch.ones(3, 4) + 2,
            torch.tensor([]),
            dtensor.to_local(),
        )
```
After looking at the test, I am confused about why we support this behavior in the first place. aorenste suggested that maybe we should just make DTensor.from_local error out on ranks that aren't included in the mesh. I am not sure why we want to allow Python code to run on non-participating ranks, go through DTensor dispatch, and return a DTensor object that is effectively defunct.

Claude summarized the behavior:
* we run sharding prop and shape prop on every rank (including non-participating ranks):
| Question                                 | Answer                                                   |
|------------------------------------------|----------------------------------------------------------|
| Value of `dtensor + 2` on excluded ranks | Empty tensor `torch.tensor([])`                          |
| Has global shape?                        | Yes - `dtensor.size()` returns `(6, 4)`                  |
| Has placements?                          | Yes - same as participating ranks                        |
| Runs shape propagation?                  | Yes - output spec is computed, just no local computation |

The design ensures all ranks can query DTensor properties consistently while only participating ranks do actual computation.
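
An illustrative-only snippet of that behavior, mirroring the test above; it assumes a multi-rank job with an initialized process group in which ranks 1 and 3 are left out of the mesh:

```python
# Illustrative only: mirrors test_from_local_sub_mesh above. Assumes a 4-rank
# job with an initialized process group; ranks 1 and 3 are not in the mesh.
import torch
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor import DTensor, Shard

mesh = DeviceMesh("cpu", [0, 2])                  # ranks 1 and 3 are excluded
dtensor = DTensor.from_local(torch.ones(3, 4), mesh, [Shard(0)])

# Metadata is consistent on every rank, including the excluded ones.
assert dtensor.size() == torch.Size([6, 4])
print(dtensor.placements)                         # same placements everywhere

out = dtensor + 2                                 # dispatch runs on all ranks
local = out.to_local()                            # (3, 4) tensor of 3s on ranks 0/2,
                                                  # an empty tensor on ranks 1/3
```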


[ghstack-poisoned]
@wconstab wconstab requested review from pianpwk and zpcore January 26, 2026 04:03
Contributor

@pianpwk pianpwk left a comment


Is there a possibility that shard prop for non-participating ranks might now cache suboptimal redistribute decisions?

@wconstab
Contributor Author

> Is there a possibility that shard prop for non-participating ranks might now cache suboptimal redistribute decisions?

I think the answer should be no: we include the mesh in the cache key, so if we later tried to run the same op including more ranks, that would imply a different mesh and we would not get a cache hit.
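
As a toy illustration of that argument (the dataclass below is hypothetical and is not DTensor's actual sharding-propagation cache), a key that embeds the mesh cannot collide across meshes:

```python
# Toy illustration only: this is not DTensor's real cache; it just shows that
# including the mesh in the key keeps submesh and full-mesh plans separate.
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyShardingCacheKey:
    op_name: str
    mesh_ranks: Tuple[int, ...]   # stands in for the DeviceMesh identity
    placements: Tuple[str, ...]


submesh_key = ToyShardingCacheKey("aten.add.Tensor", (0, 2), ("Shard(0)",))
fullmesh_key = ToyShardingCacheKey("aten.add.Tensor", (0, 1, 2, 3), ("Shard(0)",))

cache = {submesh_key: "plan computed while only ranks 0/2 participated"}

# Running the same op on the full mesh produces a different key, so the
# submesh decision is never reused by mistake.
assert fullmesh_key not in cache
```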

@wconstab
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk label Jan 26, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

riccardofelluga pushed a commit to riccardofelluga/pytorch that referenced this pull request Jan 27, 2026
Pull Request resolved: pytorch#172478
Approved by: https://github.com/pianpwk
@github-actions github-actions Bot deleted the gh/wconstab/500/head branch February 26, 2026 02:23

Labels

ciflow/inductor, ciflow/trunk, Merged, release notes: distributed (dtensor)


3 participants