
[DTensor] single dim fix inplace op expansion #172477

Closed
wconstab wants to merge 5 commits into gh/wconstab/499/base from gh/wconstab/499/head

Conversation

Contributor

@wconstab wconstab commented Jan 14, 2026

Stack from ghstack (oldest at bottom):

This enables the inplace filtering logic that skips strategies with
incompatible input placements.

Previously, inplace ops were able to generate expanded strategies
incompatible with the current input placements, which aren't allowed to
be modified by redistribution.
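The filtering described above can be sketched roughly as follows. This is a simplified illustration, not DTensor's actual implementation: `OpStrategy`, `input_specs`, and the placement strings are hypothetical stand-ins for DTensor's real data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpStrategy:
    # one placement spec per operand, e.g. ("Shard(0)", "Replicate")
    input_specs: tuple


def filter_inplace_strategies(strategies, current_self_spec):
    # An inplace op writes into its first argument, so that argument's
    # placement cannot be changed by redistribution: drop any expanded
    # strategy whose spec for it disagrees with the current placement.
    return [s for s in strategies if s.input_specs[0] == current_self_spec]


expanded = [
    OpStrategy(("Shard(0)", "Replicate")),
    OpStrategy(("Replicate", "Replicate")),
    OpStrategy(("Shard(0)", "Shard(0)")),
]
kept = filter_inplace_strategies(expanded, "Shard(0)")
# keeps only the two strategies whose first input spec is "Shard(0)"
```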

@pytorch-bot

pytorch-bot Bot commented Jan 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172477

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures

As of commit 8b18f74 with merge base 879fd46:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment thread on docs/source/distributed.tensor.md (Outdated)

**Left-to-Right Evaluation Order**

DTensor evaluates Partial placements in **left-to-right order** (i.e., mesh dimension 0 first,
Member


I checked the redistribution code; left-to-right order is not completely correct. The Partial -> Replicate conversion can run in right-to-left order:

```python
# iterating mesh dims in reverse (loop header reconstructed from context)
for mesh_dim in reversed(range(len(current_placements))):
    if current != target:
        transform_infos.append(
            _TransformInfo(
                mesh_dim=mesh_dim,
                src_dst_placements=(current, target),
                logical_shape=mesh_dims_to_logical_shape[mesh_dim],
            )
        )
        current_placements[mesh_dim] = target
```

or left-to-right:

```python
for mesh_dim in range(len(current_placements)):
    if current != target:
        transform_infos.append(
            _TransformInfo(
                mesh_dim=mesh_dim,
                src_dst_placements=(current, target),
                logical_shape=mesh_dims_to_logical_shape[mesh_dim],
            )
        )
        current_placements[mesh_dim] = target
```

But in general, either order should work.
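A quick way to see why order matters once mixed Partial variants are involved: the reduce ops don't commute, so reducing mesh dims in different orders yields different values. Plain Python below stands in for the local shards; this is an illustration, not DTensor code.

```python
# 2x2 grid of toy per-rank contributions: mesh dim 0 (rows) carries
# Partial("sum"), mesh dim 1 (columns) carries Partial("max").
vals = [[1.0, 5.0],
        [4.0, 2.0]]

# Reduce mesh dim 0 first (sum down each column), then mesh dim 1 (max):
left_to_right = max(sum(col) for col in zip(*vals))   # max(5.0, 7.0) -> 7.0

# Reduce mesh dim 1 first (max across each row), then mesh dim 0 (sum):
right_to_left = sum(max(row) for row in vals)         # 5.0 + 4.0 -> 9.0
```

The two results disagree (7.0 vs 9.0), which is exactly why a fixed order must be defined when multiple Partial reduce ops coexist.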

Member


As long as we have Shard in the src placement, we can trigger right-to-left:

```python
# imports added so the snippet is self-contained
import re

import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial, Replicate, Shard
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    with_comms,
)
from torch.utils._debug_mode import DebugMode


class DistributeWithPartialTest(DTensorTestBase):
    @property
    def world_size(self) -> int:
        return 8

    def _extract_redistribute_trace_from_debug_mode(self, s: str) -> str:
        match = re.search(r"trace:\s*(.*)\)", s)
        if match:
            return match.group(1)
        return ""

    @with_comms
    def test_tmp(self):
        mesh = init_device_mesh(self.device_type, (2, 2, 2))
        input_data = torch.randn((8, 8, 8), device=self.device_type)
        dt = DTensor.from_local(
            input_data, mesh, (Partial("sum"), Shard(0), Partial("max"))
        )
        with DebugMode(record_torchfunction=False) as debug_mode:
            dt2 = dt.redistribute(mesh, (Replicate(), Replicate(), Replicate()))
        trace_str = self._extract_redistribute_trace_from_debug_mode(
            debug_mode.debug_string()
        )
        print(trace_str)
```

The redistribution path is: P(sum)S(0)P(max) -> P(sum)S(0)R -> P(sum)RR -> RRR. This made me think there is a bug in the greedy redistribution algorithm when handling Partial variants. We can't arbitrarily switch between left-to-right and right-to-left order; I think right-to-left is the correct order for handling Partial -> Replicate.

Contributor Author


That sounds good to me. However, don't move too fast on it: we are discussing simply banning mixed partial types from coexisting in the same placement, in which case we wouldn't need to define an order at all.

Contributor Author


Oh, this PR was not even supposed to include the changes to partial order; it was a rebasing mistake. I will remove that code. I have another PR up now anyway that partially bans mixed partials. If people are OK with that direction, I'll extend it and ensure we don't allow mixed partials at any layer of the DTensor stack; then this ordering question becomes moot.

Member


Yes, "banning mixed Partial" sounds good to me!

```python
if isinstance(placement, Partial):
    src_partial_ops[mesh_dim] = placement.reduce_op

if len(src_partial_ops) > 1:
```
Member


Aha, I see. I left the comment in the README file too early. This PR enforces left-to-right order when multiple Partial variants exist.


```python
# If some partials are being reduced while others are kept,
# we need to reduce ALL partials first, then re-partition
if dst_partial_dims and dst_partial_dims != src_partial_dims:
```
Member


If the condition `dst_partial_dims and dst_partial_dims != src_partial_dims` doesn't hold, we fall back to the default greedy code to handle Partial, which can run either left-to-right or in reverse. We also need to enforce the order there.
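The two-phase rule quoted above can be sketched as follows. Everything here is a hypothetical illustration: `plan_partial_transition` and the `("reduce", dim)` / `("partition", dim)` step tuples are invented names, not DTensor's API; only the condition itself comes from the PR.

```python
def plan_partial_transition(src_partial_dims, dst_partial_dims):
    """Toy planner for moving between partial placements on a mesh.

    Mirrors the quoted rule: if the destination keeps some partial dims
    but they differ from the source's, reduce ALL source partials first,
    then re-partition the destination dims.
    """
    if dst_partial_dims and dst_partial_dims != src_partial_dims:
        # phase 1: clear every source partial; phase 2: recreate dst partials
        return [("reduce", d) for d in sorted(src_partial_dims)] + [
            ("partition", d) for d in sorted(dst_partial_dims)
        ]
    # otherwise only reduce the partials that must go away
    return [("reduce", d) for d in sorted(src_partial_dims - dst_partial_dims)]


# e.g. src has partials on mesh dims {0, 2}, dst keeps only dim 2:
plan = plan_partial_transition({0, 2}, {2})
# -> [("reduce", 0), ("reduce", 2), ("partition", 2)]
```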

Member

@zpcore zpcore left a comment


LGTM!

SergeyTyshkevich pushed a commit to SergeyTyshkevich/chart2 that referenced this pull request Jan 19, 2026
ghstack-source-id: 0a32d7e
Pull Request resolved: pytorch/pytorch#172477
SergeyTyshkevich pushed a commit to SergeyTyshkevich/chart2 that referenced this pull request Jan 19, 2026
ghstack-source-id: a1532d7
Pull Request resolved: pytorch/pytorch#172477
@wconstab
Contributor Author

@pytorchbot merge

@pytorch-bot Bot added the ciflow/trunk label Jan 20, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@wconstab
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 4, 6, linux.rocm.gpu.gfx942.1)

Details for Dev Infra team: raised by workflow job

@wconstab
Contributor Author

@pytorchbot merge -i

suncapitalllc007-star pushed a commit to suncapitalllc007-star/pytorch that referenced this pull request Jan 25, 2026
ghstack-source-id: 7f3bdea
Pull Request resolved: pytorch/pytorch#172477
suncapitalllc007-star pushed a commit to suncapitalllc007-star/pytorch that referenced this pull request Jan 25, 2026
ghstack-source-id: 1a85756
Pull Request resolved: pytorch/pytorch#172477
@github-actions Bot deleted the gh/wconstab/499/head branch February 20, 2026 02:22

Labels

ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (dtensor)


3 participants