[DTensor] fix max.dim/min.dim strategy #175776
pianpwk wants to merge 5 commits into gh/pianpwk/102/base from
Conversation
The previous strategy allowed Partial("max"/"min") on values when the
input was sharded on the reduction dim. While Partial is valid for
values, the indices are local to each shard and cannot be combined
across ranks, producing incorrect global indices.
Rewrite as a single_dim_strategy that only allows sharding on
non-reduction dims, forcing Replicate on the reduction dim so both
values and indices are computed correctly.
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175776
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (4 Unrelated Failures)
As of commit feea529 with merge base ea9fce2:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
@register_single_dim_strategy(
    [aten.max.dim, aten.min.dim], schema_info=RuntimeSchemaInfo(1)
)
def max_min_dim_single_dim_strategy(
```
@anshul-si i'm deferring to you on this, i think your PRs do not touch max.dim and min.dim so it is OK to land this first? Not sure if you were planning to work on the ops in _math_ops at some point?
this can be landed first. i was planning on working on the ops in _math_ops after pointwise_ops, but this can be used to help me
```python
                _ShardingPlaceholder(d),
            ]
        )
    return strategies
```
aren't we forgetting some Partial prop from this rule?
I thought so too, but max(P(max)) -> P(max) only holds for the values; the indices come out invalid, so we can't, regardless of the reduction dimension
i see, we definitely can't return partial indices. i missed that indices was a return value despite your comment above. makes sense.
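For readers following the thread, here is a self-contained sketch of the rule's shape, reconstructed around the diff fragments above. Only `_ShardingPlaceholder` and the overall loop/return structure are taken from the diff; the stand-in dataclass, the simplified signature, and the meaning of the list entries are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class _ShardingPlaceholder:
    dim: int  # tensor dim that a single mesh dim shards

def max_min_dim_single_dim_strategy(input_ndim: int, reduction_dim: int):
    # Allow Shard(d) only on non-reduction dims; the reduction dim stays
    # Replicate, so both values and indices are computed exactly.
    strategies = []
    for d in range(input_ndim):
        if d == reduction_dim:
            continue  # Shard(reduction_dim) would yield shard-local indices
        strategies.append(
            [
                _ShardingPlaceholder(d),  # values output
                _ShardingPlaceholder(d),  # indices output
                _ShardingPlaceholder(d),  # input
            ]
        )
    return strategies

# e.g. a rank-3 input reduced over dim 1 may shard only dims 0 and 2:
print(max_min_dim_single_dim_strategy(3, 1))
```

Note there are deliberately no Partial entries anywhere, matching the conclusion above.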
wconstab left a comment:
does the sharding validator run on this op + PR?
@wconstab I'm not sure about the result, but these seem to fall under the reduction_with_dim variant, and I don't see any missing rules for max/min.dim:
wconstab left a comment:
LGTM, i was confused but your impl is right. no partials should be supported for max.dim variant.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed. Reason: 1 job has failed, first few of them are: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m1-14). Details for Dev Infra team: Raised by workflow job
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 2 checks: inductor / unit-test / inductor-test / test (inductor, 2, 2, linux.g5.4xlarge.nvidia.gpu), trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m1-14). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "Looks like introduced some new distributed breakages, see https://hud.pytorch.org/hud/pytorch/pytorch/1b9046a794cd2f8d882adf47d5612407cf43c1d2/1?per_page=50&name_filter=test%20(distr&mergeEphemeralLF=true" -c nosignal
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "I'm not sure what's going on, but it breaks lint this time around, see https://hud.pytorch.org/hud/pytorch/pytorch/7c8edff72bcf501f2fb70a4b3149718a905c2471/1?per_page=50&name_filter=lint&mergeEphemeralLF=true" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 2f0a6bd. Reverted #175776 on behalf of https://github.com/malfet due to I'm not sure what's going on, but it breaks lint this time around, see https://hud.pytorch.org/hud/pytorch/pytorch/7c8edff72bcf501f2fb70a4b3149718a905c2471/1?per_page=50&name_filter=lint&mergeEphemeralLF=true ([comment](#175776 (comment)))
@pianpwk your PR has been successfully reverted.
It's because of a land race with the PR that enables lint for plain assert.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
aten.max/min.dim returns (values, indices), and strategies currently allow S(reduction_dim) -> P(max/min), P(max/min). This is invalid for indices, and we should ban sharding on the reduction dim. Rewrites as a single-dim strategy.
Pull Request resolved: pytorch#175776
Approved by: https://github.com/wconstab
Upstreaming from autoparallel: https://github.com/meta-pytorch/autoparallel/blob/454780d2a27456a380c0d8e997c8fc2cf82ef5d8/autoparallel/shardings/propagation_rules.py#L630
The previous strategy required full-Replicate: we can passthrough on non-padded dims, and allow Partial inputs when pad value = 0 (arguable if we should fix this). Rewritten as a single-dim strategy.
Pull Request resolved: #175656
Approved by: https://github.com/wconstab
ghstack dependencies: #175776
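A single-process sanity check of the "Partial inputs are OK when pad value = 0" claim from that PR, with plain torch tensors standing in for per-rank Partial("sum") contributions (a sketch of the algebra, not the DTensor rule itself):

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0])  # rank 0's partial-sum contribution
b = torch.tensor([3.0, 4.0])  # rank 1's partial-sum contribution

pad = (1, 1)  # constant pad, one element on each side, pad value 0
lhs = F.pad(a, pad) + F.pad(b, pad)  # pad each partial, then all-reduce
rhs = F.pad(a + b, pad)              # all-reduce, then pad
assert torch.equal(lhs, rhs)  # commutes only because the pad value is 0
```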
This reverts commit 07b82e8. Reverted pytorch#175776 on behalf of https://github.com/malfet due to Looks like introduced some new distributed breakages, see https://hud.pytorch.org/hud/pytorch/pytorch/1b9046a794cd2f8d882adf47d5612407cf43c1d2/1?per_page=50&name_filter=test%20(distr&mergeEphemeralLF=true ([comment](pytorch#175776 (comment)))
Stack from ghstack (oldest at bottom):
aten.max/min.dim returns (values, indices), and strategies currently allow S(reduction_dim) -> P(max/min), P(max/min). This is invalid for indices, and we should ban sharding on the reduction dim. Rewrites as a single-dim strategy.