[DTensor] Make Replicate->Partial cost > 0 #172282
wconstab wants to merge 17 commits into gh/wconstab/495/base
Conversation
The cost of doing this conversion is actually nonzero, as it involves
dispatching some operators. Currently the cost differs depending on which
type of Partial is involved, since each defines its own 'partition' function,
but in general it could be a scaling operation.
It's helpful to express this as non-free in the cost model because
otherwise a suboptimal op sharding strategy is likely to be selected on
the basis that converting one partial through replicate to another partial
is just as cheap as staying in replicate.
Before this PR, when multiplying Partial("max") * Replicate, the strategies:
- [Partial(sum), Replicate, Partial(sum)] has cost 22.82 (Pmax -> Replicate -> Psum)
- [Replicate, Replicate, Replicate] has cost 22.82 (Pmax -> Replicate)
And we would select whichever appears first in the strategy list.
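To make the tie-breaking behavior concrete, here is a minimal sketch; `select_strategy`, the dict layout, and `EPS` are illustrative, not DTensor's actual API or cost values. With a free Replicate->Partial, a first-minimum selection over the list picks whichever tied strategy comes first; with a small nonzero cost, the Replicate-only strategy wins:

```python
# Illustrative sketch only: names, dict layout, and EPS are hypothetical,
# not DTensor's real data structures or cost values.

def select_strategy(strategies):
    """Pick the first strategy with the lowest total cost (ties -> list order)."""
    return min(strategies, key=lambda s: s["cost"])  # min() returns the first minimum

# Before: Replicate -> Partial modeled as free, so both strategies tie at 22.82.
before = [
    {"placements": ["Partial(sum)", "Replicate", "Partial(sum)"], "cost": 22.82},
    {"placements": ["Replicate", "Replicate", "Replicate"], "cost": 22.82},
]

# After: a small nonzero Replicate -> Partial cost breaks the tie.
EPS = 0.01  # illustrative value
after = [
    {"placements": ["Partial(sum)", "Replicate", "Partial(sum)"], "cost": 22.82 + 2 * EPS},
    {"placements": ["Replicate", "Replicate", "Replicate"], "cost": 22.82},
]

print(select_strategy(before)["placements"][0])  # Partial(sum) -- first tie wins
print(select_strategy(after)["placements"][0])   # Replicate
```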
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172282
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures as of commit 739240d with merge base 7754b55.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I think modelling the compute cost associated with a given redistribution is a good thing to have. I'm just not sure if we should bundle comms and compute cost in the same place. Wdyt?
@weifengpy had the same reaction, to not bundle them. Do you think it makes sense to model them separately, but then offer a bundled 'total cost' API? I think for DTensor purposes, I just want the total cost. Not sure if we would want to separate the costs for some reason in DTensor too?
I think that if we start adding additional one-off costs (like the extra division that happens in this case), then for consistency we should also model additional copies that might happen - see what we do in autoparallel for an example. The reason to model it more carefully is that the break-even value you are adding probably won't be enough for all use-cases. But then we will have to model the compute cost taking different GPU architectures into account (as they have different bandwidth). And if we start discussing modelling those costs more accurately, we should also model the communication costs for different GPUs / interconnects (as the current cost model hard-codes A100 GPUs). IMO we should improve our cost models across the board, but it might require a bit of discussion about what we want to model, as it can quickly grow in scope.
@fmassa I would be happy to model num copies and memory bandwidth instead of hardcoding a constant. The first question, though: do you want separate comms and compute costs through separate APIs? In autop as well as in my PR, the costs are just summed.
Comm cost >> local compute cost is what I thought; stragglers are the major bottlenecks, not msg size or cpu overhead. For an immediately landable version, I was just proposing using local compute cost to break the tie when comm cost is on par. Modeling local compute & data movement cost sounds totally reasonable, but even with that, I was still thinking comm cost >> local cost, and would prefer using local cost to break the tie.
@weifengpy I agree that generally comm cost >> compute cost. However, for extremely small message sizes this may not be true. I think accurately modeling the compute part and summing the two together gives the redistribute planner the most accurate signal. Are you suggesting that we should not include the compute cost in the cost value, but instead consider compute cost only as a separate 'tie break' step during min-cost strategy selection? I feel like that way is actually more confusing/complex, and I'm not sure it adds value.
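To illustrate the two proposals side by side, here is a hedged sketch (all names and numbers are hypothetical, not DTensor code): (a) minimize a single bundled comm + compute scalar versus (b) minimize comm cost and use compute cost only to break exact comm ties. They agree when comm costs tie, but diverge when a compute saving outweighs a small comm difference:

```python
# Hypothetical sketch -- neither function is DTensor's actual API.

def select_summed(strategies):
    """(a) Minimize one bundled scalar: comm + compute."""
    return min(strategies, key=lambda s: s["comm"] + s["compute"])

def select_tiebreak(strategies):
    """(b) Minimize comm first; compute only breaks exact comm ties."""
    return min(strategies, key=lambda s: (s["comm"], s["compute"]))

# Tied comm cost: both selectors prefer the cheaper-compute strategy.
tied = [
    {"name": "via-partial",    "comm": 22.82, "compute": 0.5},
    {"name": "stay-replicate", "comm": 22.82, "compute": 0.0},
]

# Divergence: a large compute saving vs. a small comm increase.
diverge = [
    {"name": "A", "comm": 10.0, "compute": 5.0},
    {"name": "B", "comm": 10.5, "compute": 0.0},
]

print(select_summed(tied)["name"], select_tiebreak(tied)["name"])        # stay-replicate stay-replicate
print(select_summed(diverge)["name"], select_tiebreak(diverge)["name"])  # B A
```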
Sorry, I should have mentioned my justification. We regressed perf a lot by adding grad clipping (communicating a scalar value) to a recommendation workload. I realized it's the sync point (or stragglers) that matters most, more than msg size. Another example: we never achieved the expected perf gain when reducing the msg size by switching from bf16 to fp8 (50% msg size but <15% perf gain). It's still because of stragglers. That makes me reach the extreme thought that comm cost 0 (no comm) and comm cost 0.01 (comm a scalar value) are fundamentally different, no matter what the local compute cost is.
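One way a cost model can capture "any comm at all is fundamentally different from no comm" is a fixed latency/sync floor added on top of the bandwidth term. A minimal sketch with purely illustrative constants (these are not DTensor's or PyTorch's actual values):

```python
# Illustrative constants -- not the values used by DTensor's cost model.
SYNC_LATENCY_US = 20.0          # fixed cost paid by any collective, however small
BANDWIDTH_B_PER_US = 300_000.0  # assumed ~300 GB/s interconnect, in bytes per microsecond

def comm_cost_us(nbytes):
    """0 for no communication; otherwise a latency floor plus bandwidth time."""
    if nbytes == 0:
        return 0.0
    return SYNC_LATENCY_US + nbytes / BANDWIDTH_B_PER_US

print(comm_cost_us(0))        # 0.0 -- no comm is genuinely free
print(comm_cost_us(4))        # ~20.0 -- a scalar all-reduce pays the latency floor
print(comm_cost_us(1 << 30))  # ~3599 -- a 1 GiB tensor is bandwidth-dominated
```

Under this shape, a 4-byte all-reduce costs almost exactly the floor, so the model never treats "communicate a scalar" as nearly free the way a pure bytes/bandwidth model would.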
I think this is compatible with my framing. IIUC your point is that any comm, even a tiny one, can lead to a cost larger than its bandwidth-time computation would suggest. I agree with this. Overall, the cost model for an operator can include contributions from several sources, depending on which operator it is and how many kernels it calls.
For DTensor, I still like having all of this summed up into one number - think of it as modeling 'redistribute time' - and minimizing over that for strategy selection.
sanketpurandare left a comment:
Makes sense to me. IIUC,
The goal is to avoid selecting strategies that introduce unnecessary placement conversions, especially ones involving Partial, when an equally-good (or better) Replicate-only strategy exists.
The reason being, even if two strategies are equivalent for the current op’s output, introducing a Partial can impose real downstream costs because a later consumer may have to “reduce (finish)” that Partial to satisfy its own valid strategies. This is exactly the kind of “hidden future tax” a local cost model can miss.
Abandoning for now. We don't need the short-term fix because I banned mixed partials, I think. I still want the long-term improvement of more accurate cost models, but that wasn't done in this PR anyway.