[DTensor] Make Replicate->Partial cost > 0 #172282
wconstab wants to merge 17 commits into gh/wconstab/495/base
Conversation
The cost of doing this conversion is actually nonzero, as it involves
dispatching some operators. Currently the cost differs depending on which
type of Partial is involved, since each defines its own 'partition' function,
but in general it could be a scaling operation.
It's helpful to express this as non-free in the cost model because
otherwise a suboptimal op sharding strategy is likely to be selected on
the basis that converting one partial through replicate to another partial
is just as cheap as staying in replicate.
Before this PR, when multiplying Partial("max") * Replicate, the strategies:
- [Partial(sum), Replicate, Partial(sum)] has cost 22.82 (Pmax -> Replicate -> Psum)
- [Replicate, Replicate, Replicate] has cost 22.82 (Pmax -> Replicate)
And we would select whichever appears first in the strategy list.
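To make the tie-breaking behavior concrete, here is a minimal sketch; `select_strategy`, the dict layout, and `EPS` are illustrative, not DTensor's actual API or cost values. With a free Replicate->Partial, a first-minimum selection over the list picks whichever tied strategy comes first; with a small nonzero cost, the Replicate-only strategy wins:

```python
# Illustrative sketch only: names, dict layout, and EPS are hypothetical,
# not DTensor's real data structures or cost values.

def select_strategy(strategies):
    """Pick the first strategy with the lowest total cost (ties -> list order)."""
    return min(strategies, key=lambda s: s["cost"])  # min() returns the first minimum

# Before: Replicate -> Partial modeled as free, so both strategies tie at 22.82.
before = [
    {"placements": ["Partial(sum)", "Replicate", "Partial(sum)"], "cost": 22.82},
    {"placements": ["Replicate", "Replicate", "Replicate"], "cost": 22.82},
]

# After: a small nonzero Replicate -> Partial cost breaks the tie.
EPS = 0.01  # illustrative value
after = [
    {"placements": ["Partial(sum)", "Replicate", "Partial(sum)"], "cost": 22.82 + 2 * EPS},
    {"placements": ["Replicate", "Replicate", "Replicate"], "cost": 22.82},
]

print(select_strategy(before)["placements"][0])  # Partial(sum) -- first tie wins
print(select_strategy(after)["placements"][0])   # Replicate
```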
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172282
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures as of commit 739240d with merge base 7754b55.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I think modelling the compute cost associated with a given redistribution is a good thing to have. I'm just not sure if we should bundle comms and compute cost in the same place. Wdyt?
@weifengpy had the same reaction, to not bundle them. Do you think it makes sense to model them separately, but then offer a bundled 'total cost' API? I think for DTensor purposes, I just want the total cost. Not sure if we would want to separate the costs for some reason in DTensor too?
I think that if we start adding additional one-off costs (like the extra division that happens in this case), then for consistency we should also model additional copies that might happen - see what we do in autoparallel for an example. The reason to model it more carefully is that the break-even value you are adding probably won't be enough for all use-cases. But then we will have to model the compute cost taking different GPU architectures into account (as they have different bandwidth). And if we start discussing modelling those costs more accurately, we should also model the communication costs for different GPUs / interconnects (as the current cost model hard-codes A100 GPUs). IMO we should improve our cost models across the board, but it might require a bit of discussion about what we want to model, as it can quickly grow in scope.
@fmassa I would be happy to model num copies and memory bandwidth instead of hardcoding a constant. The first question, though: do you want separate comms and compute costs through separate APIs? In autop as well as in my PR, the costs are just summed.
Comm cost >> local compute cost is what I thought; stragglers are the major bottlenecks, not msg size or cpu overhead. For an immediately landable version, I was just proposing using local compute cost to break the tie when comm cost is on par. Modeling local compute & data movement cost sounds totally reasonable, but even with that, I was still thinking comm cost >> local cost, and would prefer using local cost to break the tie.
@weifengpy I agree that generally comm cost >> compute cost. However, for extremely small message sizes this may not be true. I think accurately modeling the compute part and summing the two together gives the redistribute planner the most accurate signal. Are you suggesting that we should not include the compute cost in the cost value, but instead consider compute cost only as a separate 'tie break' step during min-cost strategy selection? I feel like that way is actually more confusing/complex, and I'm not sure it adds value.
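To illustrate the two proposals side by side, here is a hedged sketch (all names and numbers are hypothetical, not DTensor code): (a) minimize a single bundled comm + compute scalar versus (b) minimize comm cost and use compute cost only to break exact comm ties. They agree when comm costs tie, but diverge when a compute saving outweighs a small comm difference:

```python
# Hypothetical sketch -- neither function is DTensor's actual API.

def select_summed(strategies):
    """(a) Minimize one bundled scalar: comm + compute."""
    return min(strategies, key=lambda s: s["comm"] + s["compute"])

def select_tiebreak(strategies):
    """(b) Minimize comm first; compute only breaks exact comm ties."""
    return min(strategies, key=lambda s: (s["comm"], s["compute"]))

# Tied comm cost: both selectors prefer the cheaper-compute strategy.
tied = [
    {"name": "via-partial",    "comm": 22.82, "compute": 0.5},
    {"name": "stay-replicate", "comm": 22.82, "compute": 0.0},
]

# Divergence: a large compute saving vs. a small comm increase.
diverge = [
    {"name": "A", "comm": 10.0, "compute": 5.0},
    {"name": "B", "comm": 10.5, "compute": 0.0},
]

print(select_summed(tied)["name"], select_tiebreak(tied)["name"])        # stay-replicate stay-replicate
print(select_summed(diverge)["name"], select_tiebreak(diverge)["name"])  # B A
```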
Sorry, I should have mentioned my justification. We regressed perf a lot by adding grad clipping (communicating a scalar value) to a recommendation workload. I realized it's the sync point (or stragglers) that matters most, more than msg size. Another example: we never achieved the expected perf gain when reducing the msg size by switching from bf16 to fp8 (50% msg size but <15% perf gain). It's still because of stragglers. That makes me reach the extreme thought that comm cost 0 (no comm) and comm cost 0.01 (comm a scalar value) are fundamentally different, no matter what the local compute cost is.
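One way a cost model can capture "any comm at all is fundamentally different from no comm" is a fixed latency/sync floor added on top of the bandwidth term. A minimal sketch with purely illustrative constants (these are not DTensor's or PyTorch's actual values):

```python
# Illustrative constants -- not the values used by DTensor's cost model.
SYNC_LATENCY_US = 20.0          # fixed cost paid by any collective, however small
BANDWIDTH_B_PER_US = 300_000.0  # assumed ~300 GB/s interconnect, in bytes per microsecond

def comm_cost_us(nbytes):
    """0 for no communication; otherwise a latency floor plus bandwidth time."""
    if nbytes == 0:
        return 0.0
    return SYNC_LATENCY_US + nbytes / BANDWIDTH_B_PER_US

print(comm_cost_us(0))        # 0.0 -- no comm is genuinely free
print(comm_cost_us(4))        # ~20.0 -- a scalar all-reduce pays the latency floor
print(comm_cost_us(1 << 30))  # ~3599 -- a 1 GiB tensor is bandwidth-dominated
```

Under this shape, a 4-byte all-reduce costs almost exactly the floor, so the model never treats "communicate a scalar" as nearly free the way a pure bytes/bandwidth model would.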
I think this is compatible with my framing. IIUC your point is that any comm, even a tiny one, can lead to a cost larger than its bandwidth-time computation would suggest. I agree with this. Overall, the cost model for an operator can include contributions from several sources, depending on which operator it is and how many kernels it calls.
For DTensor, I still like having all of this summed up into one number - think of it as modeling 'redistribute time' - and minimizing over that for strategy selection.
sanketpurandare left a comment:
Makes sense to me. IIUC,
The goal is to avoid selecting strategies that introduce unnecessary placement conversions, especially ones involving Partial, when an equally-good (or better) Replicate-only strategy exists.
The reason being, even if two strategies are equivalent for the current op’s output, introducing a Partial can impose real downstream costs because a later consumer may have to “reduce (finish)” that Partial to satisfy its own valid strategies. This is exactly the kind of “hidden future tax” a local cost model can miss.
Abandoning for now. We don't need the short-term fix because I banned mixed partials, I think. I still want the long-term improvement of more accurate cost models, but that wasn't done in this PR anyway.