[wip] update graph fuser aliasdb in-place #37106
suo wants to merge 12 commits into gh/suo/306/base
Conversation
Recomputing the aliasdb on every fusion iteration + in every subblock is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So we start with

```
x, y = f(a, b, c)
```

and end with:

```
x_out, y_out = prim::fusionGroup(a, b, c)
  x_in, y_in = f(a_in, b_in, c_in)
  -> x_in, y_in
```

We destroy the `x` and `y` `Value*`s in the process. This operation is easy to express as an update to the aliasDb: `x_out` just takes on all the aliasing information `x` used to have. In particular, since we know `f` and `prim::fusionGroup` are purely functional, we don't have to mess with any write information.

This PR is the bare minimum to get this working, in the interest of unscrewing the compilation times ASAP. After this change, on the baseline introduced in #36345 we go from ~193s to ~5.2s, bringing us to a reasonable range (although there are still substantial improvements that can be made).

Followups I want to do:
- We don't have a way of expressing deletion of values in AliasDb. In `graph_fuser.cpp` we sometimes construct nodes that we end up throwing away, and we are littering `MemoryDAG` with references to dangling pointers. Because of the way the pass works, it's fine, but this is fragile so I want to fix it.
- We should decouple alias analysis from write tracking, to simplify the job of keeping the write caches consistent as we mutate the aliasing information.
- The tensorexpr fuser doesn't do this and is thus incorrect today; we need to update it to work.

Differential Revision: [D21219179](https://our.internmc.facebook.com/intern/diff/D21219179)

[ghstack-poisoned]
I think this code is in the same place as shape analysis, where it's soon to be deleted and it's not on by default anywhere. Could you instead submit a PR speeding up https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/passes/tensorexpr_fuser.cpp if it's warranted?
This pass is still what the existing fusion compiler uses, and thus what people who are trying to fuse code today use, so we need these changes. I will also fix up the texpr fuser too though.
eellison
left a comment
Is there a POC for graph_fuser who can verify some of these changes? If not, I can try to understand all of how the graph fuser works, but I have not done that yet. Or a high-level explanation of why it's correct would help. "f and prim::fusionGroup are purely functional" may be true, but I'm not convinced all inputs and outputs don't have writers.
```diff
-void AliasDb::unsafeGiveFreshAlias(const Value* value) {
+void AliasDb::createValue(const Value* value) {
   TORCH_INTERNAL_ASSERT(isMutableTypeInternal(value->type()));
-  TORCH_INTERNAL_ASSERT(value->type()->containedTypes().size() == 0);
```
Why were these assertions removed ?
Adding contained types stuff to the mutation API wasn't necessary for this PR—but I think it's a good followup/could do it for this one if you feel that it's important.
sure it doesn't need to be done in this PR
```cpp
  return ret;
}

void Lint(const AliasDb* db) {
```
It's worth noting that not all passes maintain the invariants here as they are mutating the graph. For example, constant propagation inserts new Tensor Constants which will not have an element in the memory dag.
```cpp
 * `Lint()`ing the AliasDb can help with this.
 */
// Copy `existing`'s aliasing info to `new_value`, and remove `existing`.
void replaceWithNewValue(Value* existing, Value* new_value);
```
It's a little scary to make this a public API. I would maybe prepend `_` or `__` to all of these methods, or name them internal. Now that they are public, we no longer need `friend struct MutationRemover`.
I put a big warning in the comment, idk. The reason they are named this way is that it's the same as the LLVM alias analysis update API. I suppose we can add a `_` or something, but idk if it will deter a sufficiently motivated downstream user.
I believe we still need to friend MutationRemover because it manipulates write caches as well.
I would maybe put an internal clarification here, since we don't specify anywhere what updating the AliasDb correctly means.
```diff
 // dag so we must regive x an alias db element. We have already verified
 // that the mutated value is a fresh alias with a single use.
-aliasDb_->unsafeGiveFreshAlias(mutated_value);
+aliasDb_->createValue(mutated_value);
```
is this not still unsafe? the invariants for when this is okay are still very specific: the value has to not have contained elements and be unaliased / mutated in the graph.
See above, I wanted to basically copy an existing API so people would know what it did.
```cpp
auto* g = inputs[0]->owningGraph();
auto* input_list =
    g->insertNode(g->createList(TensorType::get(), inputs))->output();
aliasDb_->createValue(input_list);
```
It's a little unfortunate that we're creating a new value with a contained element
```cpp
// Two checks that we want to add but can't until the mutation API is more
// fully developed.
// - Every mutable value in the aliasdb belongs to the graph
// - All container values have contained elements
```
I think we're not checking this bc we break this invariant as part of this PR.
Yeah, the comment notes that it is not yet added pending a more robust mutation API.
The purpose of copying or replacing aliasing information is to preserve the information that some inputs or outputs have writers. Other than that, I tried to provide a high-level description in the PR summary of what graph fusion is doing and why I think the changes are correct. The lint pass ensures that we're not missing the propagation of aliasing information, so the main place where we might be wrong is if we're copying the wrong aliasing information, but our replacements follow
eellison
left a comment
LGTM, this increases complexity but at a great speedup. As you said, the fusion graphs are functional so there shouldn't be too much to worry about here. This is also off by default.
…tes to aliasDb"

Update the alias db in-place instead of constructing it from scratch on each change, which causes O(n^2) behavior. The description from #37106 holds pretty well:

"""
Recomputing the aliasdb on every fusion iteration + in every subblock is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So we start with `x, y = f(a, b, c)` and end with:

```
x_out, y_out = prim::fusionGroup(a, b, c)
x_in, y_in = f(a_in, b_in, c_in)
  -> x_in, y_in
```

We destroy the x and y Value*s in the process. This operation is easy to express as an update to the aliasDb--x_out just takes on all the aliasing information x used to have. In particular, since we know f and prim::fusionGroup are purely functional, we don't have to mess with any write information.
"""

The one difficulty here is that mapping x, y to x_out, y_out is not trivial when merging nodes into the autodiff subgraph node. There are a few options:
- attempt to make all subgraph utils & IR cloning logic update a map
- mirror the subgraph utils implementation in create_autodiff_subgraph
- uniquely name x, y and x_in, y_in so you can back out the correspondence

I went with the third option. This shouldn't affect the results of the pass at all. LMK if you think there's anything else I should be doing to test; I was thinking about maybe exposing an option to run create autodiff subgraphs without the post processor and checking that the alias db was correctly updated.

[ghstack-poisoned]
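The third option above — uniquely naming values before merging so the outer/inner correspondence can be recovered afterwards — can be sketched as follows. The helper names (`unique_name`, `clone_into_subgraph`) are hypothetical stand-ins, not the actual subgraph-utils code:

```python
import itertools

_uid = itertools.count()

def unique_name(base):
    # Tag a value with a process-unique suffix before merging, so that after
    # cloning into the subgraph, the inner clone (x_in) can be matched back
    # to the outer output (x_out) by name alone.
    return f"{base}.uniq{next(_uid)}"

def clone_into_subgraph(values):
    # Stand-in for IR cloning: the clone preserves value names verbatim.
    return list(values)

outer = [unique_name("x"), unique_name("y")]
inner = clone_into_subgraph(outer)

# Back out the correspondence purely from the unique names, with no need
# for the cloning logic itself to maintain a mapping.
correspondence = {o: i for o in outer for i in inner if o == i}
print(len(correspondence))  # → 2
```

This is why the cloning utilities don't need to be taught about the map: uniqueness of the names alone is enough to reconstruct it after the fact.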
Stack from ghstack:
Differential Revision: D21219179
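For intuition on why the in-place update matters: rebuilding the alias db after each of n fusion steps does O(n) work per step, i.e. O(n^2) total, while the in-place update does O(1) work per step. A toy work count (purely illustrative, not a measurement — the ~193s to ~5.2s numbers above are the actual data):

```python
def fuse_with_recompute(n):
    # Rebuild an n-entry alias table after every one of n fusion steps.
    work = 0
    for _ in range(n):
        work += n  # a full recompute touches every value
    return work

def fuse_in_place(n):
    # Update only the replaced value's entry at each step.
    work = 0
    for _ in range(n):
        work += 1
    return work

print(fuse_with_recompute(1000))  # → 1000000
print(fuse_in_place(1000))        # → 1000
```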