[jit] speed up alias analysis #36345
Conversation
During compilation, we spend a huge amount of time in alias analysis. This PR does a few things to speed it up:

1. Separate the analysis into two phases: one where we build up the necessary data structures, and one where we service aliasing queries. This allows us to defer building indices/maintaining index consistency until after the "buildup" phase is done.
2. Properly memoize/dynamic-program the memory-location lookups.
3. Done naively, setting wildcards invalidates the above memoization, triggering costly recomputation. So I added a cache-aware `setWildcards`. Sadly that means alias analysis needs to reach into the guts of MemoryDAG, but the speedup is worth it.

Sadly, these changes are coupled for correctness reasons, so they're all here at once.

I used this model (thanks @IlyaOvodov) as a provisional benchmark. You can get it here: https://www.dropbox.com/s/jlyygn6yygj1jkx/yolov3.zip. Unzip and run `python test_timing.py`.

- Baseline at 6bc8ffe: 752.076s
- After optimizing before inlining: 699.593s
- After deferring cache construction: 426.180s
- After cache-aware `setWildcards`: 193.678s

So a nice 75% speedup to overall compilation. There's a lot more to do in other places of the compilation pipeline, though.

Followup to this PR specifically: everything that fans out from the `analyze` call is the "buildup" phase of AliasDb construction. This should be factored into a separate analysis pass to statically distinguish the two phases (right now we just null out state to accomplish the same thing dynamically).
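Point 2 (memoizing the memory-location lookups) can be sketched roughly like this. This is a hypothetical, minimal C++ model of the idea, not the actual `MemoryDAG` code; the struct layout and names are assumptions for illustration, and it assumes the points-to graph is acyclic:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Each element may point to other elements; its memory locations are the
// leaves (elements that point to nothing) reachable from it.
struct Element {
  std::vector<std::size_t> pointsTo;  // indices of elements this one points to
  std::size_t index = 0;
  mutable std::optional<std::set<std::size_t>> cachedMemoryLocations;
};

struct MemoryDAG {
  std::vector<Element> elements;

  // Memoized lookup: each element's result is computed at most once and
  // cached, so repeated queries (and shared sub-DAGs) cost O(1) after the
  // first visit instead of re-running the whole search.
  const std::set<std::size_t>& getMemoryLocations(std::size_t i) const {
    const Element& e = elements[i];
    if (!e.cachedMemoryLocations) {
      std::set<std::size_t> res;
      if (e.pointsTo.empty()) {
        res.insert(e.index);  // a leaf is its own memory location
      } else {
        for (std::size_t next : e.pointsTo) {
          const auto& sub = getMemoryLocations(next);
          res.insert(sub.begin(), sub.end());
        }
      }
      e.cachedMemoryLocations = std::move(res);
    }
    return *e.cachedMemoryLocations;
  }
};
```

The key property is that a cached element returns immediately, so overlapping traversals from different roots never redo each other's work.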
💊 Build failures summary and remediations

As of commit 2931e90 (more details on the Dr. CI page):

❄️ 1 tentatively flaky failure: 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm.
@suo could you look into the failures before I review? I would expect this shouldn't have any behavioral changes, because it's not changing the analysis, just speeding it up (it shouldn't break tests).
@suo, a 75% performance boost is a great success, but an application that takes 3 minutes to start is still almost as useless as one that takes 10 minutes.
@IlyaOvodov: correct, the optimization time on large networks is still way too high. In production environments we typically expect this time to be amortized over the lifetime of model serving, but even then the current state is bad. There are a number of ways to tune the execution strategy. For example, if you don't want optimization and basically want the code to run in interpreter mode, you can set the following:
@suo Wow! What a great undocumented cheat code!
Answered on #36040, as it's more appropriate for discussion.
@suo I'm reviewing now, but could you expand parts 2 and 3 in your summary with more information?

For part 2: how?

For part 3: what is the naive solution and why doesn't it work? What is the new approach, and why does that solve it?
For part 2: the previous BFS implementation wasn't reusing the cached memory-location lookup result, so we would redo the whole search instead of memoizing.

For part 3: the naive solution is to invalidate everything in the transitive "points-from" closure of the wildcard. The new approach is commented in the code, but basically involves updating the cache in place in linear time.
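As a rough illustration of that in-place update (a hypothetical sketch, not the real `setWildcards`; the explicit reverse-edge representation and names here are assumptions): when an element starts pointing at the wildcard, everything in its "points-from" closure gains the wildcard's memory location, so each populated cache can be patched directly instead of being invalidated and recomputed:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

struct Elem {
  std::vector<std::size_t> pointedFrom;  // reverse edges: who points to me
  mutable std::optional<std::set<std::size_t>> cachedMemoryLocations;
};

// Walk the "points-from" closure of `elem` once; every element whose cache is
// already populated gets the wildcard location inserted in place. Elements
// with no cache are left alone; they will compute the right answer lazily.
void setWildcardCacheAware(std::vector<Elem>& elems,
                           std::size_t elem,
                           std::size_t wildcardLoc) {
  std::vector<bool> seen(elems.size(), false);
  std::vector<std::size_t> stack{elem};
  while (!stack.empty()) {
    std::size_t cur = stack.back();
    stack.pop_back();
    if (seen[cur]) continue;
    seen[cur] = true;
    if (elems[cur].cachedMemoryLocations) {
      // Patch the memoized result instead of throwing it away.
      elems[cur].cachedMemoryLocations->insert(wildcardLoc);
    }
    for (std::size_t p : elems[cur].pointedFrom) stack.push_back(p);
  }
}
```

Each element is visited at most once, so the update is linear in the size of the closure rather than triggering a full recomputation of every affected lookup.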
eellison left a comment:
I had a hard time reviewing this code for correctness. There are a lot of new caches, and it's hard to parse out what their lifetimes are and when they are invalidated etc.
Could you please write up a description of how everything works together?
```cpp
// traversing in the direction `dir`. `fn` will be run on each element.
void bfs(BfsDirection dir, MemoryLocations& res) const;

friend class MemoryDAG;
mutable c10::optional<MemoryLocations> cachedMemoryLocations_;
```
This has lost its comment. Maybe also add when it's expected to be set and when it's expected to be `c10::nullopt`.
```cpp
auto contained_elem = memoryDAG_->fromIndex(loc);
// we only register writes on memory locations
if (contained_elem->pointsTo.empty()) {
  writeIndex[write.first].set(contained_elem->index);
```
Why are we calling `.set()` here and `|=` above? Could we just be using `=` instead of `|=` above?
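For context, the semantic difference the question turns on can be shown with `std::bitset` (standing in for the PR's `MemoryLocations` bitset; `accumulateUnion` is a made-up helper): `.set(i)` turns on a single bit, while `|=` unions a whole bit-vector into an accumulator that may already hold bits, which a plain `=` would clobber.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Accumulate alias information into `acc`: union in a whole set of memory
// locations, then mark one additional specific location.
std::bitset<8> accumulateUnion(std::bitset<8> acc,
                               const std::bitset<8>& otherLocs,
                               std::size_t singleLoc) {
  acc |= otherLocs;    // keeps bits already present in acc
  acc.set(singleLoc);  // turns on exactly one bit
  return acc;
}
```

With `=` instead of `|=`, any bits accumulated into the left-hand side by earlier writes would be silently dropped.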
Recomputing the AliasDb on every fusion iteration and in every subblock is hugely expensive. Instead, update it in place when doing fusion. The graph fuser pass operates by pushing nodes into a fusion group. So we start with:

```
x, y = f(a, b, c)
```

and end with:

```
x_out, y_out = prim::fusionGroup(a, b, c)
  x_in, y_in = f(a_in, b, c)
  -> x_in, y_in
```

We destroy the `x` and `y` `Value*`s in the process. This operation is easy to express as an update to the AliasDb: `x_out` just takes on all the aliasing information `x` used to have. In particular, since we know `f` and `prim::fusionGroup` are purely functional, we don't have to touch any write information.

This PR is the bare minimum to get this working, in the interest of unscrewing the compilation times ASAP. After this change, on the benchmark introduced in #36345 we go from ~193s to ~5.2s, bringing us into a reasonable range (although there are still substantial improvements to be made). Followups I want to do:

- We don't have a way of expressing deletion of values in AliasDb. In `graph_fuser.cpp` we sometimes construct nodes that we end up throwing away, and we are littering `MemoryDAG` with references to dangling pointers. Because of the way the pass works, it's fine, but this is fragile, so I want to fix it.
- We should decouple alias analysis from write tracking, to simplify the job of keeping the write caches consistent as we mutate the aliasing information.
- The tensorexpr fuser doesn't do this and is thus incorrect today; we need to update it to work the same way.

Pull Request resolved: #37106
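The alias-info transfer described above ("`x_out` just takes on all the aliasing information `x` used to have") might be sketched like this; a toy model with made-up names, not the actual AliasDb API:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Map each value to the set of values it may alias (a stand-in for the real
// per-value aliasing state in AliasDb).
using AliasSets = std::map<std::string, std::set<std::string>>;

// When fusion replaces value `oldV` with `newV`, the new value inherits the
// old value's alias set wholesale. Since both the original op and
// prim::fusionGroup are purely functional, no write information is touched.
void replaceValue(AliasSets& db, const std::string& oldV, const std::string& newV) {
  auto it = db.find(oldV);
  if (it == db.end()) return;
  db[newV] = std::move(it->second);  // x_out takes on x's aliasing info
  db.erase(it);                      // x no longer exists in the graph
}
```

This is constant work per replaced value, versus rebuilding the whole database on every fusion iteration.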
This is a lot of complexity, but given how much this is on the hot path, I'm approving. Most of the complexity is because of wildcards and how they invalidate the MemoryDAG state. We should investigate a context-sensitive heap analysis so that wildcards are no longer special-cased within the DAG.

Could you please address my comment in `getMemoryLocations` before landing?
```cpp
auto el = dag.fromIndex(index);
if (el->pointsTo.empty()) {
  res.set(index);
```

```cpp
if (e->pointsTo.empty()) {
```
It's a little surprising that we're not using a seen set in our DFS implementation; it may not trigger now, but that's an easy way to get an exponential runtime. Could you add a `getMemoryLocationsImpl` with a seen set?
Not sure what a seen set would do? If we have traversed `e` before, its `cachedMemoryLocations_` will be populated, and thus we will return immediately from the traversal.
Yeah, never mind; `cachedMemoryLocations_` makes sure we don't redo work.
Summary:
Pull Request resolved: pytorch#36345

During compilation, we spend a huge amount of time in alias analysis. This PR does a few things to speed it up.

1. Separate the analysis into two phases: one where we build up the necessary data structures, and the other where we service aliasing queries. This allows us to defer building indices/maintaining index consistency until after the "buildup" phase is done.
2. Properly memoize/dynamic-program the memory location lookups.
3. Done naively, setting wildcards invalidates the above memoization, triggering costly recomputation. So I added a cache-aware `setWildcards`. Sadly that means you need alias analysis to reach into the guts of MemoryDAG, but the speedup is worth it.

Sadly, these changes are kind of coupled for correctness reasons, so they're all here at once.

I used this model (thanks IlyaOvodov) as a provisional benchmark. You can get it here: https://www.dropbox.com/s/jlyygn6yygj1jkx/yolov3.zip. Unzip and run `python test_timing.py`.

Baseline: (752.076s) right before 7488268
After optimizing before inlining: (699.593s)
After deferring cache construction: (426.180s)
After cache-aware `setWildcards`: (193.678s)

So a nice 75% speedup to overall compilation. There's a lot more to do in other places of the compilation pipeline though.

Followup to this PR specifically: Everything that fans out from the `analyze` call is the "buildup" phase of AliasDB construction. This should be factored into a separate analysis pass to statically distinguish the two phases (right now we just null out stuff to accomplish the same thing dynamically).

Test Plan: Imported from OSS

Differential Revision: D20952727

Pulled By: suo

fbshipit-source-id: 099f797222d7e71e5c04991584adc2c7eab5a70f
Stack from ghstack:
During compilation, we spend a huge amount of time in alias analysis.
This PR does a few things to speed it up.
1. Separate the analysis into two phases: one where we build up the
   necessary data structures, and the other where we service aliasing
   queries. This allows us to defer building indices/maintaining index
   consistency until after the "buildup" phase is done.
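The two-phase split described above can be illustrated with a minimal sketch (hypothetical names, not the real AliasDb interface): during buildup we only record raw alias edges and pay no index-maintenance cost; the query index is built exactly once at the phase boundary.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy two-phase alias database. Values are plain ints here for brevity.
class TwoPhaseAliasDb {
 public:
  // Phase 1 ("buildup"): just append edges; no index to keep consistent.
  void addAliasEdge(int from, int to) {
    assert(!frozen_ && "cannot mutate after buildup phase");
    edges_.push_back({from, to});
  }

  // Phase boundary: build the query index exactly once.
  void freeze() {
    for (const auto& e : edges_) {
      index_[e.first].insert(e.second);
      index_[e.second].insert(e.first);
    }
    frozen_ = true;
  }

  // Phase 2: service aliasing queries against the prebuilt index.
  bool mayAlias(int a, int b) const {
    assert(frozen_ && "queries only valid after buildup phase");
    auto it = index_.find(a);
    return it != index_.end() && it->second.count(b) > 0;
  }

 private:
  bool frozen_ = false;
  std::vector<std::pair<int, int>> edges_;
  std::unordered_map<int, std::unordered_set<int>> index_;
};
```

The asserts enforce dynamically what the PR's followup note suggests should eventually be enforced statically: no queries during buildup, no mutation after it.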
2. Properly memoize/dynamic-program the memory location lookups.
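A minimal sketch of that memoization (a hypothetical structure; the real MemoryDAG differs): an element's memory locations are the leaf elements reachable through its points-to edges, and each element's set is computed at most once.

```cpp
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy points-to element: an element with no outgoing edges is itself a
// memory location. The graph is assumed to be a DAG (no cycles).
struct Element {
  int id;
  std::vector<Element*> pointsTo;
};

class MemoryLocations {
 public:
  // Memoized lookup: the naive version re-walks the DAG on every query;
  // here each element's set is computed once and reused.
  const std::unordered_set<int>& getMemoryLocations(Element* e) {
    auto it = cache_.find(e);
    if (it != cache_.end()) {
      return it->second; // memoized result
    }
    std::unordered_set<int> locs;
    if (e->pointsTo.empty()) {
      locs.insert(e->id); // a leaf is its own memory location
    } else {
      for (Element* p : e->pointsTo) {
        const auto& sub = getMemoryLocations(p);
        locs.insert(sub.begin(), sub.end());
      }
    }
    return cache_[e] = std::move(locs);
  }

 private:
  std::unordered_map<Element*, std::unordered_set<int>> cache_;
};
```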
3. Done naively, setting wildcards invalidates the above memoization,
   triggering costly recomputation. So I added a cache-aware
   `setWildcards`. Sadly that means you need alias analysis to reach into
   the guts of MemoryDAG, but the speedup is worth it.
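The cache-aware idea can be sketched as follows (hypothetical names, not the real `setWildcards` implementation): rather than dropping every memoized memory-location set when a value becomes a wildcard, patch the affected cached entries in place.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Toy wildcard handling over a memoized memory-location cache.
class WildcardCache {
 public:
  enum : int { kWildcard = -1 }; // sentinel location meaning "anything"

  void memoize(int value, std::unordered_set<int> locs) {
    cache_[value] = std::move(locs);
  }

  // Naive version: invalidate everything; every later query recomputes.
  void setWildcardNaive(int value) {
    wildcards_.insert(value);
    cache_.clear();
  }

  // Cache-aware version: keep the memoized sets and splice the wildcard
  // location into the entries whose sets mention `value`.
  void setWildcardCacheAware(int value) {
    wildcards_.insert(value);
    for (auto& entry : cache_) {
      if (entry.second.count(value) > 0) {
        entry.second.insert(kWildcard);
      }
    }
  }

  const std::unordered_set<int>* cached(int value) const {
    auto it = cache_.find(value);
    return it == cache_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<int, std::unordered_set<int>> cache_;
  std::unordered_set<int> wildcards_;
};
```

This is also why alias analysis has to reach into the guts of the memoization structure: the in-place patch needs direct access to the cached sets.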
Sadly, these changes are kind of coupled for correctness reasons, so
they're all here at once.
I used this model (thanks @IlyaOvodov) as a provisional benchmark. You
can get it here:
https://www.dropbox.com/s/jlyygn6yygj1jkx/yolov3.zip. Unzip and run
`python test_timing.py`.

Baseline: (752.076s) right before 6bc8ffe
After optimizing before inlining: (699.593s)
After deferring cache construction: (426.180s)
After cache-aware `setWildcards`: (193.678s)

So a nice 75% speedup to overall compilation. There's a lot more to do
in other places of the compilation pipeline though.
Followup to this PR specifically: Everything that fans out from the
`analyze` call is the "buildup" phase of AliasDB construction. This
should be factored into a separate analysis pass to statically
distinguish the two phases (right now we just null out stuff to
accomplish the same thing dynamically).
Differential Revision: D20952727