
[jit] speed up alias analysis #36345

Closed
suo wants to merge 10 commits into gh/suo/304/base from gh/suo/304/head

Conversation

@suo
Member

@suo suo commented Apr 9, 2020

Stack from ghstack:

During compilation, we spend a huge amount of time in alias analysis.
This PR does a few things to speed it up.

  1. Separate the analysis into two phases: one where we build up the
    necessary data structures, and the other where we service aliasing
    queries. This allows us to defer building indices/maintaining index
    consistency until after the "buildup" phase is done.

  2. Properly memoize/dynamic-program the memory-location lookups.

  3. Done naively, setting wildcards invalidates the above memoization,
    triggering costly recomputation. So I added a cache-aware `setWildcards`.
    Sadly that means you need alias analysis to reach into the guts of
    `MemoryDAG`, but the speedup is worth it.

Sadly, these changes are kind of coupled for correctness reasons, so
they're all here at once.

I used this model (thanks @IlyaOvodov) as a provisional benchmark. You
can get it here:
https://www.dropbox.com/s/jlyygn6yygj1jkx/yolov3.zip. Unzip and run
`python test_timing.py`.

Baseline: (752.076s) right before 6bc8ffe
After optimizing before inlining: (699.593s)
After deferring cache construction: (426.180s)
After cache-aware setWildcards: (193.678s)

So a nice 75% speedup to overall compilation. There's a lot more to do
in other places of the compilation pipeline though.

Followup to this PR specifically: Everything that fans out from the
analyze call is the "buildup" phase of AliasDB construction. This
should be factored into a separate analysis pass to statically
distinguish the two phases (right now we just null out stuff to
accomplish the same thing dynamically).

Differential Revision: D20952727
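
The buildup/query split in point 1 can be sketched with a toy model (hypothetical Python, not the actual C++ AliasDb; all names here are made up for illustration). Mutations during buildup just record edges, and the memo tables are built lazily at the first query:

```python
class AliasDB:
    """Toy two-phase alias database: mutations during buildup are cheap,
    and memo tables exist only once buildup is finished."""

    def __init__(self):
        self.edges = {}      # value -> set of values it may point to
        self.frozen = False  # becomes True once the buildup phase ends
        self._memo = None    # value -> frozenset of memory locations

    def add_points_to(self, a, b):
        # Buildup phase: just record the edge; no index maintenance,
        # no cache invalidation.
        assert not self.frozen, "buildup phase is over"
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set())

    def memory_locations(self, v):
        # Query phase: the first query freezes the DB, then every
        # lookup is memoized.
        if not self.frozen:
            self.frozen = True
            self._memo = {}
        if v not in self._memo:
            targets = self.edges.get(v, set())
            if not targets:
                # An element that points to nothing is its own location.
                locs = frozenset([v])
            else:
                locs = frozenset().union(
                    *(self.memory_locations(t) for t in targets))
            self._memo[v] = locs
        return self._memo[v]
```

For example, with edges `x -> a` and `y -> a`, both `x` and `y` resolve to the single memory location `a`, and any mutation after the first query trips the buildup assertion.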

@suo suo requested a review from apaszke as a code owner April 9, 2020 21:39
suo added a commit that referenced this pull request Apr 9, 2020
ghstack-source-id: 65205e3
Pull Request resolved: #36345
@facebook-github-bot facebook-github-bot added the `oncall: jit` label (add this issue/PR to JIT oncall triage queue) Apr 9, 2020
@suo suo requested a review from eellison April 9, 2020 21:41
@dr-ci

dr-ci Bot commented Apr 9, 2020

💊 Build failures summary and remediations

As of commit 2931e90 (more details on the Dr. CI page):



❄️ 1 tentatively flaky failure

1 failure tentatively classified as flaky but reruns have not yet been triggered to confirm:

See CircleCI build caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build (1/1)

Step: "Build" ❄️

Apr 30 17:35:26 Failed to recurse into submodule path 'third_party/onnx-tensorrt'
sys	0m0.053s 
Apr 30 17:34:46 ++ export BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build 
Apr 30 17:34:46 ++ BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build 
Apr 30 17:34:46 ++ git submodule sync 
Apr 30 17:34:47 ++ git submodule update -q --init --recursive 
Apr 30 17:35:26 error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function. 
Apr 30 17:35:26 fatal: The remote end hung up unexpectedly 
Apr 30 17:35:26 fatal: early EOF 
Apr 30 17:35:26 fatal: index-pack failed 
Apr 30 17:35:26 fatal: clone of 'https://github.com/onnx/onnx.git' into submodule path 'third_party/onnx' failed 
Apr 30 17:35:26 Failed to recurse into submodule path 'third_party/onnx-tensorrt' 

Extra GitHub checks: 5 failed


This comment was automatically generated by Dr. CI.

@eellison
Contributor

@suo could you look into the failures before I review? I would expect this shouldn't have any behavioral changes, since it's not changing the analysis, just speeding it up (it shouldn't break tests).

@IlyaOvodov

@suo, a 75% performance boost is a great success, but an application that takes 3 minutes to start is still almost as useless as one that takes 10.
As far as I understand from reading issues and forums, this huge time is spent on optimization (which gives only a few percent of performance gain and is repeated every time the input shape changes), compared with pure-Python PyTorch, which starts the same net in less than 1 second.
If I'm right, perhaps a solution is to provide a way to switch this optimization off when the time it takes becomes unacceptable? Sorry if the reality is more complex than I expect :)

@suo
Member Author

suo commented Apr 15, 2020

@IlyaOvodov: correct, the optimization time on large networks is still way too high. In production environments, we typically expect this time to be amortized over the lifetime of model serving, but even then the current state is bad.

There are a number of ways to tune the execution strategy. For example, if you don't want optimization and basically want the code to run in interpreter mode, you can set the following:

```python
with torch.jit.optimized_execution(False):
    my_model(inputs)
```

@IlyaOvodov

@suo Wow! What a great undocumented cheat code!
But is there a way to do the same (turn it off) in C++ inference?

@suo
Member Author

suo commented Apr 15, 2020

But is there a way to do the same (turn it off) in C++ inference?

Answered on #36040, as it's more appropriate for discussion there.

suo added a commit that referenced this pull request Apr 15, 2020
ghstack-source-id: db45d28
Pull Request resolved: #36345
@eellison
Contributor

eellison commented Apr 16, 2020

@suo I'm reviewing now, but could you expand parts 2 and 3 in your summary with more information?

Properly memoize/dynamic program the memory locations lookups.

How?

Done naively, setting wildcards invalidates the above memoization,
trigger costly recomputation. So I added a cache-aware setWildcards.
Sadly that means you need alias analysis to reach into the guts of
memorydag, but the speedup is worth it.

What is the naive solution and why doesn't it work? What is the new approach and why does it solve the problem?

@suo
Member Author

suo commented Apr 17, 2020

@suo i'm reviewing now but could you expand parts 2 and 3 in your summary with more information?

Properly memoize/dynamic program the memory locations lookups.

The previous BFS implementation wasn't reusing the cached memory-location lookup result, so we would redo the whole search instead of memoizing.
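
A toy illustration of that fix (hypothetical Python model; the real code is C++, though `cachedMemoryLocations_` is the actual field name): each element caches its memory-location set, so a traversal that reaches an already-analyzed element returns immediately instead of re-walking its whole subgraph.

```python
class Element:
    def __init__(self, name):
        self.name = name
        self.points_to = []
        self.cached_memory_locations = None  # analogous to cachedMemoryLocations_

visits = 0  # counts how many elements are actually traversed

def memory_locations(el):
    global visits
    if el.cached_memory_locations is not None:
        return el.cached_memory_locations   # memo hit: no re-traversal
    visits += 1
    if not el.points_to:
        locs = {el.name}                    # a leaf is its own memory location
    else:
        locs = set()
        for t in el.points_to:
            locs |= memory_locations(t)
    el.cached_memory_locations = locs
    return locs

# A diamond: x -> a -> c and x -> b -> c. Without the cache, c's subtree
# would be walked twice; with it, every element is visited exactly once.
c = Element("c"); a = Element("a"); b = Element("b"); x = Element("x")
a.points_to = [c]; b.points_to = [c]; x.points_to = [a, b]
assert memory_locations(x) == {"c"}
assert visits == 4   # x, a, b, c: each visited once
```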

Done naively, setting wildcards invalidates the above memoization,
trigger costly recomputation. So I added a cache-aware setWildcards.
Sadly that means you need alias analysis to reach into the guts of
memorydag, but the speedup is worth it.

what is the naive solution and why doesn't it work, what is the new approach why does that solve it

The naive solution is to invalidate everything in the transitive "points-from" closure of the wildcard. Since `setWildcard` only affects memory locations (which are at the "source" of the points-from graph), it has the effect of invalidating basically the entire cache.

The new approach is commented in the code, but basically involves updating the cache in place in linear time.
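
The contrast can be sketched with a toy model (data structures and names hypothetical; the real implementation lives in `MemoryDAG`): the naive update throws away every affected memo, while the cache-aware one splices the wildcard location into each affected memo in a single linear pass.

```python
from dataclasses import dataclass

WILDCARD = "*"

@dataclass
class Elem:
    name: str
    cached: set  # memoized memory-location set (None = must recompute)

def set_wildcard_naive(elements, target):
    # Naive: any memo that mentions one of target's memory locations is
    # dropped wholesale and must later be recomputed from scratch.
    affected = set(target.cached)
    for e in elements:
        if e.cached is not None and e.cached & affected:
            e.cached = None

def set_wildcard_cache_aware(elements, target):
    # Cache-aware: the same memos just gain the wildcard location,
    # updated in place in one linear pass; nothing is invalidated.
    affected = set(target.cached)
    for e in elements:
        if e.cached is not None and e.cached & affected:
            e.cached |= {WILDCARD}

# x and y both resolve to memory location "a"; z does not, so its
# cache survives untouched.
a, x, y, z = Elem("a", {"a"}), Elem("x", {"a"}), Elem("y", {"a"}), Elem("z", {"b"})
set_wildcard_cache_aware([a, x, y, z], a)
assert x.cached == {"a", WILDCARD} and z.cached == {"b"}
```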

Contributor

@eellison eellison left a comment


I had a hard time reviewing this code for correctness. There are a lot of new caches, and it's hard to parse out what their lifetimes are and when they are invalidated etc.

Could you please write up a description of how everything works together?

Comment thread torch/csrc/jit/passes/utils/memory_dag.h Outdated
```cpp
// traversing in the direction `dir`. `fn` will be run on each element.
void bfs(BfsDirection dir, MemoryLocations& res) const;
friend class MemoryDAG;
mutable c10::optional<MemoryLocations> cachedMemoryLocations_;
```
Contributor


this has lost its comment. maybe also add when it's expected to be set and when it's expected to be c10::nullopt

Comment thread torch/csrc/jit/ir/alias_analysis.h Outdated
Comment thread torch/csrc/jit/passes/utils/memory_dag.h Outdated
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
```cpp
auto contained_elem = memoryDAG_->fromIndex(loc);
// we only register writes on memory locations
if (contained_elem->pointsTo.empty()) {
  writeIndex[write.first].set(contained_elem->index);
```
Contributor


Why are we calling `.set()` here and `|=` above? Could we just be using `=` instead of `|=` above?

Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/passes/utils/memory_dag.cpp
suo added 2 commits April 22, 2020 15:46
@suo suo mentioned this pull request Apr 24, 2020
suo added a commit that referenced this pull request Apr 24, 2020
Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So
we start with
```
x, y = f(a, b, c)
```

and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
   x_in, y_in = f(a_in, b, c)
   -> x_in, y_in
```

We destroy the `x` and `y` `Value*`s in the process. This operation is
easy to express as an update to the AliasDb: `x_out` just takes on all
the aliasing information `x` used to have. In particular, since we know
`f` and `prim::fusionGroup` are purely functional, we don't have to mess
with any write information.

This PR is the bare minimum to get this working, in the interest of
unscrewing the compilation times ASAP.

After this change, on the baseline introduced in
#36345 we go from ~193s to ~5.2s,
bringing us to a reasonable range (although there are still substantial
improvements that can be made).

Followups I want to do:
- We don't have a way of expressing deletion of values in AliasDb. In
`graph_fuser.cpp` we sometimes construct nodes that we end up throwing
away, and we are littering `MemoryDAG` with references to dangling
pointers. Because of the way the pass works, it's fine, but this is
fragile so I want to fix it.
- We should decouple alias analysis from write tracking, to simplify the
job of keeping the write caches consistent as we mutate the aliasing
information.
- The tensorexpr fuser doesn't do this and is thus incorrect today; we
need to update it to work the same way.

ghstack-source-id: e9b36de
Pull Request resolved: #37106
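
The in-place update this commit message describes can be sketched with a toy dict-based model (hypothetical; the real AliasDb API differs): the fusion-group output inherits the destroyed value's alias set, and stale mentions of the old value are rewritten, so the database is never recomputed from scratch.

```python
def replace_output(alias_sets, old, new):
    """Transfer `old`'s aliasing info to `new` in place.

    `alias_sets` maps each value name to the set of values it may
    alias. `new` takes over `old`'s entry, and every other entry that
    mentioned `old` is rewritten to mention `new` instead.
    """
    alias_sets[new] = alias_sets.pop(old)
    for s in alias_sets.values():
        if old in s:
            s.discard(old)
            s.add(new)

# x is replaced by the fusion output x_out; y, which aliased x, now
# aliases x_out, and no global recomputation is needed.
alias_sets = {"x": {"a"}, "y": {"x"}, "a": set()}
replace_output(alias_sets, "x", "x_out")
assert alias_sets["x_out"] == {"a"}
assert alias_sets["y"] == {"x_out"}
```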
suo added a commit that referenced this pull request Apr 24, 2020
ghstack-source-id: 268558c
Pull Request resolved: #37106
suo added a commit that referenced this pull request Apr 24, 2020
ghstack-source-id: 74f3839
Pull Request resolved: #37106
Contributor

@eellison eellison left a comment


This is a lot of complexity, but given how much this is on the hot path I'm approving. Most of the complexity is because of wildcards and how they invalidate the MemoryDAG state. We should investigate a context-sensitive heap analysis so that wildcards are no longer special-cased within the dag.

Could you please address my comment in `getMemoryLocations` before landing?

Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
Comment thread torch/csrc/jit/passes/utils/memory_dag.cpp
```cpp
auto el = dag.fromIndex(index);
if (el->pointsTo.empty()) {
  res.set(index);
if (e->pointsTo.empty()) {
```
Contributor:

It's a little surprising that we're not using a seen set in our DFS implementation; it may not trigger now, but that's an easy way to get exponential runtime. Could you add a `getMemoryLocationsImpl` with a seen set?

suo (Member Author):

Not sure what a seen set would do? If we have traversed e before, its cachedMemoryLocations_ will be populated and thus we will return immediately from the traversal.

Contributor:

Yeah, never mind: `cachedMemoryLocations_` makes sure we don't redo work.
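To see why the memoization cache also serves as a visited set, here is a small Python sketch (purely illustrative; it is not the actual C++ `MemoryDAG` code, and all names in it are hypothetical). On a "diamond" points-to DAG, a naive DFS explores exponentially many paths, while the cache bounds the work to one expansion per element:

```python
def build_diamond(depth):
    # Each element (i, j) points to both elements on the next level,
    # so the number of root-to-leaf paths is 2**depth.
    points_to = {}
    for i in range(depth):
        for j in (0, 1):
            points_to[(i, j)] = {(i + 1, 0), (i + 1, 1)}
    points_to[(depth, 0)] = set()  # leaves represent real memory locations
    points_to[(depth, 1)] = set()
    return points_to

def memory_locations(points_to, elem, cache, stats):
    # Cache hit: this element was fully expanded before, so return at once.
    if elem in cache:
        return cache[elem]
    stats["expansions"] += 1
    targets = points_to[elem]
    if not targets:
        locs = frozenset({elem})  # an element with no pointees is its own location
    else:
        locs = frozenset().union(
            *(memory_locations(points_to, t, cache, stats) for t in targets)
        )
    cache[elem] = locs
    return locs

points_to = build_diamond(20)
cache, stats = {}, {"expansions": 0}
locs = memory_locations(points_to, (0, 0), cache, stats)
# Only the two leaves are memory locations.
assert locs == frozenset({(20, 0), (20, 1)})
# One expansion per reachable element (41 of them), not 2**20 paths.
assert stats["expansions"] == 41
```

A plain seen set would give the same asymptotics here; the cache just additionally stores the computed result for reuse across queries.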

During compilation, we spend a huge amount of time in alias analysis.
This PR does a few things to speed it up.

1. Separate the analysis into two phases: one where we build up the
necessary data structures, and the other where we service aliasing
queries. This allows us to defer building indices/maintaining index
consistency until after the "buildup" phase is done.

2. Properly memoize (dynamic programming) the memory-location lookups.

3. Done naively, setting wildcards invalidates the above memoization,
triggering costly recomputation. So I added a cache-aware `setWildcards`.
Sadly that means alias analysis needs to reach into the guts of
MemoryDAG, but the speedup is worth it.

Sadly, these changes are kind of coupled for correctness reasons, so
they're all here at once.
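The buildup/query split in point 1 can be sketched in Python (a hypothetical toy model, not the actual PyTorch API; `MemoryDAGBuilder`, `freeze`, and `may_alias` are illustrative names): during buildup we only record points-to edges, and only the frozen query object maintains a memoization cache.

```python
class MemoryDAGBuilder:
    """Buildup phase: just record points-to edges; no caches to keep consistent."""

    def __init__(self):
        self.points_to = {}

    def add_element(self, elem):
        self.points_to.setdefault(elem, set())

    def make_pointer(self, src, dest):
        # Record that src may point to dest; no index maintenance here.
        self.add_element(src)
        self.add_element(dest)
        self.points_to[src].add(dest)

    def freeze(self):
        """End of buildup: hand the edges to the query-phase object."""
        return MemoryDAG(self.points_to)


class MemoryDAG:
    """Query phase: answers aliasing queries with memoized lookups."""

    def __init__(self, points_to):
        self.points_to = points_to
        self._cache = {}

    def memory_locations(self, elem):
        if elem not in self._cache:
            targets = self.points_to[elem]
            if not targets:
                locs = frozenset({elem})
            else:
                locs = frozenset().union(
                    *(self.memory_locations(t) for t in targets)
                )
            self._cache[elem] = locs
        return self._cache[elem]

    def may_alias(self, a, b):
        # Two values may alias iff their memory locations intersect.
        return bool(self.memory_locations(a) & self.memory_locations(b))


builder = MemoryDAGBuilder()
builder.make_pointer("x", "storage1")
builder.make_pointer("y", "storage1")
builder.make_pointer("z", "storage2")
dag = builder.freeze()
assert dag.may_alias("x", "y")
assert not dag.may_alias("x", "z")
```

In this model, point 3 is the hard part: anything that rewires `points_to` after `freeze()` (like setting a wildcard) must either blow away `_cache` or update it surgically, which is what the cache-aware `setWildcards` does.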

I used this model (thanks @IlyaOvodov) as a provisional benchmark. You
can get it here:
https://www.dropbox.com/s/jlyygn6yygj1jkx/yolov3.zip. Unzip it and run
`python test_timing.py`.

Baseline: (752.076s) right before 6bc8ffe
After optimizing before inlining: (699.593s)
After deferring cache construction: (426.180s)
After cache-aware `setWildcards`: (193.678s)

So a nice 75% speedup to overall compilation. There's a lot more to do
in other places of the compilation pipeline though.

Followup to this PR specifically: everything that fans out from the
`analyze` call is the "buildup" phase of AliasDB construction. This
should be factored into a separate analysis pass to statically
distinguish the two phases (right now we just null out stuff to
accomplish the same thing dynamically).

Differential Revision: [D20952727](https://our.internmc.facebook.com/intern/diff/D20952727)

[ghstack-poisoned]
suo added a commit that referenced this pull request Apr 30, 2020
Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So
we start with
```
x, y = f(a, b, c)
```

and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
   x_in, y_in = f(a_in, b, c)
   -> x_in, y_in
```

We destroy the `x` and `y` `Value*`s in the process. This operation is
easy to express as an update to the aliasDb--`x_out` just takes on all
the aliasing information `x` used to have. In particular, since we know
`f` and `prim::fusionGroup` are purely functional, we don't have to mess
with any write information.
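In a toy Python model of the points-to map (purely illustrative; `alias_info` and `transfer_alias_info` are hypothetical stand-ins for the real MemoryDAG, not PyTorch APIs), the in-place update amounts to transferring the destroyed output's entry to the fusion group's new output:

```python
def transfer_alias_info(alias_info, old_value, new_value):
    # new_value takes on all the aliasing information old_value had;
    # no global re-analysis of the graph is needed.
    alias_info[new_value] = alias_info.pop(old_value)

# x and y each alias some memory locations before fusion.
alias_info = {"x": {"loc_a"}, "y": {"loc_b"}, "c": {"loc_c"}}

# Fusion replaces x, y with x_out, y_out; since f and prim::fusionGroup
# are purely functional, write sets are left untouched.
transfer_alias_info(alias_info, "x", "x_out")
transfer_alias_info(alias_info, "y", "y_out")

assert alias_info == {"x_out": {"loc_a"}, "y_out": {"loc_b"}, "c": {"loc_c"}}
```

This constant-time transfer is why updating the AliasDb in place beats recomputing it on every fusion iteration.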

This PR is the bare minimum to get this working, in the interest of
unscrewing the compilation times ASAP.

After this change, on the baseline introduced in
#36345 we go from ~193s to ~5.2s,
bringing us to a reasonable range (although there are still substantial
improvements that can be made).

Followups I want to do:
- We don't have a way of expressing deletion of values in AliasDb. In
`graph_fuser.cpp` we sometimes construct nodes that we end up throwing
away, and we are littering `MemoryDAG` with references to dangling
pointers. Because of the way the pass works, it's fine, but this is
fragile so I want to fix it.
- We should decouple alias analysis from write tracking, to simplify the
job of keeping the write caches consistent as we mutate the aliasing
information.
- The tensorexpr fuser doesn't do this update and is thus incorrect today; we
need to fix it to do the same.

ghstack-source-id: f80e060
Pull Request resolved: #37106
suo added a commit that referenced this pull request Apr 30, 2020
ghstack-source-id: a29befa
Pull Request resolved: #37106
suo added a commit that referenced this pull request May 1, 2020
ghstack-source-id: eba073e
Pull Request resolved: #37106
@facebook-github-bot

@suo merged this pull request in 5efd105.

@facebook-github-bot facebook-github-bot deleted the gh/suo/304/head branch May 4, 2020 14:17
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#36345


Test Plan: Imported from OSS

Differential Revision: D20952727

Pulled By: suo

fbshipit-source-id: 099f797222d7e71e5c04991584adc2c7eab5a70f
Labels: Merged, oncall: jit (JIT oncall triage queue)

5 participants