
[jit] speed up alias analysis #36345

Closed
suo wants to merge 10 commits into gh/suo/304/base from gh/suo/304/head

Conversation

@suo
Member

@suo suo commented Apr 9, 2020

Stack from ghstack:

During compilation, we spend a huge amount of time in alias analysis.
This PR does a few things to speed it up.

  1. Separate the analysis into two phases: one where we build up the
    necessary data structures, and the other where we service aliasing
    queries. This allows us to defer building indices/maintaining index
    consistency until after the "buildup" phase is done.

  2. Properly memoize/dynamic-program the memory-location lookups.

  3. Done naively, setting wildcards invalidates the above memoization,
    triggering costly recomputation. So I added a cache-aware `setWildcards`.
    Sadly that means you need alias analysis to reach into the guts of
    `MemoryDAG`, but the speedup is worth it.

Sadly, these changes are kind of coupled for correctness reasons, so
they're all here at once.

I used this model (thanks @IlyaOvodov) as a provisional benchmark. You
can get it here:
https://www.dropbox.com/s/jlyygn6yygj1jkx/yolov3.zip. Unzip and run
`python test_timing.py`.

Baseline: (752.076s) right before 6bc8ffe
After optimizing before inlining: (699.593s)
After deferring cache construction: (426.180s)
After cache-aware setWildcards: (193.678s)

So a nice 75% speedup to overall compilation. There's a lot more to do
in other places of the compilation pipeline though.

Followup to this PR specifically: Everything that fans out from the
analyze call is the "buildup" phase of AliasDB construction. This
should be factored into a separate analysis pass to statically
distinguish the two phases (right now we just null out stuff to
accomplish the same thing dynamically).

Differential Revision: D20952727
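
The buildup/query split in point 1 can be sketched with a toy model (hypothetical Python, not the actual C++ AliasDb; all names here are made up for illustration). Mutations during buildup just record edges, and the memo tables are built lazily at the first query:

```python
class AliasDB:
    """Toy two-phase alias database: mutations during buildup are cheap,
    and memo tables exist only once buildup is finished."""

    def __init__(self):
        self.edges = {}      # value -> set of values it may point to
        self.frozen = False  # becomes True once the buildup phase ends
        self._memo = None    # value -> frozenset of memory locations

    def add_points_to(self, a, b):
        # Buildup phase: just record the edge; no index maintenance,
        # no cache invalidation.
        assert not self.frozen, "buildup phase is over"
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set())

    def memory_locations(self, v):
        # Query phase: the first query freezes the DB, then every
        # lookup is memoized.
        if not self.frozen:
            self.frozen = True
            self._memo = {}
        if v not in self._memo:
            targets = self.edges.get(v, set())
            if not targets:
                # An element that points to nothing is its own location.
                locs = frozenset([v])
            else:
                locs = frozenset().union(
                    *(self.memory_locations(t) for t in targets))
            self._memo[v] = locs
        return self._memo[v]
```

For example, with edges `x -> a` and `y -> a`, both `x` and `y` resolve to the single memory location `a`, and any mutation after the first query trips the buildup assertion.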

@suo suo requested a review from apaszke as a code owner April 9, 2020 21:39
suo added a commit that referenced this pull request Apr 9, 2020
ghstack-source-id: 65205e3
Pull Request resolved: #36345
@facebook-github-bot facebook-github-bot added the `oncall: jit` label (add this issue/PR to JIT oncall triage queue) Apr 9, 2020
@suo suo requested a review from eellison April 9, 2020 21:41
@dr-ci

dr-ci Bot commented Apr 9, 2020

💊 Build failures summary and remediations

As of commit 2931e90 (more details on the Dr. CI page):



❄️ 1 tentatively flaky failure

1 failure tentatively classified as flaky but reruns have not yet been triggered to confirm:

See CircleCI build caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build (1/1)

Step: "Build" ❄️

Apr 30 17:35:26 Failed to recurse into submodule path 'third_party/onnx-tensorrt'
sys	0m0.053s 
Apr 30 17:34:46 ++ export BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build 
Apr 30 17:34:46 ++ BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build 
Apr 30 17:34:46 ++ git submodule sync 
Apr 30 17:34:47 ++ git submodule update -q --init --recursive 
Apr 30 17:35:26 error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function. 
Apr 30 17:35:26 fatal: The remote end hung up unexpectedly 
Apr 30 17:35:26 fatal: early EOF 
Apr 30 17:35:26 fatal: index-pack failed 
Apr 30 17:35:26 fatal: clone of 'https://github.com/onnx/onnx.git' into submodule path 'third_party/onnx' failed 
Apr 30 17:35:26 Failed to recurse into submodule path 'third_party/onnx-tensorrt' 

Extra GitHub checks: 5 failed


This comment was automatically generated by Dr. CI.

@eellison
Contributor

@suo could you look into the failures before I review? I would expect this shouldn't have any behavioral changes, since it's not changing the analysis, just speeding it up (it shouldn't break tests).

@IlyaOvodov

@suo, a 75% performance boost is a great success, but an application that takes 3 minutes to start is still almost as useless as one that takes 10.
As far as I understand from reading issues and forums, this huge time is spent on optimization (which gives only a few percent of performance gain and is repeated every time the input shape changes), compared with pure-Python PyTorch, which starts the same net in less than 1 second.
If I'm right, perhaps a solution is to provide a way to switch this optimization off when the time it takes becomes unacceptable? Sorry if the reality is more complex than I expect :)

@suo
Member Author

suo commented Apr 15, 2020

@IlyaOvodov: correct, the optimization time on large networks is still way too high. In production environments, we typically expect this time to be amortized over the lifetime of model serving, but even then the current state is bad.

There are a number of ways to tune the execution strategy. For example, if you don't want optimization and basically want the code to run in interpreter mode, you can set the following:

```python
with torch.jit.optimized_execution(False):
    my_model(inputs)
```

@IlyaOvodov

@suo Wow! What a great undocumented cheat code!
But is there a way to do the same (turn it off) in C++ inference?

@suo
Member Author

suo commented Apr 15, 2020

But is there a way to do the same (turn it off) in C++ inference?

Answered on #36040, as it's more appropriate for discussion there.

suo added a commit that referenced this pull request Apr 15, 2020
ghstack-source-id: db45d28
Pull Request resolved: #36345
@eellison
Contributor

eellison commented Apr 16, 2020

@suo I'm reviewing now, but could you expand parts 2 and 3 in your summary with more information?

Properly memoize/dynamic program the memory locations lookups.

How?

Done naively, setting wildcards invalidates the above memoization,
trigger costly recomputation. So I added a cache-aware setWildcards.
Sadly that means you need alias analysis to reach into the guts of
memorydag, but the speedup is worth it.

What is the naive solution and why doesn't it work? What is the new approach and why does it solve the problem?

@suo
Member Author

suo commented Apr 17, 2020

@suo i'm reviewing now but could you expand parts 2 and 3 in your summary with more information?

Properly memoize/dynamic program the memory locations lookups.

The previous BFS implementation wasn't reusing the cached memory-location lookup result, so we would redo the whole search instead of memoizing.
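
A toy illustration of that fix (hypothetical Python model; the real code is C++, though `cachedMemoryLocations_` is the actual field name): each element caches its memory-location set, so a traversal that reaches an already-analyzed element returns immediately instead of re-walking its whole subgraph.

```python
class Element:
    def __init__(self, name):
        self.name = name
        self.points_to = []
        self.cached_memory_locations = None  # analogous to cachedMemoryLocations_

visits = 0  # counts how many elements are actually traversed

def memory_locations(el):
    global visits
    if el.cached_memory_locations is not None:
        return el.cached_memory_locations   # memo hit: no re-traversal
    visits += 1
    if not el.points_to:
        locs = {el.name}                    # a leaf is its own memory location
    else:
        locs = set()
        for t in el.points_to:
            locs |= memory_locations(t)
    el.cached_memory_locations = locs
    return locs

# A diamond: x -> a -> c and x -> b -> c. Without the cache, c's subtree
# would be walked twice; with it, every element is visited exactly once.
c = Element("c"); a = Element("a"); b = Element("b"); x = Element("x")
a.points_to = [c]; b.points_to = [c]; x.points_to = [a, b]
assert memory_locations(x) == {"c"}
assert visits == 4   # x, a, b, c: each visited once
```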

Done naively, setting wildcards invalidates the above memoization,
trigger costly recomputation. So I added a cache-aware setWildcards.
Sadly that means you need alias analysis to reach into the guts of
memorydag, but the speedup is worth it.

what is the naive solution and why doesn't it work, what is the new approach why does that solve it

The naive solution is to invalidate everything in the transitive "points-from" closure of the wildcard. Since `setWildcard` only affects memory locations (which are at the "source" of the points-from graph), it has the effect of invalidating basically the entire cache.

The new approach is commented in the code, but basically involves updating the cache in place in linear time.
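
The contrast can be sketched with a toy model (data structures and names hypothetical; the real implementation lives in `MemoryDAG`): the naive update throws away every affected memo, while the cache-aware one splices the wildcard location into each affected memo in a single linear pass.

```python
from dataclasses import dataclass

WILDCARD = "*"

@dataclass
class Elem:
    name: str
    cached: set  # memoized memory-location set (None = must recompute)

def set_wildcard_naive(elements, target):
    # Naive: any memo that mentions one of target's memory locations is
    # dropped wholesale and must later be recomputed from scratch.
    affected = set(target.cached)
    for e in elements:
        if e.cached is not None and e.cached & affected:
            e.cached = None

def set_wildcard_cache_aware(elements, target):
    # Cache-aware: the same memos just gain the wildcard location,
    # updated in place in one linear pass; nothing is invalidated.
    affected = set(target.cached)
    for e in elements:
        if e.cached is not None and e.cached & affected:
            e.cached |= {WILDCARD}

# x and y both resolve to memory location "a"; z does not, so its
# cache survives untouched.
a, x, y, z = Elem("a", {"a"}), Elem("x", {"a"}), Elem("y", {"a"}), Elem("z", {"b"})
set_wildcard_cache_aware([a, x, y, z], a)
assert x.cached == {"a", WILDCARD} and z.cached == {"b"}
```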

Contributor

@eellison eellison left a comment


I had a hard time reviewing this code for correctness. There are a lot of new caches, and it's hard to parse out what their lifetimes are and when they are invalidated etc.

Could you please write up a description of how everything works together?

Comment thread torch/csrc/jit/passes/utils/memory_dag.h Outdated
```cpp
// traversing in the direction `dir`. `fn` will be run on each element.
void bfs(BfsDirection dir, MemoryLocations& res) const;
friend class MemoryDAG;
mutable c10::optional<MemoryLocations> cachedMemoryLocations_;
```
Contributor


this has lost its comment. maybe also add when it's expected to be set and when it's expected to be c10::nullopt

Comment thread torch/csrc/jit/ir/alias_analysis.h Outdated
Comment thread torch/csrc/jit/passes/utils/memory_dag.h Outdated
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
```cpp
auto contained_elem = memoryDAG_->fromIndex(loc);
// we only register writes on memory locations
if (contained_elem->pointsTo.empty()) {
  writeIndex[write.first].set(contained_elem->index);
```
Contributor


Why are we calling `.set()` here and `|=` above? Could we just be using `=` instead of `|=` above?

Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/passes/utils/memory_dag.cpp
suo added 2 commits April 22, 2020 15:46
@suo suo mentioned this pull request Apr 24, 2020
suo added a commit that referenced this pull request Apr 24, 2020
Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So
we start with
```
x, y = f(a, b, c)
```

and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
   x_in, y_in = f(a_in, b, c)
   -> x_in, y_in
```

We destroy the `x` and `y` `Value*`s in the process. This operation is
easy to express as an update to the AliasDb: `x_out` just takes on all
the aliasing information `x` used to have. In particular, since we know
`f` and `prim::fusionGroup` are purely functional, we don't have to mess
with any write information.

This PR is the bare minimum to get this working, in the interest of
unscrewing the compilation times ASAP.

After this change, on the baseline introduced in
#36345 we go from ~193s to ~5.2s,
bringing us to a reasonable range (although there are still substantial
improvements that can be made).

Followups I want to do:
- We don't have a way of expressing deletion of values in AliasDb. In
`graph_fuser.cpp` we sometimes construct nodes that we end up throwing
away, and we are littering `MemoryDAG` with references to dangling
pointers. Because of the way the pass works, it's fine, but this is
fragile so I want to fix it.
- We should decouple alias analysis from write tracking, to simplify the
job of keeping the write caches consistent as we mutate the aliasing
information.
- The tensorexpr fuser doesn't do this and is thus incorrect today; we
need to update it to work the same way.

ghstack-source-id: e9b36de
Pull Request resolved: #37106
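
The in-place update this commit message describes can be sketched with a toy dict-based model (hypothetical; the real AliasDb API differs): the fusion-group output inherits the destroyed value's alias set, and stale mentions of the old value are rewritten, so the database is never recomputed from scratch.

```python
def replace_output(alias_sets, old, new):
    """Transfer `old`'s aliasing info to `new` in place.

    `alias_sets` maps each value name to the set of values it may
    alias. `new` takes over `old`'s entry, and every other entry that
    mentioned `old` is rewritten to mention `new` instead.
    """
    alias_sets[new] = alias_sets.pop(old)
    for s in alias_sets.values():
        if old in s:
            s.discard(old)
            s.add(new)

# x is replaced by the fusion output x_out; y, which aliased x, now
# aliases x_out, and no global recomputation is needed.
alias_sets = {"x": {"a"}, "y": {"x"}, "a": set()}
replace_output(alias_sets, "x", "x_out")
assert alias_sets["x_out"] == {"a"}
assert alias_sets["y"] == {"x_out"}
```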
suo added a commit that referenced this pull request Apr 24, 2020
ghstack-source-id: 268558c
Pull Request resolved: #37106
suo added a commit that referenced this pull request Apr 24, 2020
ghstack-source-id: 74f3839
Pull Request resolved: #37106
Contributor

@eellison eellison left a comment


This is a lot of complexity, but given how much this is on the hot path I'm approving. Most of the complexity is because of wildcards and how they invalidate the MemoryDAG state. We should investigate a context-sensitive heap analysis so that wildcards are no longer special-cased within the dag.

Could you please address my comment in `getMemoryLocations` before landing?

Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
Comment thread torch/csrc/jit/ir/alias_analysis.cpp
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
Comment thread torch/csrc/jit/ir/alias_analysis.cpp Outdated
Comment thread torch/csrc/jit/passes/utils/memory_dag.cpp
```cpp
auto el = dag.fromIndex(index);
if (el->pointsTo.empty()) {
  res.set(index);
if (e->pointsTo.empty()) {
```
Contributor:

It's a little surprising that we're not using a seen set in our DFS implementation; it may not trigger now, but that's an easy way to get exponential runtime. Could you add a `getMemoryLocationsImpl` with a seen set?

suo (Member Author):

Not sure what a seen set would do? If we have traversed e before, its cachedMemoryLocations_ will be populated and thus we will return immediately from the traversal.

Contributor:

Yeah, never mind: `cachedMemoryLocations_` makes sure we don't redo work.
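To see why the memoization cache also serves as a visited set, here is a small Python sketch (purely illustrative; it is not the actual C++ `MemoryDAG` code, and all names in it are hypothetical). On a "diamond" points-to DAG, a naive DFS explores exponentially many paths, while the cache bounds the work to one expansion per element:

```python
def build_diamond(depth):
    # Each element (i, j) points to both elements on the next level,
    # so the number of root-to-leaf paths is 2**depth.
    points_to = {}
    for i in range(depth):
        for j in (0, 1):
            points_to[(i, j)] = {(i + 1, 0), (i + 1, 1)}
    points_to[(depth, 0)] = set()  # leaves represent real memory locations
    points_to[(depth, 1)] = set()
    return points_to

def memory_locations(points_to, elem, cache, stats):
    # Cache hit: this element was fully expanded before, so return at once.
    if elem in cache:
        return cache[elem]
    stats["expansions"] += 1
    targets = points_to[elem]
    if not targets:
        locs = frozenset({elem})  # an element with no pointees is its own location
    else:
        locs = frozenset().union(
            *(memory_locations(points_to, t, cache, stats) for t in targets)
        )
    cache[elem] = locs
    return locs

points_to = build_diamond(20)
cache, stats = {}, {"expansions": 0}
locs = memory_locations(points_to, (0, 0), cache, stats)
# Only the two leaves are memory locations.
assert locs == frozenset({(20, 0), (20, 1)})
# One expansion per reachable element (41 of them), not 2**20 paths.
assert stats["expansions"] == 41
```

A plain seen set would give the same asymptotics here; the cache just additionally stores the computed result for reuse across queries.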

During compilation, we spend a huge amount of time in alias analysis.
This PR does a few things to speed it up.

1. Separate the analysis into two phases: one where we build up the
necessary data structures, and the other where we service aliasing
queries. This allows us to defer building indices/maintaining index
consistency until after the "buildup" phase is done.

2. Properly memoize (dynamic programming) the memory-location lookups.

3. Done naively, setting wildcards invalidates the above memoization,
triggering costly recomputation. So I added a cache-aware `setWildcards`.
Sadly that means alias analysis needs to reach into the guts of
MemoryDAG, but the speedup is worth it.

Sadly, these changes are kind of coupled for correctness reasons, so
they're all here at once.
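The buildup/query split in point 1 can be sketched in Python (a hypothetical toy model, not the actual PyTorch API; `MemoryDAGBuilder`, `freeze`, and `may_alias` are illustrative names): during buildup we only record points-to edges, and only the frozen query object maintains a memoization cache.

```python
class MemoryDAGBuilder:
    """Buildup phase: just record points-to edges; no caches to keep consistent."""

    def __init__(self):
        self.points_to = {}

    def add_element(self, elem):
        self.points_to.setdefault(elem, set())

    def make_pointer(self, src, dest):
        # Record that src may point to dest; no index maintenance here.
        self.add_element(src)
        self.add_element(dest)
        self.points_to[src].add(dest)

    def freeze(self):
        """End of buildup: hand the edges to the query-phase object."""
        return MemoryDAG(self.points_to)


class MemoryDAG:
    """Query phase: answers aliasing queries with memoized lookups."""

    def __init__(self, points_to):
        self.points_to = points_to
        self._cache = {}

    def memory_locations(self, elem):
        if elem not in self._cache:
            targets = self.points_to[elem]
            if not targets:
                locs = frozenset({elem})
            else:
                locs = frozenset().union(
                    *(self.memory_locations(t) for t in targets)
                )
            self._cache[elem] = locs
        return self._cache[elem]

    def may_alias(self, a, b):
        # Two values may alias iff their memory locations intersect.
        return bool(self.memory_locations(a) & self.memory_locations(b))


builder = MemoryDAGBuilder()
builder.make_pointer("x", "storage1")
builder.make_pointer("y", "storage1")
builder.make_pointer("z", "storage2")
dag = builder.freeze()
assert dag.may_alias("x", "y")
assert not dag.may_alias("x", "z")
```

In this model, point 3 is the hard part: anything that rewires `points_to` after `freeze()` (like setting a wildcard) must either blow away `_cache` or update it surgically, which is what the cache-aware `setWildcards` does.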

I used this model (thanks @IlyaOvodov) as a provisional benchmark. You
can get it here:
https://www.dropbox.com/s/jlyygn6yygj1jkx/yolov3.zip. Unzip it and run
`python test_timing.py`.

Baseline: (752.076s) right before 6bc8ffe
After optimizing before inlining: (699.593s)
After deferring cache construction: (426.180s)
After cache-aware `setWildcards`: (193.678s)

So a nice 75% speedup to overall compilation. There's a lot more to do
in other places of the compilation pipeline though.

Followup to this PR specifically: everything that fans out from the
`analyze` call is the "buildup" phase of AliasDB construction. This
should be factored into a separate analysis pass to statically
distinguish the two phases (right now we just null out stuff to
accomplish the same thing dynamically).

Differential Revision: [D20952727](https://our.internmc.facebook.com/intern/diff/D20952727)

[ghstack-poisoned]
suo added a commit that referenced this pull request Apr 30, 2020
Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So
we start with
```
x, y = f(a, b, c)
```

and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
   x_in, y_in = f(a_in, b, c)
   -> x_in, y_in
```

We destroy the `x` and `y` `Value*`s in the process. This operation is
easy to express as an update to the aliasDb--`x_out` just takes on all
the aliasing information `x` used to have. In particular, since we know
`f` and `prim::fusionGroup` are purely functional, we don't have to mess
with any write information.
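In a toy Python model of the points-to map (purely illustrative; `alias_info` and `transfer_alias_info` are hypothetical stand-ins for the real MemoryDAG, not PyTorch APIs), the in-place update amounts to transferring the destroyed output's entry to the fusion group's new output:

```python
def transfer_alias_info(alias_info, old_value, new_value):
    # new_value takes on all the aliasing information old_value had;
    # no global re-analysis of the graph is needed.
    alias_info[new_value] = alias_info.pop(old_value)

# x and y each alias some memory locations before fusion.
alias_info = {"x": {"loc_a"}, "y": {"loc_b"}, "c": {"loc_c"}}

# Fusion replaces x, y with x_out, y_out; since f and prim::fusionGroup
# are purely functional, write sets are left untouched.
transfer_alias_info(alias_info, "x", "x_out")
transfer_alias_info(alias_info, "y", "y_out")

assert alias_info == {"x_out": {"loc_a"}, "y_out": {"loc_b"}, "c": {"loc_c"}}
```

This constant-time transfer is why updating the AliasDb in place beats recomputing it on every fusion iteration.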

This PR is the bare minimum to get this working, in the interest of
unscrewing the compilation times ASAP.

After this change, on the baseline introduced in
#36345 we go from ~193s to ~5.2s,
bringing us to a reasonable range (although there are still substantial
improvements that can be made).

Followups I want to do:
- We don't have a way of expressing deletion of values in AliasDb. In
`graph_fuser.cpp` we sometimes construct nodes that we end up throwing
away, and we are littering `MemoryDAG` with references to dangling
pointers. Because of the way the pass works, it's fine, but this is
fragile so I want to fix it.
- We should decouple alias analysis from write tracking, to simplify the
job of keeping the write caches consistent as we mutate the aliasing
information.
- The tensorexpr fuser doesn't do this update and is thus incorrect today; we
need to fix it to do the same.

ghstack-source-id: f80e060
Pull Request resolved: #37106
suo added a commit that referenced this pull request Apr 30, 2020
ghstack-source-id: a29befa
Pull Request resolved: #37106
suo added a commit that referenced this pull request May 1, 2020
ghstack-source-id: eba073e
Pull Request resolved: #37106
@facebook-github-bot

@suo merged this pull request in 5efd105.

@facebook-github-bot facebook-github-bot deleted the gh/suo/304/head branch May 4, 2020 14:17
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#36345


Test Plan: Imported from OSS

Differential Revision: D20952727

Pulled By: suo

fbshipit-source-id: 099f797222d7e71e5c04991584adc2c7eab5a70f
Labels: Merged, oncall: jit (JIT oncall triage queue)

5 participants