
fix an infinite loop in liveness #36697

Closed
Krovatkin wants to merge 1 commit into pytorch:master from Krovatkin:krovatkin/fix_hang

Conversation

Contributor

@Krovatkin Krovatkin commented Apr 16, 2020

This should fix #36434

We create new nodes to insert explicit uses of loop counters while doing liveness analysis. This was fine with the old two-pass liveness (we don't really care about the liveness sets for those nodes), but with the fixed-point algorithm we can never reach the fixed point: the liveness sets for these new nodes start out empty, we always add some live values to them, and so `changed_` is always set to `true`.
It's amazing that this didn't get exposed and worked for such a long time! Apparently, when we destroyed and recreated those new nodes, they were allocated at exactly the same addresses in memory, and since we use those addresses as keys to look up liveness sets, the brand-new nodes inherited the old liveness sets 🤦
I was still a bit sceptical of this explanation, so I added more tracing to the liveness analysis, and AFAICT this is exactly how we got away with this bug for so long!

Here are a few excerpts from the trace.

Before we enter a loop, we create a node that uses the loop's upper bound.

```
[DEBUG liveness.cpp:121] @#$Creating a store for mtc : 0x555777c19eb0
```

When processing the loop, we also process this node. Its liveness sets are empty!

```
[DEBUG liveness.cpp:099] Processing node  = prim::Store(%3) addr = 0x555777c19eb0
[DEBUG liveness.cpp:148] @#$liveness_sets_[it] : {}
```

We are done with this loop, so we remove the node we added.

```
[DEBUG liveness.cpp:127] @#$Destroying a store for ctc : 0x555777c19eb0
```

We are about to process the loop for the second time, so we create the use node again.
Note that it's allocated at the exact same address!

```
[DEBUG liveness.cpp:118] @#$Creating a store for ctc : 0x555777c19eb0
```

Now we process it again, and it has non-empty sets even though it's a brand-new node!

```
[DEBUG liveness.cpp:099] Processing node  = prim::Store(%i) addr = 0x555777c19eb0
[DEBUG liveness.cpp:148] @#$liveness_sets_[it] : {2, 3, 10}
```

@Krovatkin Krovatkin requested a review from zdevito April 16, 2020 01:11
@Krovatkin Krovatkin requested a review from apaszke as a code owner April 16, 2020 01:11
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Apr 16, 2020
@dr-ci

dr-ci Bot commented Apr 16, 2020

💊 Build failures summary and remediations

As of commit c9894be (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/2)

Step: "Build" (full log | pattern match details | 🔁 rerun) <confirmed not flaky by 2 failures>

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_definitions.py 
Auto-merging .circleci/cimodel/data/pytorch_build_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py 
Auto-merging .circleci/cimodel/data/pytorch_build_data.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py 
Auto-merging .circleci/cimodel/data/dimensions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_definitions.py 
Auto-merging .circleci/cimodel/data/binary_build_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py 
Auto-merging .circleci/cimodel/data/binary_build_data.py 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_build (2/2)

Step: "Build" (full log | pattern match details | 🔁 rerun) <confirmed not flaky by 2 failures>

Apr 16 18:47:04 Failed to generate ATEN bindings: ['/var/lib/jenkins/workspace/xla/scripts/generate_code.sh']
Apr 16 18:47:04 AtenXlaType function missed override: Tensor& normal_(Tensor& self, double mean, double std, c10::optional<Generator> generator); // normal_(Tensor,double,double,c10::optional<Generator>)->Tensor 
Apr 16 18:47:04 AtenXlaType function missed override: Tensor rrelu_with_noise(const Tensor& self, const Tensor& noise, Scalar lower, Scalar upper, bool training, c10::optional<Generator> generator); // rrelu_with_noise(Tensor,Tensor,Scalar,Scalar,bool,c10::optional<Generator>)->Tensor 
Apr 16 18:47:04 AtenXlaType function missed override: Tensor& uniform_(Tensor& self, double from, double to, c10::optional<Generator> generator); // uniform_(Tensor,double,double,c10::optional<Generator>)->Tensor 
Apr 16 18:47:04 Traceback (most recent call last): 
Apr 16 18:47:04   File "/var/lib/jenkins/workspace/xla/scripts/gen.py", line 1081, in <module> 
Apr 16 18:47:04     generate(args) 
Apr 16 18:47:04   File "/var/lib/jenkins/workspace/xla/scripts/gen.py", line 1051, in generate 
Apr 16 18:47:04     assert check_overrides(overrides, overridden) 
Apr 16 18:47:04 AssertionError 
Apr 16 18:47:04 Building torch_xla version: 1.6 
Apr 16 18:47:04 Failed to generate ATEN bindings: ['/var/lib/jenkins/workspace/xla/scripts/generate_code.sh'] 
Apr 16 18:47:04 + cleanup 
Apr 16 18:47:04 + retcode=1 
Apr 16 18:47:04 + set +x 
Apr 16 18:47:04 =================== sccache compilation log =================== 
Apr 16 18:47:04 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Apr 16 18:47:04 Compile requests              5472 
Apr 16 18:47:04 Compile requests executed     3296 
Apr 16 18:47:04 Cache hits                    3283 
Apr 16 18:47:04 Cache misses                     1 
Apr 16 18:47:04 Cache timeouts                   0 

1 job timed out:

  • caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test

This comment was automatically generated by Dr. CI.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Apr 16, 2020

This is cool! What does the diff do?

Contributor

@zdevito zdevito left a comment


Code looks fine, but I don't know why the original code wasn't fine, so I can't say if this is correct. I also don't see a test to make sure the infinite loop no longer exists.

Comment thread torch/csrc/jit/passes/liveness.cpp Outdated
Contributor


I think this should be lv.bodyBlock(). Dereferencing end() shouldn't be considered defined.

Contributor Author


fixed

@Krovatkin
Contributor Author

Krovatkin commented Apr 16, 2020

fyi: @zdevito @ezyang

This is cool! What does the diff do?

(This comment repeats, verbatim, the explanation and trace excerpts from the PR description above.)

Contributor

@zdevito zdevito left a comment


Thanks for the explanation, can we make sure that goes in the commit message?

@ezyang
Contributor

ezyang commented Apr 16, 2020

I'm cool with this going in, but I would also suggest thinking about whether there are any extra asserts or linting passes you could have added that would have caught this kind of error sooner. Very helpful, when working on a compiler, they are ;)

…eness and thus always setting changed_ to true
@Krovatkin Krovatkin force-pushed the krovatkin/fix_hang branch from 74ba5d7 to c9894be Compare April 16, 2020 18:29
Contributor

@facebook-github-bot facebook-github-bot left a comment


@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@Krovatkin merged this pull request in 76f9528.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
This should fix pytorch#36434

(The commit summary repeats, verbatim, the explanation and trace excerpts from the PR description above.)
Pull Request resolved: pytorch#36697

Differential Revision: D21059313

Pulled By: Krovatkin

fbshipit-source-id: b0fdeb4418e0e73f34187826877179260f21cf7b

Labels

Merged oncall: jit Add this issue/PR to JIT oncall triage queue


Development

Successfully merging this pull request may close these issues.

Intermittent hang in BuildLivenessSets after test_svd_cuda_float64 on CUDA 10.1 configurations

5 participants