
fix an infinite loop in liveness #36697

Closed
Krovatkin wants to merge 1 commit into pytorch:master from Krovatkin:krovatkin/fix_hang

Conversation

Contributor

@Krovatkin Krovatkin commented Apr 16, 2020

This should fix #36434

We create new nodes to insert explicit uses of loop counters while doing liveness analysis. This was fine with the old two-pass liveness (we don't really care about the liveness sets for those nodes), but with the fixed-point algorithm we can never reach the fixed point: the liveness sets for these new nodes start out empty, we always add some live values to them, and so `changed_` is always set to `true`.
It's amazing that this didn't get exposed and worked for such a long time! Apparently, when we destroyed and recreated those new nodes, they were allocated at exactly the same addresses in memory, and since we use those addresses as keys to look up liveness sets, the brand-new nodes inherited the old liveness sets 🤦
I was still a bit sceptical of this explanation, so I added more tracing to the liveness analysis, and AFAICT this is exactly how we got away with this bug for so long!

Here are a few excerpts from the trace.

Before we enter a loop, we create a node that uses the loop's upper bound.

```
[DEBUG liveness.cpp:121] @#$Creating a store for mtc : 0x555777c19eb0
```

When processing the loop, we also process this node. Its liveness sets are empty!

```
[DEBUG liveness.cpp:099] Processing node  = prim::Store(%3) addr = 0x555777c19eb0
[DEBUG liveness.cpp:148] @#$liveness_sets_[it] : {}
```

We are done with this loop, so we remove the node we added.

```
[DEBUG liveness.cpp:127] @#$Destroying a store for ctc : 0x555777c19eb0
```

We are about to process the loop for the second time, so we create the use node again.
Note that it's allocated at the exact same address!

```
[DEBUG liveness.cpp:118] @#$Creating a store for ctc : 0x555777c19eb0
```

Now we process it again, and it has non-empty sets even though it's a brand-new node!

```
[DEBUG liveness.cpp:099] Processing node  = prim::Store(%i) addr = 0x555777c19eb0
[DEBUG liveness.cpp:148] @#$liveness_sets_[it] : {2, 3, 10}
```

@Krovatkin Krovatkin requested a review from zdevito April 16, 2020 01:11
@Krovatkin Krovatkin requested a review from apaszke as a code owner April 16, 2020 01:11
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Apr 16, 2020
@dr-ci

dr-ci Bot commented Apr 16, 2020

💊 Build failures summary and remediations

As of commit c9894be (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/2)

Step: "Build" (full log | pattern match details | 🔁 rerun) <confirmed not flaky by 2 failures>

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_definitions.py 
Auto-merging .circleci/cimodel/data/pytorch_build_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py 
Auto-merging .circleci/cimodel/data/pytorch_build_data.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py 
Auto-merging .circleci/cimodel/data/dimensions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_definitions.py 
Auto-merging .circleci/cimodel/data/binary_build_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py 
Auto-merging .circleci/cimodel/data/binary_build_data.py 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_build (2/2)

Step: "Build" (full log | pattern match details | 🔁 rerun) <confirmed not flaky by 2 failures>

Apr 16 18:47:04 Failed to generate ATEN bindings: ['/var/lib/jenkins/workspace/xla/scripts/generate_code.sh']
Apr 16 18:47:04 AtenXlaType function missed override: Tensor& normal_(Tensor& self, double mean, double std, c10::optional<Generator> generator); // normal_(Tensor,double,double,c10::optional<Generator>)->Tensor 
Apr 16 18:47:04 AtenXlaType function missed override: Tensor rrelu_with_noise(const Tensor& self, const Tensor& noise, Scalar lower, Scalar upper, bool training, c10::optional<Generator> generator); // rrelu_with_noise(Tensor,Tensor,Scalar,Scalar,bool,c10::optional<Generator>)->Tensor 
Apr 16 18:47:04 AtenXlaType function missed override: Tensor& uniform_(Tensor& self, double from, double to, c10::optional<Generator> generator); // uniform_(Tensor,double,double,c10::optional<Generator>)->Tensor 
Apr 16 18:47:04 Traceback (most recent call last): 
Apr 16 18:47:04   File "/var/lib/jenkins/workspace/xla/scripts/gen.py", line 1081, in <module> 
Apr 16 18:47:04     generate(args) 
Apr 16 18:47:04   File "/var/lib/jenkins/workspace/xla/scripts/gen.py", line 1051, in generate 
Apr 16 18:47:04     assert check_overrides(overrides, overridden) 
Apr 16 18:47:04 AssertionError 
Apr 16 18:47:04 Building torch_xla version: 1.6 
Apr 16 18:47:04 Failed to generate ATEN bindings: ['/var/lib/jenkins/workspace/xla/scripts/generate_code.sh'] 
Apr 16 18:47:04 + cleanup 
Apr 16 18:47:04 + retcode=1 
Apr 16 18:47:04 + set +x 
Apr 16 18:47:04 =================== sccache compilation log =================== 
Apr 16 18:47:04 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Apr 16 18:47:04 Compile requests              5472 
Apr 16 18:47:04 Compile requests executed     3296 
Apr 16 18:47:04 Cache hits                    3283 
Apr 16 18:47:04 Cache misses                     1 
Apr 16 18:47:04 Cache timeouts                   0 

1 job timed out:

  • caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test

This comment was automatically generated by Dr. CI.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Apr 16, 2020

This is cool! What does the diff do?

Contributor

@zdevito zdevito left a comment


Code looks fine, but I don't know why the original code wasn't fine, so I can't say if this is correct. I also don't see a test to make sure the infinite loop no longer exists.

Comment thread torch/csrc/jit/passes/liveness.cpp Outdated
Contributor


I think this should be lv.bodyBlock(). Dereferencing end() shouldn't be considered defined.

Contributor Author


fixed

@Krovatkin
Contributor Author

Krovatkin commented Apr 16, 2020

fyi: @zdevito @ezyang

This is cool! What does the diff do?

(This comment repeats, verbatim, the explanation and trace excerpts from the PR description above.)

Contributor

@zdevito zdevito left a comment


Thanks for the explanation, can we make sure that goes in the commit message?

@ezyang
Contributor

ezyang commented Apr 16, 2020

I'm cool with this going in, but I would also suggest thinking about whether there are any extra asserts or linting passes you could have added that would have caught this kind of error sooner. Very helpful, when working on a compiler, they are ;)

…eness and thus always setting changed_ to true
@Krovatkin Krovatkin force-pushed the krovatkin/fix_hang branch from 74ba5d7 to c9894be Compare April 16, 2020 18:29
Contributor

@facebook-github-bot facebook-github-bot left a comment


@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@Krovatkin merged this pull request in 76f9528.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
This should fix pytorch#36434

(The commit summary repeats, verbatim, the explanation and trace excerpts from the PR description above.)
Pull Request resolved: pytorch#36697

Differential Revision: D21059313

Pulled By: Krovatkin

fbshipit-source-id: b0fdeb4418e0e73f34187826877179260f21cf7b

Labels

Merged oncall: jit Add this issue/PR to JIT oncall triage queue


Development

Successfully merging this pull request may close these issues.

Intermittent hang in BuildLivenessSets after test_svd_cuda_float64 on CUDA 10.1 configurations

5 participants