Delete all user forks tracked in RRefContext before shutting down#31893

Closed
xush6528 wants to merge 22 commits into gh/xush6528/55/base from gh/xush6528/55/head

Conversation

Contributor

@xush6528 xush6528 commented Jan 6, 2020

Stack from ghstack:

* **#31893 Delete all user forks tracked in RRefContext before shutting down**

In order to resolve the issue summarized in #31325.

The overall solution is to proactively send delete-fork messages from user nodes before the user nodes detect RRef leaks.

As the first step, we want a weak-ref tracker to track all user RRefs.

Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)

[ghstack-poisoned]
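The weak-ref tracker idea can be sketched with plain Python (all names here are illustrative, not PyTorch's actual API): the context holds only weak references to user RRefs, so tracking never keeps an RRef alive, and at shutdown it walks the still-live entries and deletes each one.

```python
import weakref

class RRefContextSketch:
    """Toy model of a context that weakly tracks user RRefs (illustrative only)."""

    def __init__(self):
        self._confirmed_users = {}  # fork id -> weak ref to a user RRef

    def add_confirmed_user(self, fork_id, rref):
        # A weak reference means tracking alone never keeps an RRef alive.
        self._confirmed_users[fork_id] = weakref.ref(rref)

    def del_all_user_rrefs(self):
        # At shutdown, proactively delete every user fork that is still alive.
        deleted = []
        for fork_id, ref in list(self._confirmed_users.items()):
            rref = ref()  # None if the RRef was already collected
            if rref is not None:
                rref.try_del()  # stand-in for sending a delete-fork message
                deleted.append(fork_id)
        self._confirmed_users.clear()
        return deleted

class UserRRefSketch:
    """Toy user RRef; try_del stands in for messaging the owner."""

    def __init__(self):
        self.deleted = False

    def try_del(self):
        self.deleted = True
```

An RRef that was already garbage-collected simply shows up as a dead weak reference and is skipped, which is exactly why the tracker cannot leak RRefs itself.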
xush6528 added a commit that referenced this pull request Jan 6, 2020
ghstack-source-id: 96331929
Pull Request resolved: #31893
xush6528 added a commit that referenced this pull request Jan 7, 2020
Pull Request resolved: #31893
ghstack-source-id: 96366029
Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)
@xush6528 xush6528 changed the title from "Add user forks tracker in RRefContext" to "Delete all user forks tracked in RRefContext before shutting down" on Jan 7, 2020
@kostmo
Member

kostmo commented Jan 7, 2020

💊 CircleCI build failures summary and remediations

As of commit ac7ec2d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no CircleCI failures yet. 💚 💚



xush6528 added a commit that referenced this pull request Jan 15, 2020
Pull Request resolved: #31893
ghstack-source-id: 96693731
Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)
@xush6528 xush6528 changed the title from "Delete all user forks tracked in RRefContext before shutting down" to "[WIP, debugging] Delete all user forks tracked in RRefContext before shutting down" on Jan 15, 2020
@xush6528 xush6528 requested a review from osalpekar January 15, 2020 23:57
xush6528 added a commit that referenced this pull request Jan 16, 2020
Pull Request resolved: #31893
ghstack-source-id: 96753553
Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)
@xush6528 xush6528 changed the title from "[WIP, debugging] Delete all user forks tracked in RRefContext before shutting down" to "Delete all user forks tracked in RRefContext before shutting down" on Jan 16, 2020
Contributor

@pritamdamania87 pritamdamania87 left a comment

Thanks for fixing this! I have a few comments inline.

Contributor

@mrshenli mrshenli left a comment

Test failures are real.

Jan 16 01:22:10 ======================================================================
Jan 16 01:22:10 FAIL [1.326s]: test_nested_rref (__main__.RpcTestWithSpawn)
Jan 16 01:22:10 ----------------------------------------------------------------------
Jan 16 01:22:10 Traceback (most recent call last):
Jan 16 01:22:10   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 130, in wrapper
Jan 16 01:22:10     self._join_processes(fn)
Jan 16 01:22:10   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 211, in _join_processes
Jan 16 01:22:10     self._check_return_codes(elapsed_time)
Jan 16 01:22:10   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 231, in _check_return_codes
Jan 16 01:22:10     self.assertEqual(p.exitcode, first_process.exitcode)
Jan 16 01:22:10   File "/var/lib/jenkins/workspace/test/common_utils.py", line 890, in assertEqual
Jan 16 01:22:10     super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
Jan 16 01:22:10 AssertionError: 7 not less than or equal to 1e-05 : 

return rrefId_;
}

virtual void tryDel() {}
Contributor

ideally, this should only live in UserRRef. This is OK for now, as otherwise we would have to do static_cast. After #30630, we should be able to let confirmedUsers_ store UserRRef, and can then delete this from the base class.

Contributor Author

This was my concern too. Thanks for telling me about #30630.

xush6528 added a commit that referenced this pull request Jan 16, 2020
Pull Request resolved: #31893
ghstack-source-id: 96765512
Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)
Contributor

@mrshenli mrshenli left a comment

test_nested_rref_stress has failed. Seems one worker quits too soon.

Jan 16 05:38:51 ......s...................terminate called after throwing an instance of 'gloo::IoException'
Jan 16 05:38:51   what():  [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:563] Read error [172.17.0.2]:59041: Connection reset by peer
Jan 16 05:38:51 terminate called after throwing an instance of 'gloo::IoException'
Jan 16 05:38:51   what():  [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:563] Read error [172.17.0.2]:32406: Connection reset by peer
Jan 16 05:38:56 ERROR:torch.distributed.rpc.api:worker2 failed to respond to 'Shutdown Proceed.' request in 0:00:05
Jan 16 05:38:56 ERROR:torch.distributed.rpc.api:worker3 failed to respond to 'Shutdown Proceed.' request in 0:00:05

@xush6528 xush6528 changed the title from "Delete all user forks tracked in RRefContext before shutting down" to "[WIP, debugging] Delete all user forks tracked in RRefContext before shutting down" on Jan 16, 2020
xush6528 added a commit that referenced this pull request Mar 11, 2020
xush6528 added a commit that referenced this pull request Mar 11, 2020
Pull Request resolved: #31893
ghstack-source-id: 99956149
Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)
@xush6528
Contributor Author

xush6528 commented Mar 11, 2020

On this version, the error is

[E rref_impl.cpp:91] Error occurred when deleting UserRRef instance, RRefId = GloballyUniqueId(0, 0), ForkId = GloballyUniqueId(0, 1) : responseMessageFuturePtr != nullptr INTERNAL ASSERT FAILED at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_impl.cpp:87, please report a bug to PyTorch. (tryDel at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_impl.cpp:87)

The assertion does not hold.

@mrshenli
Contributor

On this version, the error is

[E rref_impl.cpp:91] Error occurred when deleting UserRRef instance, RRefId = GloballyUniqueId(0, 0), ForkId = GloballyUniqueId(0, 1) : responseMessageFuturePtr != nullptr INTERNAL ASSERT FAILED at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_impl.cpp:87, please report a bug to PyTorch. (tryDel at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_impl.cpp:87)

The assertion does not hold.

This is weird. This would only happen when deletedOnOwner_ == false and destroyed_ == true? It means the RRefContext has been destroyed, but somehow there are still UserRRefs that have never called tryDel?

xush6528 added a commit that referenced this pull request Mar 11, 2020
xush6528 added a commit that referenced this pull request Mar 11, 2020
Pull Request resolved: #31893
ghstack-source-id: 99982072
Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)
@xush6528
Contributor Author

Merged in the version from #34610, so the changes compared to last night are:

- In RRefContext::delAllUserRRef(), copy confirmedUsers_ to a temporary so that the lock can be released, letting RRef->tryDel() and RRefContext::delUser(..) acquire the lock and remove the deleted UserRRef from confirmedUsers_.

- A more frequent call path to RRef->tryDel() is the Python-GC-triggered UserRRef destructor, where the wrapping logic does not acquire the lock at all.

- After sending all delete-UserRRef messages, wait for the number of OwnerRRefs to reach 0 before moving on.
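The copy-then-release pattern in the first change can be sketched in Python, with `threading.Lock` standing in for `mutex_` (names are illustrative, not PyTorch's actual code). The point is that `try_del` re-enters the context and takes the same non-reentrant lock, so the deletion loop must run with the lock already released:

```python
import threading

class CtxSketch:
    """Toy RRefContext: copy under the lock, delete after releasing it."""

    def __init__(self):
        self._mutex = threading.Lock()  # non-reentrant, like std::mutex
        self._confirmed_users = {}      # fork id -> user RRef

    def add(self, fork_id, rref):
        with self._mutex:
            self._confirmed_users[fork_id] = rref

    def del_user(self, fork_id):
        # Called from UserRRefSketch.try_del(); needs the lock itself.
        with self._mutex:
            self._confirmed_users.pop(fork_id, None)

    def del_all_user_rrefs(self):
        with self._mutex:
            temp = dict(self._confirmed_users)  # copy under the lock
        # Lock released: try_del -> del_user can re-acquire it safely.
        # Holding the lock through this loop would self-deadlock.
        for rref in temp.values():
            rref.try_del()

class UserRRefSketch:
    def __init__(self, ctx, fork_id):
        self.ctx, self.fork_id, self.deleted = ctx, fork_id, False

    def try_del(self):
        self.deleted = True
        self.ctx.del_user(self.fork_id)
```

If `del_all_user_rrefs` instead iterated `self._confirmed_users` while still holding `_mutex`, the first `try_del` would block forever on `del_user`'s acquire.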

TORCH_INTERNAL_ASSERT(
iter != pendingChildren_.end(),
"Inconsistent states: attempt to delete a non-exist child fork.");
pendingChildren_.erase(iter);
Contributor

We cannot directly erase the UserRRef here, because that could trigger delUser, which would then acquire the lock on mutex_ to modify confirmedUsers_. As the lock is already held in this block, it would deadlock.

#34610 solved this problem by creating a temporary c10::intrusive_ptr<RRef> deletedUser to keep the deleted UserRRef alive in this block.
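The temporary-strong-reference trick can be reproduced with Python finalizers (illustrative names only; `PendingCtx` and `Child` are hypothetical stand-ins for RRefContext and UserRRef). Binding the popped value keeps the object alive until after the lock is released, so its destructor, which itself needs the lock, runs safely:

```python
import threading

class PendingCtx:
    """Shows why a temporary strong ref matters when erasing under a lock."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._pending_children = {}
        self.log = []

    def add_pending_child(self, fork_id, child):
        with self._mutex:
            self._pending_children[fork_id] = child

    def confirm_child(self, fork_id):
        with self._mutex:
            # Bind the popped value: the strong ref keeps the child alive,
            # so its finalizer cannot run while we still hold the lock.
            deleted_child = self._pending_children.pop(fork_id)
        # Lock released; dropping the last ref now runs the finalizer,
        # which can take the lock without deadlocking.
        del deleted_child

class Child:
    def __init__(self, ctx):
        self.ctx = ctx

    def __del__(self):
        # Like UserRRef's destructor reaching delUser: it needs the lock.
        with self.ctx._mutex:
            self.ctx.log.append("finalized")
```

Erasing the dict entry directly inside the `with` block would drop the last reference while the lock is held, and the finalizer's acquire would hang.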

Contributor Author

Thanks for explaining the subtlety here. Updating with comments.

std::piecewise_construct,
std::forward_as_tuple(forkId),
std::forward_as_tuple(iter->second));
pendingUsers_.erase(iter);
Contributor

ditto

Contributor Author

Updating.

Contributor

@mrshenli mrshenli left a comment

We are getting close. If tests pass after adding c10::intrusive_ptr<RRef> deletedUser;, we can land. :)

// First, wait for all pending UserRRefs to be confirmed,
// one kind is pendingUsers_, which are shared from Owner,
// the other kind pendingChildren_, which are shared from another User.
std::unique_lock<std::mutex> lock(mutex_);
Contributor

Looks like there is nothing that needs to be shared across the three lock-unlock-lock blocks; we could use two lock_guards here to make the code more readable?

Contributor Author

sounds good. updating.

// Note, there should be no new forkings in between, because it's assumed that
// this utility is called during graceful shutdown, where no new user RPCs can
// be initiated anymore.
auto tempConfirmedUsers = confirmedUsers_;
Contributor

as erasing non-existing keys from unordered_map is fine, we can swap here?

Contributor Author

Yeah, we can. It's more performant, like move semantics but bidirectional.
Also, no one is expected to insert into confirmedUsers_ again, given the assumption that this is called during graceful shutdown.
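The swap-vs-copy point can be illustrated with a Python stand-in for C++ `std::swap` (variable names are illustrative): exchanging the map with an empty one hands over the contents without per-entry duplication and leaves the original empty, which is fine since nothing inserts into `confirmedUsers_` during graceful shutdown.

```python
# confirmed_users stands in for RRefContext::confirmedUsers_.
confirmed_users = {"fork0": "rref0", "fork1": "rref1"}

# Copy approach: both maps now hold all entries; O(n) table duplication.
temp_copy = dict(confirmed_users)

# Swap-style exchange: no per-entry copying, and the original is left
# empty, analogous to swapping with a fresh std::unordered_map in C++.
temp, confirmed_users = confirmed_users, {}
```

After the exchange, the deletion loop works off `temp` while `confirmed_users` is already empty, so any later erase of a no-longer-present key is a harmless no-op.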

const RRefId& rrefId,
const TypePtr& type) {
std::lock_guard<std::mutex> lock(mutex_);
std::unique_lock<std::mutex> lock(mutex_);
Contributor

This is a debugging vestige from #34610. I feel we don't need it in this PR. Let me test in #34610; if it's not necessary, let's revert these three lines.

Contributor Author

Reverting.

return type_;
}

virtual void tryDel() {}
Contributor

Let's add comments to explain its behavior.

Contributor Author

Adding.

Contributor

@mrshenli mrshenli left a comment

Could you please add a test for circular dependency? The example in #27096 should work.

@xush6528
Contributor Author

Update: Use temporary ref to delay UserRRef destruction until lock released.

@xush6528
Contributor Author

xush6528 commented Mar 12, 2020

Added test case to show the "circular dependency" in #27096.
To be precise, it's because the Python GC doesn't collect locally circularly-dependent objects unless gc.collect() is called explicitly. UserRRefs held by such Python objects will not be deleted in time.
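That failure mode can be reproduced with plain Python objects (the `Holder` class is a toy stand-in for an object owning a UserRRef): reference counting alone cannot free a cycle, so finalizers only run once the cycle collector is invoked.

```python
import gc

finalized = []

class Holder:
    """Toy stand-in for a Python object that owns a UserRRef."""

    def __del__(self):
        finalized.append("collected")

def make_cycle():
    a, b = Holder(), Holder()
    a.other, b.other = b, a  # local circular dependency
    # a and b go out of scope here, but their refcounts stay > 0.

gc.disable()  # rule out an automatic collection firing mid-example
make_cycle()
before = len(finalized)  # still 0: refcounting alone cannot free the cycle
gc.collect()             # the cycle collector finally runs the finalizers
after = len(finalized)
gc.enable()
```

This mirrors why the PR proactively deletes tracked user forks at shutdown: an RRef trapped in such a cycle would otherwise only be released at interpreter exit, too late to message the owner.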

xush6528 added a commit that referenced this pull request Mar 12, 2020
xush6528 added a commit that referenced this pull request Mar 12, 2020
Pull Request resolved: #31893
ghstack-source-id: 100023142
Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)
Contributor

@mrshenli mrshenli left a comment

Awesome!! Finally, all tests pass. Thanks a lot for spending so much time on this PR. LGTM! Let's land!


namespace {

constexpr std::chrono::milliseconds kDeleteAllUsersTimeout(100000);
Contributor

This looks good to me for now. In future PRs we would probably like to expose this for users to configure?

Contributor Author

@xush6528 xush6528 Mar 12, 2020

Yes, I prefer not exposing it in the shutdown API at this moment.

I think users should only care to pass a single top-level timeout, which covers:

0. wait_for_all_workers, to flush any user RPC requests,
1. wait for pending users,
2. send and wait for delete-user-fork messages,
3. wait for owners to be removed,
4. wait for join().

This will require utils::Future to take a timeout on its wait() function, and RpcAgent sync()/join() to take timeouts as well.
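The single top-level timeout idea amounts to threading one deadline budget through the stages. A minimal sketch (purely illustrative; utils::Future and RpcAgent do not take timeouts like this today, and the stage names are hypothetical):

```python
import time

def graceful_shutdown(stages, total_timeout):
    """Run shutdown stages against one shared deadline budget.

    Each stage receives the time remaining and must respect it; once
    the budget is exhausted, the remaining stages are not attempted.
    """
    deadline = time.monotonic() + total_timeout
    for name, stage in stages:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"shutdown budget exhausted before {name!r}")
        stage(remaining)

# Hypothetical no-op stages mirroring the list above.
stages = [
    ("wait_for_all_workers", lambda t: None),
    ("wait_for_pending", lambda t: None),
    ("delete_user_forks", lambda t: None),
    ("wait_for_owners", lambda t: None),
    ("join", lambda t: None),
]
graceful_shutdown(stages, total_timeout=5.0)
```

Users would then see one timeout knob on shutdown, while each internal wait consumes only whatever budget earlier stages left over.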

@facebook-github-bot
Contributor

This pull request has been merged in 4e07c35.

@facebook-github-bot facebook-github-bot deleted the gh/xush6528/55/head branch March 16, 2020 14:16
facebook-github-bot pushed a commit that referenced this pull request Mar 19, 2020
…wner, instead of adding nothing to pending users (#34988)

Summary:
Pull Request resolved: #34988

In #31893, we introduced a confirmedUsers_ map in RRefContext.

When the fork is shared from the owner, there is no `pendingUsers_` intermediate phase for it, so we should put the fork into `confirmedUsers_` immediately.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork
```

Differential Revision: D7735909

fbshipit-source-id: 14c36a16486f0cc9618dcfb111fe5223781b647d