Delete all user forks tracked in RRefContext before shutting down #31893
xush6528 wants to merge 22 commits into gh/xush6528/55/base
Conversation
In order to resolve the issue summarized in #31325, the overall solution is to proactively send out delete fork messages from user nodes, before user nodes detect rref leaks. As the first step, we want to have a weak ref tracker to track all user rrefs. Differential Revision: [D19292254](https://our.internmc.facebook.com/intern/diff/D19292254/)
💊 CircleCI build failures summary (Dr. CI): As of commit ac7ec2d, there are no CircleCI failures yet.
pritamdamania87
left a comment
Thanks for fixing this! I have a few comments inline.
mrshenli
left a comment
Test failures are real.
Jan 16 01:22:10 ======================================================================
Jan 16 01:22:10 FAIL [1.326s]: test_nested_rref (__main__.RpcTestWithSpawn)
Jan 16 01:22:10 ----------------------------------------------------------------------
Jan 16 01:22:10 Traceback (most recent call last):
Jan 16 01:22:10 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 130, in wrapper
Jan 16 01:22:10 self._join_processes(fn)
Jan 16 01:22:10 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 211, in _join_processes
Jan 16 01:22:10 self._check_return_codes(elapsed_time)
Jan 16 01:22:10 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 231, in _check_return_codes
Jan 16 01:22:10 self.assertEqual(p.exitcode, first_process.exitcode)
Jan 16 01:22:10 File "/var/lib/jenkins/workspace/test/common_utils.py", line 890, in assertEqual
Jan 16 01:22:10 super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
Jan 16 01:22:10 AssertionError: 7 not less than or equal to 1e-05 :
  return rrefId_;
}

virtual void tryDel() {}
Ideally, this should only live in UserRRef. This is OK for now, as otherwise we would have to do a static_cast. After #30630, we should be able to let confirmedUsers_ store UserRRef and can then delete this from the base class.
This was my concern too. Thanks for telling me about #30630.
mrshenli
left a comment
test_nested_rref_stress has failed. It seems one worker quit too soon.
Jan 16 05:38:51 ......s...................terminate called after throwing an instance of 'gloo::IoException'
Jan 16 05:38:51 what(): [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:563] Read error [172.17.0.2]:59041: Connection reset by peer
Jan 16 05:38:51 terminate called after throwing an instance of 'gloo::IoException'
Jan 16 05:38:51 what(): [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:563] Read error [172.17.0.2]:32406: Connection reset by peer
Jan 16 05:38:56 ERROR:torch.distributed.rpc.api:worker2 failed to respond to 'Shutdown Proceed.' request in 0:00:05
Jan 16 05:38:56 ERROR:torch.distributed.rpc.api:worker3 failed to respond to 'Shutdown Proceed.' request in 0:00:05
On this version, the error is
The assertion does not hold.
This is weird. This would only happen when
Merged the version in #34610, so the changes compared to last night are: In RRefContext::delAllUserRRef(), copy confirmedUsers_ to a temp, so that we can release the lock, so that RRef->tryDel() and Context::delUser(..) can acquire the lock and remove the deleted UserRRef from confirmedUsers_. A more frequent call path to RRef->tryDel() is the Python-GC-triggered UserRRef destructor, where the wrapping logic does not acquire the lock at all. After sending all delete-user-RRef messages, we should wait for the number of OwnerRRefs to reach 0 before moving on.
TORCH_INTERNAL_ASSERT(
    iter != pendingChildren_.end(),
    "Inconsistent states: attempt to delete a non-exist child fork.");
pendingChildren_.erase(iter);
We cannot directly erase the UserRRef here, because that could trigger delUser, which would then acquire the lock on mutex_ to modify confirmedUsers_. As the lock is already acquired in this block, it would deadlock.
#34610 solved this problem by creating a temporary c10::intrusive_ptr<RRef> deletedUser to keep the deleted UserRRef alive in this block.
Thanks for explaining the subtlety here. Updating with comments.
std::piecewise_construct,
std::forward_as_tuple(forkId),
std::forward_as_tuple(iter->second));
pendingUsers_.erase(iter);
mrshenli
left a comment
We are getting close. If tests pass after adding c10::intrusive_ptr<RRef> deletedUser;, we can land. :)
// First, wait for all pending UserRRefs to be confirmed,
// one kind is pendingUsers_, which are shared from Owner,
// the other kind pendingChildren_, which are shared from another User.
std::unique_lock<std::mutex> lock(mutex_);
Looks like there is nothing that needs to be shared across the three lock-unlock-lock blocks; can we use two lock_guards here to make the code more readable?
sounds good. updating.
// Note, there should be no new forkings in between, because it's assumed that
// this utility is called during graceful shutdown, where no new user RPCs can
// not be initiaited anymore.
auto tempConfirmedUsers = confirmedUsers_;
Since erasing non-existent keys from an unordered_map is fine, can we swap here?
Yeah, we can. It's more performant, like move semantics but bi-directional.
Also, actually no one is expected to insert into confirmedUsers_ again, based on the assumption that this is called during graceful shutdown.
const RRefId& rrefId,
const TypePtr& type) {
  std::lock_guard<std::mutex> lock(mutex_);
  std::unique_lock<std::mutex> lock(mutex_);

  return type_;
}

virtual void tryDel() {}
Let's add comments to explain its behavior.
Update: Use a temporary ref to delay UserRRef destruction until the lock is released.
Added a test case to show the "circular dependency" in #27096.
mrshenli
left a comment
Awesome!! Finally, all tests passed. Thanks a lot for spending so much time on this PR. LGTM! Let's land!
namespace {

constexpr std::chrono::milliseconds kDeleteAllUsersTimeout(100000);
This looks good to me for now. In future PRs, we would probably like to expose this for users to configure?
Yes, I prefer not to expose this in the shutdown API at the moment, because I think users should only care about passing a top-level timeout that covers:
- wait_for_all_workers, to flush any user RPC requests,
- waiting for pending confirmations,
- sending and waiting on delete-user-fork messages,
- waiting for owners to be removed,
- waiting for join().
This will require utils::Future to take a timeout on its wait() function, and require RpcAgent sync()/join() to take a timeout as well.
This pull request has been merged in 4e07c35.
…wner, instead of adding nothing to pending users (#34988) Summary: Pull Request resolved: #34988 In #31893, we introduced a confirmedUsers_ map in RRefContext. For the case where the fork is shared from the owner, there is no `pendingUsers_` intermediate phase for this fork, so we should put it into `confirmedUsers_` immediately. Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork` and `buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork` Differential Revision: D7735909 fbshipit-source-id: 14c36a16486f0cc9618dcfb111fe5223781b647d