make rpc and dist-autograd multiprocess test to use both fork and spawn#25656
make rpc and dist-autograd multiprocess test to use both fork and spawn#25656zhaojuanmao wants to merge 9 commits intogh/zhaojuanmao/1/basefrom
Conversation
| #!/usr/bin/env python3 | ||
| from __future__ import absolute_import, division, print_function, unicode_literals | ||
|
|
||
| import numpy |
There was a problem hiding this comment.
why do we need this import?
There was a problem hiding this comment.
it tried to see whether it can fix test error about "numpy + intel MKL", but I removed in new diff, will see whether tests passed
| } | ||
|
|
||
| std::vector<char> generatePythonUDFResult( | ||
| const Message& request) { |
There was a problem hiding this comment.
indent, does this fit into the line above?
There was a problem hiding this comment.
ran clang-format in new diff
Pull Request resolved: #25656 1. current fork unit tests did not catch rpc exit crash issue, but spawn caught it. most of use cases also uses spawn, so changing multiprocessing start method from fork to spawn 2. it did not change fork to spawn for c10d tests for now -- if we need to make changes for c10d, it may need another PR as c10d has a lot of unit tests, may need some more changes to make spawn work for test_c10d 3. fix the real segment exit issue caused by python rpc handler temporarily by removing module_ global variable, separate diff D17097999 is to change python rpc handler to be singletion class for deterministic destruction order of variables ghstack-source-id: 89482622 Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
Pull Request resolved: #25656 1. current fork unit tests did not catch rpc exit crash issue, but spawn caught it. most of use cases also uses spawn, so changing multiprocessing start method from fork to spawn 2. it did not change fork to spawn for c10d tests for now -- if we need to make changes for c10d, it may need another PR as c10d has a lot of unit tests, may need some more changes to make spawn work for test_c10d 3. fix the real segment exit issue caused by python rpc handler temporarily by removing module_ global variable, separate diff D17097999 is to change python rpc handler to be singletion class for deterministic destruction order of variables ghstack-source-id: 89487514 Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
Pull Request resolved: #25656 1. current fork unit tests did not catch rpc exit crash issue, but spawn caught it. most of use cases also uses spawn, so changing multiprocessing start method from fork to spawn 2. it did not change fork to spawn for c10d tests for now -- if we need to make changes for c10d, it may need another PR as c10d has a lot of unit tests, may need some more changes to make spawn work for test_c10d 3. fix the real segment exit issue caused by python rpc handler temporarily by removing module_ global variable, separate diff D17097999 is to change python rpc handler to be singletion class for deterministic destruction order of variables ghstack-source-id: 89508433 Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
| self.rank = self.MAIN_PROCESS_RANK | ||
| self.file = tempfile.NamedTemporaryFile(delete=False) | ||
| self.processes = [self._spawn_process(rank) for rank in range(int(self.world_size))] | ||
| if self.rank == self.MAIN_PROCESS_RANK: |
There was a problem hiding this comment.
Why do we need this change? Isn't the setUp method run only on the parent process?
There was a problem hiding this comment.
this was added to check whether setup() is related after tests failures, it did not help, but I think it is no harm to keep there, if child process want to set up same variables as parent in the future, they can call setup() function
| # We're retrieving a corresponding test and executing it. | ||
| getattr(self, test_name)() | ||
| if multiprocessing.get_start_method() != "spawn": | ||
| sys.exit(0) |
There was a problem hiding this comment.
Doesn't this affect all c10d tests? Also, why are we checking for this after the test is done instead of before it?
There was a problem hiding this comment.
c10d is not spawn, so it is not affected. we just change behavior for rpc, when it is spawn, we do not need to exit() here explicitly, as @xush6528 mentioned sys.exit(0) here is mainly for fork processes to avoid run teardown().
| sys.version_info < (3, 0), | ||
| "Pytorch distributed rpc package " "does not support python2", | ||
| ) | ||
| @unittest.skipIf(C.is_asan, "Skip ASAN since torch + multiprocessing + fbcode doesn't work with asan") |
There was a problem hiding this comment.
don't mention fbcode in OSS code.
| super(RpcTest, cls). setUpClass() | ||
| try: | ||
| multiprocessing.set_start_method("spawn") | ||
| except Exception: |
There was a problem hiding this comment.
Why do we ignore this exception and continue? We should throw this exception and fail the test.
There was a problem hiding this comment.
for some env, if there is no spawn, at least we should run fork?
| self.rank = rank | ||
| self.file_name = file_name | ||
|
|
||
| self.setUp() |
There was a problem hiding this comment.
Why do we need children to call self.setUp()? What was missing?
|
although tests passed locally, CI still failed with two failures: 1) asan test, although I skipped asan 2) Numpy + Intel MKL incompatibility error Tried to add some fast guessing changes (e.g., add setup() for child process as suspecting dependencies are not loaded properly; play with sys.exit(0) as suspecting process clean up has something wrong), but both of them did not work. so will try following things:
|
|
@zhaojuanmao Btw -- on the previous version of this PR I left a comment regarding the runtime. Did you happen to check how much impact the spawn start method has on the total runtime of the distributed tests? |
|
@pietern spawn is much slower, 28.67s vs 8.73s for test_rpc. |
|
@zhaojuanmao Thanks for checking. We should investigate if we can't switch to a mode where we launch processes once and reuse them across tests. There are downsides to doing that as well, but with such a runtime it's worth exploring in my opinion. |
Pull Request resolved: #25656 1. current fork unit tests did not catch rpc exit crash issue, but spawn caught it. most of use cases also uses spawn, so changing multiprocessing start method from fork to spawn 2. it did not change fork to spawn for c10d tests for now -- if we need to make changes for c10d, it may need another PR as c10d has a lot of unit tests, may need some more changes to make spawn work for test_c10d 3. fix the real segment exit issue caused by python rpc handler temporarily by removing module_ global variable, separate diff D17097999 is to change python rpc handler to be singletion class for deterministic destruction order of variables ghstack-source-id: 89587243 Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
|
I tried to use torch.multiprocessing.spawn() in another PR for my investigation purpose, and all tests passed! indeed, for torch.multiprocessing, both fork and spawn methods caught crash issue and run almost the same amount of time for test_rpc (28s). native multiprocessing fork run only 8s, but it did not catch the crash issue -- which means something is wrong. So I'm preferring to go ahead with torch.multiprocessing implementation in this diff. Is 28s acceptable? I guess it should be fine now, spawn multi processes for each test case feel like it is more clean set up. The purpose of unit tests is to test correctness, so it should be fine to sacrifice some performance? in this diff, make dist_autograd to use spawn as well. tried c10d tests, most of them works fine, a few ones are a little flaky some how, will address it in separate diff |
It is OK to me for now, as |
mrshenli
left a comment
There was a problem hiding this comment.
Do we have any new tests to make sure that spawn indeed catch crashes?
| return TIMEOUT_OVERRIDE.get(test_id.split(".")[-1], TIMEOUT_DEFAULT) | ||
|
|
||
|
|
||
| class MultiProcessTestCase(TestCase): |
There was a problem hiding this comment.
Let's add some comments here to explain the difference between this MultiProcessTestCase and the one in common_distributed.py. It might be helpful to give them different names.
| } | ||
|
|
||
|
|
||
| def skip_if_not_multigpu(func): |
There was a problem hiding this comment.
Can we import the one from common_distributed, or is this one different than that?
| return wrapper | ||
|
|
||
|
|
||
| def skip_if_lt_x_gpu(x): |
There was a problem hiding this comment.
ditto, and for all the functions below.
|
@mrshenli it is very hard to find another example to show this diff catch an issue that has not been caught previously right now, but we've already have one example, is not it? eventually common_distributed.py will be killed after test_c10d.py starts to use the new multiprocessing_test_case.py, so planning not to let them share same codes, will add another PR for test_c10d.py to use multiprocessing_test_case.py |
|
@pietern regarding runtime concerns, you mentioned "We should investigate if we can't switch to a mode where we launch processes once and reuse them across tests. There are downsides to doing that as well" would you please elaborate more about your idea? how could it be possible to insert each test into launched processes? you mentioned there are downsides, what kind of downsides in your mind? for unit tests, I guess runtime is not most concerned, correctness is more important? |
Just want to make sure I understand the plan here. By "will be killed" do you mean eventually, we will still have a file named |
|
@mrshenli multiprocessing_test_case.py has all the utilities and decorators that are same as common_distributed.py, it copies from common_distributed.py with different multiProcessTestCase class implementation. Do you prefer to still name "multiprocessing_test_case.py " as "common_distributed.py"? |
I feel things like |
Pull Request resolved: #25656 spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests ghstack-source-id: 91037514 Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
…ork and spawn" spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/) [ghstack-poisoned]
…ork and spawn" spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/) [ghstack-poisoned]
Pull Request resolved: #25656 spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests ghstack-source-id: 91076542 Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
I think the above test failure is because we don't build rpc code for windows (see this line). Let's skip RPC tests for windows too (by adding it to the windows black list). |
…ork and spawn" spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/) [ghstack-poisoned]
Pull Request resolved: #25656 spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests ghstack-source-id: 91083537 Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
|
@pytorchbot retest this please |
|
landing the diff, the failure is not related |
…ork and spawn" spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/) [ghstack-poisoned]
Pull Request resolved: #25656 spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests ghstack-source-id: 91096705 Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
|
@pytorchbot retest this please |
|
Hi @zhaojuanmao it seems this pr breaks pytorch master ci test, wonder if you are fixing it, or should I revert it? |
|
@zrphercule we are landing #27158 to disable the flaky test now. Please don't revert. |
|
#27158 is landed |
|
This pull request has been merged in 1d2d59d. |
…wn (pytorch#25656) Summary: Pull Request resolved: pytorch#25656 spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests ghstack-source-id: 91096705 Test Plan: unit tests Reviewed By: xush6528 Differential Revision: D17086007 fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922
…wn (pytorch#25656) Summary: Pull Request resolved: pytorch#25656 spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests ghstack-source-id: 91096705 Test Plan: unit tests Reviewed By: xush6528 Differential Revision: D17086007 fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922
Summary: pietern discovered that `test_invalid_names` is flaky on master. pytorch#25656 is potentially the fix. Disable this test for now and will try to add it again when pytorch#25656 is in. Pull Request resolved: pytorch#25916 Differential Revision: D17287496 Pulled By: mrshenli fbshipit-source-id: 9313958d3480c2bab20cd2341837c7821e3bb1b5
…wn (pytorch#25656) Summary: Pull Request resolved: pytorch#25656 spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests ghstack-source-id: 91096705 Test Plan: unit tests Reviewed By: xush6528 Differential Revision: D17086007 fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922
Stack from ghstack:
spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now.
so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests
Differential Revision: D17086007