Skip to content

make rpc and dist-autograd multiprocess test to use both fork and spawn#25656

Closed
zhaojuanmao wants to merge 9 commits intogh/zhaojuanmao/1/basefrom
gh/zhaojuanmao/1/head
Closed

make rpc and dist-autograd multiprocess test to use both fork and spawn#25656
zhaojuanmao wants to merge 9 commits intogh/zhaojuanmao/1/basefrom
gh/zhaojuanmao/1/head

Conversation

@zhaojuanmao
Copy link
Copy Markdown
Contributor

@zhaojuanmao zhaojuanmao commented Sep 4, 2019

Stack from ghstack:

spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests

Differential Revision: D17086007

@pytorchbot pytorchbot added oncall: distributed Add this issue/PR to distributed oncall triage queue module: pybind Related to our Python bindings / interactions with other Python libraries labels Sep 4, 2019
@zhaojuanmao zhaojuanmao requested a review from xush6528 September 4, 2019 18:14
Comment thread test/test_rpc.py Outdated
#!/usr/bin/env python3
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this import?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it tried to see whether it can fix test error about "numpy + intel MKL", but I removed in new diff, will see whether tests passed

}

std::vector<char> generatePythonUDFResult(
const Message& request) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent, does this fit into the line above?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ran clang-format in new diff

zhaojuanmao added a commit that referenced this pull request Sep 4, 2019
Pull Request resolved: #25656

1. current fork unit tests did not catch rpc exit crash issue, but spawn caught it. most of use cases also uses spawn,
so changing multiprocessing start method from fork to spawn

2. it did not change fork to spawn for c10d tests for now -- if we need to make changes for c10d, it may need another PR as c10d has a lot of unit tests, may need some more changes to make spawn work for test_c10d

3. fix the real segment exit issue caused by python rpc handler temporarily by removing module_ global variable, separate diff D17097999 is to change python rpc handler to be singletion class for deterministic destruction order of variables
ghstack-source-id: 89482622

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
zhaojuanmao added a commit that referenced this pull request Sep 4, 2019
Pull Request resolved: #25656

1. current fork unit tests did not catch rpc exit crash issue, but spawn caught it. most of use cases also uses spawn,
so changing multiprocessing start method from fork to spawn

2. it did not change fork to spawn for c10d tests for now -- if we need to make changes for c10d, it may need another PR as c10d has a lot of unit tests, may need some more changes to make spawn work for test_c10d

3. fix the real segment exit issue caused by python rpc handler temporarily by removing module_ global variable, separate diff D17097999 is to change python rpc handler to be singletion class for deterministic destruction order of variables
ghstack-source-id: 89487514

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
zhaojuanmao added a commit that referenced this pull request Sep 4, 2019
Pull Request resolved: #25656

1. current fork unit tests did not catch rpc exit crash issue, but spawn caught it. most of use cases also uses spawn,
so changing multiprocessing start method from fork to spawn

2. it did not change fork to spawn for c10d tests for now -- if we need to make changes for c10d, it may need another PR as c10d has a lot of unit tests, may need some more changes to make spawn work for test_c10d

3. fix the real segment exit issue caused by python rpc handler temporarily by removing module_ global variable, separate diff D17097999 is to change python rpc handler to be singletion class for deterministic destruction order of variables
ghstack-source-id: 89508433

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
Comment thread test/common_distributed.py Outdated
self.rank = self.MAIN_PROCESS_RANK
self.file = tempfile.NamedTemporaryFile(delete=False)
self.processes = [self._spawn_process(rank) for rank in range(int(self.world_size))]
if self.rank == self.MAIN_PROCESS_RANK:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this change? Isn't the setUp method run only on the parent process?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was added to check whether setup() is related after tests failures, it did not help, but I think it is no harm to keep there, if child process want to set up same variables as parent in the future, they can call setup() function

Comment thread test/common_distributed.py Outdated
# We're retrieving a corresponding test and executing it.
getattr(self, test_name)()
if multiprocessing.get_start_method() != "spawn":
sys.exit(0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this affect all c10d tests? Also, why are we checking for this after the test is done instead of before it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c10d is not spawn, so it is not affected. we just change behavior for rpc, when it is spawn, we do not need to exit() here explicitly, as @xush6528 mentioned sys.exit(0) here is mainly for fork processes to avoid run teardown().

Comment thread test/test_rpc.py Outdated
sys.version_info < (3, 0),
"Pytorch distributed rpc package " "does not support python2",
)
@unittest.skipIf(C.is_asan, "Skip ASAN since torch + multiprocessing + fbcode doesn't work with asan")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't mention fbcode in OSS code.

Comment thread test/test_rpc.py Outdated
super(RpcTest, cls). setUpClass()
try:
multiprocessing.set_start_method("spawn")
except Exception:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we ignore this exception and continue? We should throw this exception and fail the test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for some env, if there is no spawn, at least we should run fork?

Comment thread test/common_distributed.py Outdated
self.rank = rank
self.file_name = file_name

self.setUp()
Copy link
Copy Markdown
Contributor

@xush6528 xush6528 Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need children to call self.setUp()? What was missing?

@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

although tests passed locally, CI still failed with two failures: 1) asan test, although I skipped asan 2) Numpy + Intel MKL incompatibility error

Tried to add some fast guessing changes (e.g., add setup() for child process as suspecting dependencies are not loaded properly; play with sys.exit(0) as suspecting process clean up has something wrong), but both of them did not work.

so will try following things:

  1. download docker file to debug and try to fix for existing implementation, as it looks like it is close to finish
  2. if it is found that it is hard to fix, will try to use torch.multiprocessing.spawn() - maybe it is better?

@pietern
Copy link
Copy Markdown
Contributor

pietern commented Sep 5, 2019

@zhaojuanmao torch.multiprocessing.spawn is just a wrapper for doing multiprocessing.Process and joining them once done. It won't address more fundamental problems.

Btw -- on the previous version of this PR I left a comment regarding the runtime. Did you happen to check how much impact the spawn start method has on the total runtime of the distributed tests?

@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

@pietern spawn is much slower, 28.67s vs 8.73s for test_rpc.

@pietern
Copy link
Copy Markdown
Contributor

pietern commented Sep 5, 2019

@zhaojuanmao Thanks for checking. We should investigate if we can't switch to a mode where we launch processes once and reuse them across tests. There are downsides to doing that as well, but with such a runtime it's worth exploring in my opinion.

zhaojuanmao added a commit that referenced this pull request Sep 5, 2019
Pull Request resolved: #25656

1. current fork unit tests did not catch rpc exit crash issue, but spawn caught it. most of use cases also uses spawn,
so changing multiprocessing start method from fork to spawn

2. it did not change fork to spawn for c10d tests for now -- if we need to make changes for c10d, it may need another PR as c10d has a lot of unit tests, may need some more changes to make spawn work for test_c10d

3. fix the real segment exit issue caused by python rpc handler temporarily by removing module_ global variable, separate diff D17097999 is to change python rpc handler to be singletion class for deterministic destruction order of variables
ghstack-source-id: 89587243

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

I tried to use torch.multiprocessing.spawn() in another PR for my investigation purpose, and all tests passed!

indeed, for torch.multiprocessing, both fork and spawn methods caught crash issue and run almost the same amount of time for test_rpc (28s). native multiprocessing fork run only 8s, but it did not catch the crash issue -- which means something is wrong.

So I'm preferring to go ahead with torch.multiprocessing implementation in this diff. Is 28s acceptable? I guess it should be fine now, spawn multi processes for each test case feel like it is more clean set up. The purpose of unit tests is to test correctness, so it should be fine to sacrifice some performance?

in this diff, make dist_autograd to use spawn as well.

tried c10d tests, most of them works fine, a few ones are a little flaky some how, will address it in separate diff

@mrshenli
Copy link
Copy Markdown
Contributor

mrshenli commented Sep 6, 2019

Is 28s acceptable?

It is OK to me for now, as test_rpc.py does not have too many tests and is unlikely to grow super large very soon, but let's create an issue to track this. Let's get @pietern's input on this as well.

Copy link
Copy Markdown
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any new tests to make sure that spawn indeed catch crashes?

Comment thread test/multiprocessing_test_case.py Outdated
return TIMEOUT_OVERRIDE.get(test_id.split(".")[-1], TIMEOUT_DEFAULT)


class MultiProcessTestCase(TestCase):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some comments here to explain the difference between this MultiProcessTestCase and the one in common_distributed.py. It might be helpful to give them different names.

Comment thread test/multiprocessing_test_case.py Outdated
}


def skip_if_not_multigpu(func):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we import the one from common_distributed, or is this one different than that?

Comment thread test/multiprocessing_test_case.py Outdated
return wrapper


def skip_if_lt_x_gpu(x):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, and for all the functions below.

Comment thread torch/csrc/distributed/rpc/python_rpc_handler.cpp Outdated
@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

zhaojuanmao commented Sep 6, 2019

@mrshenli it is very hard to find another example to show this diff catch an issue that has not been caught previously right now, but we've already have one example, is not it?

eventually common_distributed.py will be killed after test_c10d.py starts to use the new multiprocessing_test_case.py, so planning not to let them share same codes, will add another PR for test_c10d.py to use multiprocessing_test_case.py

@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

@pietern regarding runtime concerns, you mentioned "We should investigate if we can't switch to a mode where we launch processes once and reuse them across tests. There are downsides to doing that as well"

would you please elaborate more about your idea? how could it be possible to insert each test into launched processes? you mentioned there are downsides, what kind of downsides in your mind?

for unit tests, I guess runtime is not most concerned, correctness is more important?

@mrshenli
Copy link
Copy Markdown
Contributor

mrshenli commented Sep 6, 2019

eventually common_distributed.py will be killed after test_c10d.py starts to use the new multiprocessing_test_case.py

Just want to make sure I understand the plan here. By "will be killed" do you mean eventually, we will still have a file named common_distributed.py, but the MultiProcessTestCase will be replaced by the one implemented here? Otherwise, if the plan is to keep MultiProcessTestCase in a separate file named multiprocessing_test_case.py, we might still need a common_distributed.py file to keep utility functions and decorators.

@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

@mrshenli multiprocessing_test_case.py has all the utilities and decorators that are same as common_distributed.py, it copies from common_distributed.py with different multiProcessTestCase class implementation.

Do you prefer to still name "multiprocessing_test_case.py " as "common_distributed.py"?

@mrshenli
Copy link
Copy Markdown
Contributor

mrshenli commented Sep 6, 2019

Do you prefer to still name "multiprocessing_test_case.py " as "common_distributed.py"?

I feel things like skip_if_lt_x_gpu and requires_gloo should be independent concepts of multiprocessing_test_case. "common_distributed.py" sounds like a more proper place to hold those utility functions.

zhaojuanmao added a commit that referenced this pull request Sep 30, 2019
Pull Request resolved: #25656

spawn multiprocessing can catch some issues that fork multiprocessing can not
catch, meanwhile fork can work properly with asan tests, but spawn
multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in
multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run
both spawn and fork tests
ghstack-source-id: 91037514

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
…ork and spawn"


spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. 

so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)

[ghstack-poisoned]
…ork and spawn"


spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. 

so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 1, 2019
Pull Request resolved: #25656

spawn multiprocessing can catch some issues that fork multiprocessing can not
catch, meanwhile fork can work properly with asan tests, but spawn
multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in
multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run
both spawn and fork tests
ghstack-source-id: 91076542

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
@mrshenli
Copy link
Copy Markdown
Contributor

mrshenli commented Oct 1, 2019

02:34:22 Running test_rpc_fork ... [2019-10-01 02:34:23.013131]
02:34:23 Traceback (most recent call last):
02:34:23   File "test_rpc_fork.py", line 4, in <module>
02:34:23     from rpc_test import RpcTest
02:34:23   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\test\rpc_test.py", line 11, in <module>
02:34:23     from torch.distributed.rpc import RpcBackend
02:34:23   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\build\win_tmp\build\torch\distributed\rpc.py", line 3, in <module>
02:34:23     from . import invoke_rpc_builtin, invoke_rpc_python_udf, invoke_remote_builtin
02:34:23 ImportError: cannot import name 'invoke_rpc_builtin'
02:34:23 Traceback (most recent call last):
02:34:23   File "run_test.py", line 442, in <module>
02:34:23     main()
02:34:23   File "run_test.py", line 434, in main
02:34:23     raise RuntimeError(message)
02:34:23 RuntimeError: test_rpc_fork failed!

I think the above test failure is because we don't build rpc code for windows (see this line). Let's skip RPC tests for windows too (by adding it to the windows black list).

…ork and spawn"


spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. 

so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 1, 2019
Pull Request resolved: #25656

spawn multiprocessing can catch some issues that fork multiprocessing can not
catch, meanwhile fork can work properly with asan tests, but spawn
multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in
multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run
both spawn and fork tests
ghstack-source-id: 91083537

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

@pytorchbot retest this please

@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

landing the diff, the failure is not related

…ork and spawn"


spawn multiprocessing can catch some issues that fork multiprocessing can not catch, meanwhile fork can work properly with asan tests, but spawn multiprocessing can not work with asan tests for some use cases right now. 

so this diff adding support to launch both spawn and fork tests in multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run both spawn and fork tests

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 1, 2019
Pull Request resolved: #25656

spawn multiprocessing can catch some issues that fork multiprocessing can not
catch, meanwhile fork can work properly with asan tests, but spawn
multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in
multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run
both spawn and fork tests
ghstack-source-id: 91096705

Differential Revision: [D17086007](https://our.internmc.facebook.com/intern/diff/D17086007/)
@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

@pytorchbot retest this please

@zrphercule
Copy link
Copy Markdown
Contributor

Hi @zhaojuanmao it seems this pr breaks pytorch master ci test, wonder if you are fixing it, or should I revert it?

@mrshenli
Copy link
Copy Markdown
Contributor

mrshenli commented Oct 1, 2019

@zrphercule we are landing #27158 to disable the flaky test now. Please don't revert.

@mrshenli
Copy link
Copy Markdown
Contributor

mrshenli commented Oct 1, 2019

#27158 is landed

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in 1d2d59d.

@facebook-github-bot facebook-github-bot deleted the gh/zhaojuanmao/1/head branch October 28, 2019 22:23
pdlive215 pushed a commit to pdlive215/pytorch that referenced this pull request Nov 27, 2019
…wn (pytorch#25656)

Summary:
Pull Request resolved: pytorch#25656

spawn multiprocessing can catch some issues that fork multiprocessing can not
catch, meanwhile fork can work properly with asan tests, but spawn
multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in
multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run
both spawn and fork tests
ghstack-source-id: 91096705

Test Plan: unit tests

Reviewed By: xush6528

Differential Revision: D17086007

fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
…wn (pytorch#25656)

Summary:
Pull Request resolved: pytorch#25656

spawn multiprocessing can catch some issues that fork multiprocessing can not
catch, meanwhile fork can work properly with asan tests, but spawn
multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in
multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run
both spawn and fork tests
ghstack-source-id: 91096705

Test Plan: unit tests

Reviewed By: xush6528

Differential Revision: D17086007

fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
pietern discovered that `test_invalid_names` is flaky on master. pytorch#25656 is potentially the fix. Disable this test for now and will try to add it again when pytorch#25656 is in.
Pull Request resolved: pytorch#25916

Differential Revision: D17287496

Pulled By: mrshenli

fbshipit-source-id: 9313958d3480c2bab20cd2341837c7821e3bb1b5
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…wn (pytorch#25656)

Summary:
Pull Request resolved: pytorch#25656

spawn multiprocessing can catch some issues that fork multiprocessing can not
catch, meanwhile fork can work properly with asan tests, but spawn
multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in
multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run
both spawn and fork tests
ghstack-source-id: 91096705

Test Plan: unit tests

Reviewed By: xush6528

Differential Revision: D17086007

fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged module: pybind Related to our Python bindings / interactions with other Python libraries module: tests Issues related to tests (not the torch.testing module) oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants