This repository was archived by the owner on Nov 3, 2023. It is now read-only.
Merged

Conversation

bveeramani (Member) reviewed on Apr 19, 2022:
Had a question. Otherwise looks good to me.
ray_lightning/tests/test_ddp.py (outdated)
def test_test_multiple_dataloader_workers(tmpdir, ray_start_2_cpus, seed):
    """Tests trainer.test with multiple workers for data loading."""
bveeramani (Member):
I'm confused by this test. It seems like BoringModel uses the test_dataloader, which uses 1 worker, but the test docstring says that multiple workers are used for data loading. What am I missing?
Author (Collaborator):
Ah, sorry: it should say ">0 workers", not "multiple workers". Will update the docstring.
Closes #132
Fixes an issue where `trainer.test()` hangs when using multiple workers in the test DataLoader.

This issue is a bit odd, as I was only able to reproduce the hang with that exact setup. It does not occur with `trainer.train()` and multiple workers in the training DataLoader, and it also does not occur when I set `num_workers` for the test DataLoader to 0.

I'm not exactly sure what's going on, but the testing actually finishes and the program hangs at `torch.distributed.destroy_process_group()`. It may be related to pytorch/pytorch#75097. The difference between `trainer.train()` and `trainer.test()` is that the former wraps the model in DDP while the latter doesn't (but still creates the process group).

In any case, the `shutdown_remote` cleanup code is not actually necessary: the CUDA cache cleanup is already called in the parent `DDPSpawnPlugin` on each worker, and `torch.distributed.destroy_process_group()` is not a public API (nor is it called by PyTorch Lightning).

The test added in this PR hangs prior to the changes to `ray_ddp.py` and passes after them.
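For context, here is a minimal plain-PyTorch sketch (not the repository's test, which uses PyTorch Lightning's BoringModel and the Ray plugin) of what a multi-worker test DataLoader looks like; the hang described above only surfaced when `num_workers > 0` was used for the test-time loader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical standalone dataset, standing in for BoringModel's data.
dataset = TensorDataset(torch.randn(64, 32))

# num_workers > 0 spawns subprocesses for data loading; this is the
# setting that triggered the reported hang during trainer.test().
loader = DataLoader(dataset, batch_size=8, num_workers=2)

# Iterating the loader drives the worker processes; 64 samples at
# batch size 8 yields 8 batches.
batches = list(loader)
print(len(batches))
```

With `num_workers=0` (loading in the main process) the hang did not reproduce, which is what pointed the investigation toward process teardown rather than the data pipeline itself.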