This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Fix hanging trainer.test() #142

Merged

amogkam merged 14 commits into ray-project:main from amogkam:fix-hanging-test on Apr 19, 2022

Conversation

@amogkam amogkam (Collaborator) commented Apr 19, 2022

Closes #132

Fixes an issue where trainer.test() hangs when using multiple workers in the test DataLoader.

This issue is a bit strange, as I was only able to reproduce the hang with that exact setup. It does not occur with trainer.train() and multiple workers in the training DataLoader, nor when num_workers for the test DataLoader is set to 0.

I'm not exactly sure what's going on, but testing actually finishes and the program hangs at torch.distributed.destroy_process_group(). It may be related to pytorch/pytorch#75097. The difference between trainer.train() and trainer.test() is that the former wraps the model in DDP while the latter doesn't (but still creates the process group).

In any case, the shutdown_remote cleanup code is not actually necessary: the CUDA cache cleanup is already called by the parent DDPSpawnPlugin on each worker, and torch.distributed.destroy_process_group() is not a public API (and is not called by PyTorch Lightning either).
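A minimal sketch of the shape of the change, using a simplified stand-in for the plugin (the class name, method names, and call tracking below are illustrative assumptions, not the actual ray_ddp.py source):

```python
# Illustrative stand-in for the DDP plugin touched in this PR.
# Class and method names are assumptions, not the real ray_ddp.py code.

class FakeRayDDPPlugin:
    def __init__(self):
        self.calls = []

    def _empty_cuda_cache(self):
        # In the real code, the parent DDPSpawnPlugin already performs
        # this CUDA cache cleanup on each worker.
        self.calls.append("empty_cuda_cache")

    def _destroy_process_group(self):
        # Stand-in for torch.distributed.destroy_process_group(), the
        # call at which trainer.test() was observed to hang.
        self.calls.append("destroy_process_group")

    def shutdown_remote(self):
        # After the fix: only the cache cleanup remains. The explicit
        # destroy_process_group() call is dropped, since it is not a
        # public API and PyTorch Lightning does not call it either.
        self._empty_cuda_cache()


plugin = FakeRayDDPPlugin()
plugin.shutdown_remote()
print(plugin.calls)  # ['empty_cuda_cache']
```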

The test added in this PR hangs prior to the changes to ray_ddp.py, and passes after them.

@amogkam amogkam changed the title from "Fix hanging test" to "Fix hanging trainer.test()" on Apr 19, 2022
@bveeramani bveeramani (Member) left a comment

Had a question. Otherwise looks good to me.



def test_test_multiple_dataloader_workers(tmpdir, ray_start_2_cpus, seed):
"""Tests trainer.test with multiple workers for data loading."""
bveeramani (Member) commented:

I'm confused by this test. It seems like BoringModel uses the test_dataloader, which uses 1 worker. But, the test docstring says that multiple workers are used for data loading.

What am I missing?

amogkam (Collaborator, Author) replied:

Ah, sorry, it should be >0 workers, not "multiple workers". Will update the comment.

@bveeramani bveeramani (Member) left a comment

LGTM

@amogkam amogkam merged commit d5f325a into ray-project:main Apr 19, 2022

Labels: None yet

Projects: None yet

Development: Successfully merging this pull request may close the issue "Tune.run() doesn't finish with trainer.test() call."

3 participants