[fix] Add barriers before and after setup hook is run #7202
tchaton merged 6 commits into Lightning-AI:master from
Conversation
Codecov Report

@@           Coverage Diff            @@
##           master    #7202    +/-   ##
========================================
- Coverage      91%      87%      -4%
========================================
  Files         198      198
  Lines       12647    12725     +78
========================================
- Hits        11531    11063    -468
- Misses       1116     1662    +546
Hi, thanks for the patch! It looks sensible, but it doesn't seem to fix the issue on my end. The issue seems to be that at this point, DDP is not yet set up, so the barrier is a no-op. I think the right place to put it would be between environment setup and module setup.
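The no-op behavior described above can be made concrete. A minimal sketch (a hypothetical helper, not Lightning's actual code): `torch.distributed.barrier()` only synchronizes once the process group exists, so a guard like this silently skips the call if it runs before DDP setup:

```python
import torch.distributed as dist

def safe_barrier() -> None:
    """Hypothetical guarded barrier, illustrating the no-op case."""
    # Before init_process_group() has run, there is no process group to
    # synchronize against, so the guard makes an early call an explicit no-op.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```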
@awaelchli @tchaton this feels like a bug: prepare_data can be called without other ranks waiting for it to finish, especially if setup depends on the results of prepare_data being available. How can we conditionally move the process group initialization sooner so that we can use
Evpok left a comment:
Don't you still need a barrier before running setup()?
Only the barrier before the setup hook can solve the reported issue, right? The barrier after setup may be optional.
Yes you're right. Ideally we have a barrier immediately after prepare_data, not just before setup
The barrier here could be helpful in DDP if users define layers in `setup`.
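For context, a minimal sketch (plain PyTorch, hypothetical names) of why that matters: layers created lazily in a `setup`-style hook must exist on every rank before `DistributedDataParallel` wraps the module, since DDP registers and broadcasts parameters at construction time:

```python
import torch.nn as nn

class Classifier(nn.Module):
    """Hypothetical module that defers layer creation to a setup hook."""

    def __init__(self) -> None:
        super().__init__()
        self.head = None  # created later, once the data shape is known

    def setup(self, num_features: int, num_classes: int) -> None:
        # In Lightning, a hook like this might size layers from data made
        # available by prepare_data. Every rank must finish it before the
        # DDP wrap, or ranks would hold mismatched parameter sets.
        self.head = nn.Linear(num_features, num_classes)

model = Classifier()
model.setup(num_features=128, num_classes=10)
# A barrier here ensures all ranks reach this point before
# DistributedDataParallel(model) is constructed.
```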
IMO inside each connector. This way the trainer code is a tiny bit easier to read :)
@Evpok mind taking another look to see if this would mitigate/solve the issue for you?
@ananthsub Yes it does, thank you for that!
Careful! You merged with TPU tests failing. We probably need to add an additional check in the barrier for the TPU plugin.
What does this PR do?
Reported on slack:
Currently we delay the process group initialization until `accelerator.setup_environment`. `prepare_data` is called before that point, which works fine when Lightning launches subprocesses/spawns the trainer processes for distributed training, but not when the user has launched training outside of the trainer. This adds a barrier immediately before and after the setup hook is called.
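The ordering this aims for can be sketched as follows (a hypothetical driver with assumed names, not the actual Trainer code): `prepare_data` runs on rank 0 only, and barriers fence the setup hook so no rank races ahead of the others:

```python
import torch.distributed as dist

def run_data_hooks(module, is_global_zero: bool, stage: str = "fit") -> None:
    """Hypothetical driver showing where the barriers belong."""
    if is_global_zero:
        module.prepare_data()  # e.g. download/tokenize the dataset once
    if dist.is_available() and dist.is_initialized():
        dist.barrier()  # other ranks wait for rank 0's prepare_data
    module.setup(stage)
    if dist.is_available() and dist.is_initialized():
        dist.barrier()  # every rank finishes setup before the DDP wrap
```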
We indirectly address this by adding barriers before `setup` runs (to hedge against the `prepare_data` case) as well as after the `setup` hooks run, in case the model defines layers inside of `setup`, in order to synchronize before the module is wrapped in DDP.

Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃