
[fix] multinode jobs after recent lightning update#921

Closed
apsdehal wants to merge 1 commit into master from fix_multi_node

Conversation

@apsdehal
Contributor

@apsdehal apsdehal commented May 3, 2021

After Sasha's update of PyTorch Lightning on MMF master, multinode jobs in MMF broke. The root cause is PR Lightning-AI/pytorch-lightning#6802, which assumes that SLURM_PROCID is the worker rank. That assumption is wrong for frameworks that set ntasks_per_node=1 and launch their own worker processes later via multiprocessing spawn: the first node gets procid = 0, the second node gets procid = 1, and so on, so SLURM_PROCID identifies the node rather than the worker. Since Lightning uses this value to mask prepare_data in LightningDataModule, prepare_data ends up running on every worker on the first node instead of only on rank zero. In particular, the barrier call inside prepare_data is reached by the first node's workers but not by the others, so the job blocks later on.
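The rank mixup described above can be illustrated with a minimal sketch (a hypothetical helper, not MMF's or Lightning's actual code): with ntasks_per_node=1, SLURM_PROCID is the node index, and the true global rank has to be derived from it together with the locally spawned worker's index.

```python
import os

def global_rank(local_rank: int, workers_per_node: int) -> int:
    """Derive the global worker rank under ntasks_per_node=1.

    With one SLURM task per node, every worker spawned on node k via
    multiprocessing inherits SLURM_PROCID=k. Treating SLURM_PROCID as
    the global rank therefore makes *all* workers on node 0 look like
    rank zero. The actual rank is node_index * workers_per_node plus
    the spawned process's local index.
    """
    node_index = int(os.environ.get("SLURM_PROCID", "0"))
    return node_index * workers_per_node + local_rank
```

Under this scheme only (node 0, local worker 0) is global rank zero, whereas the raw SLURM_PROCID check would pass for every worker on node 0.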

This PR fixes this by ensuring on our side that prepare_data is called only on rank zero. Since the sync barriers can cause similar confusion, we also remove the barrier calls from download; users are now expected to handle is_master checks on their own.
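The shape of the fix can be sketched as follows (illustrative names, not MMF's actual API): gate the prepare step on the true global rank and leave synchronization to the caller.

```python
def maybe_prepare_data(global_rank: int, prepare_fn) -> bool:
    """Run the download-heavy prepare step only on global rank zero.

    Deliberately no barrier here: a barrier inside prepare_data
    deadlocks when only some workers enter it, which is exactly the
    failure mode this PR fixes. Callers synchronize afterwards on
    their own (e.g. an is_master check followed by a barrier that
    every worker reaches).
    Returns True if the prepare step actually ran on this worker.
    """
    if global_rank == 0:
        prepare_fn()
        return True
    return False
```

Moving the barrier out of the guarded region means every worker either reaches the caller-side synchronization point or none do, so no subset of workers can be left waiting.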

Test Plan:
Tested in multinode settings.

@facebook-github-bot facebook-github-bot added the CLA Signed label May 3, 2021
@hackgoofer
Contributor

Do you mind sharing a repro command that shows the breakage?

@apsdehal
Contributor Author

apsdehal commented May 3, 2021

@ytsheng You can try launching any sweep script on two nodes; it will get stuck at the "Loading model" step.

@facebook-github-bot
Contributor

@apsdehal has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@apsdehal merged this pull request in 8a01258.

apsdehal added a commit to apsdehal/mmf that referenced this pull request May 5, 2021
Summary: Followup to facebookresearch#921 for datasets which don't inherit from MMFDatasetBuilder

Reviewed By: ytsheng

Differential Revision: D28238562

fbshipit-source-id: 4fd609396138893da2cad30ed3cd1d6c5ce847f2
facebook-github-bot pushed a commit that referenced this pull request May 6, 2021
Summary:
Pull Request resolved: #928

Followup to #921 for datasets which don't inherit from MMFDatasetBuilder

Reviewed By: ytsheng

Differential Revision: D28238562

fbshipit-source-id: 2502800fbe72a81b578aacbb38faebb9364c813f
@apsdehal apsdehal deleted the fix_multi_node branch May 12, 2021 18:54