[fix] multinode jobs after recent lightning update by apsdehal · Pull Request #921 · facebookresearch/mmf

apsdehal · 2021-05-03T18:16:34Z

After Sasha's update of pytorch lightning on MMF master, it broke MMF codebase for multinode job. The root problem to PR Lightning-AI/pytorch-lightning#6802. The assumption that SLURM_PROCID points to worker rank is wrong as some frameworks launch their own processes later using multiprocessing spawn and have ntasks_per_node=1 set. This means that first node will have procid = 0, second node will have procid = 1 set and so on. Now, since this is used in prepare_data masking in LightningDataModule, this leads to it running on all workers on first node and thus causing inconsistencies. Now, this leads to prepare_data being called on all workers on first node instead of rank zero. Specifically, the barrier call in prepare_data, is called on first node workers but not on others leading to block later on.

This PR fixes this by ensuring on our side that we only call prepare_data on rank zero. Furthermore, this can cause further confusion, we remove sync barrier calls from download as well. Users are now supposed to handle is_master checks on their own.

Test Plan:
Tested in multinode settings.

After Sasha's update of pytorch lightning on MMF master, it broke MMF codebase for multinode job. The root problem to PR Lightning-AI/pytorch-lightning#6802. The assumption that SLURM_PROCID points to worker rank is wrong as some frameworks launch their own processes later using multiprocessing spawn and have ntasks_per_node=1 set. This means that first node will have procid = 0, second node will have procid = 1 set and so on. Now, since this is used in prepare_data masking in LightningDataModule, this leads to it running on all workers on first node and thus causing inconsistencies. Now, this leads to prepare_data being called on all workers on first node instead of rank zero. Specifically, the barrier call in prepare_data, is called on first node workers but not on others leading to block later on. This PR fixes this by ensuring on our side that we only call prepare_data on rank zero. Furthermore, this can cause further confusion, we remove sync barrier calls from download as well. Users are now supposed to handle is_master checks on their own. Test Plan: Tested in multinode settings.

hackgoofer · 2021-05-03T18:34:20Z

Do you mind sharing a repro command that shows the breakage?

apsdehal · 2021-05-03T18:39:52Z

@ytsheng You can try launching any sweep script on two nodes and it will get stuck at Loading model part.

facebook-github-bot · 2021-05-03T18:40:14Z

@apsdehal has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-05-03T22:31:55Z

@apsdehal merged this pull request in 8a01258.

Summary: Followup to facebookresearch#921 for datasets which don't inherit from MMFDatasetBuilder Reviewed By: ytsheng Differential Revision: D28238562 fbshipit-source-id: 4fd609396138893da2cad30ed3cd1d6c5ce847f2

Summary: Pull Request resolved: #928 Followup to #921 for datasets which don't inherit from MMFDatasetBuilder Reviewed By: ytsheng Differential Revision: D28238562 fbshipit-source-id: 2502800fbe72a81b578aacbb38faebb9364c813f

facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label May 3, 2021

apsdehal force-pushed the fix_multi_node branch from ba1afa1 to f1632b8 Compare May 3, 2021 18:18

facebook-github-bot closed this in 8a01258 May 3, 2021

facebook-github-bot added the Merged label May 3, 2021

apsdehal mentioned this pull request May 5, 2021

[fix] Assign config to instance in setup as well #928

Closed

apsdehal deleted the fix_multi_node branch May 12, 2021 18:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] multinode jobs after recent lightning update#921

[fix] multinode jobs after recent lightning update#921
apsdehal wants to merge 1 commit intomasterfrom
fix_multi_node

apsdehal commented May 3, 2021 •

edited

Loading

Uh oh!

hackgoofer commented May 3, 2021

Uh oh!

apsdehal commented May 3, 2021

Uh oh!

facebook-github-bot commented May 3, 2021

Uh oh!

facebook-github-bot commented May 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

apsdehal commented May 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hackgoofer commented May 3, 2021

Uh oh!

apsdehal commented May 3, 2021

Uh oh!

facebook-github-bot commented May 3, 2021

Uh oh!

facebook-github-bot commented May 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

apsdehal commented May 3, 2021 •

edited

Loading