@edknv thanks for the quick fix. I have just one comment. This doc says that if you use torchrun, you should change your training script to read from the LOCAL_RANK environment variable, as demonstrated by the following code snippet:
import os
local_rank = int(os.environ["LOCAL_RANK"])
What do you think? Does it make a big difference or not?
In our case, it doesn't look like it makes a difference. From what I can tell, torchrun seems to make use of the local rank automatically. I think the doc is saying, if you need to use the local_rank variable in your script, use local_rank = int(os.environ["LOCAL_RANK"]). Our script does not make use of this variable, so I didn't include it.
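For a script that does need the local rank, the doc's one-liner could be wrapped roughly as follows. This is only a sketch: `get_local_rank` and its `default` fallback are hypothetical names introduced here, not part of this PR or of the notebook's training script. The fallback is an assumption so that the same script still starts under plain `python`, where `LOCAL_RANK` is not set.

```python
import os
from typing import Mapping


def get_local_rank(env: Mapping[str, str] = os.environ, default: int = 0) -> int:
    """Return the per-node process index that torchrun exports as LOCAL_RANK.

    Falls back to `default` (an assumption of this sketch) when the variable
    is absent, e.g. when the script is started without torchrun.
    """
    return int(env.get("LOCAL_RANK", default))


if __name__ == "__main__":
    local_rank = get_local_rank()
    # A script that places tensors manually could now select its GPU, e.g.
    # torch.cuda.set_device(local_rank) (needs PyTorch and a CUDA device).
    print(f"local_rank={local_rank}")
```

Passing the environment as a parameter keeps the helper easy to check without launching multiple processes.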
Fixes #651
Goals ⚽
Fix the multi-GPU training notebook.
Implementation Details 🚧
- `device` is set identical to `local_rank`.
- Without dropping the last batch (`dataloader_drop_last=True`), `recsys_trainer.evaluate` hangs. Probably need to investigate this because this didn't happen before the list column refactoring in merlin-dataloader (see ticket).
- `torch.distributed.launch` is replaced with `torchrun` because the former has been deprecated.
Testing Details 🔍
Manually tested in `nvcr.io/nvidia/merlin/merlin-pytorch:23.02` by installing the main branch of all Merlin libraries via `pip install . --no-deps`. Run 01 notebook first and then run 03 notebook.
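To illustrate the switch from `torch.distributed.launch` to `torchrun` noted in the implementation details, here is a minimal sketch of how a single-node launch command might be assembled from Python. The helper `build_torchrun_command`, the script name `train_script.py`, and the `--epochs` argument are hypothetical placeholders, not the notebook's actual command. Only `torchrun` and its `--nproc_per_node` option come from PyTorch itself.

```python
import shlex
from typing import List, Sequence


def build_torchrun_command(
    script: str, nproc_per_node: int, script_args: Sequence[str] = ()
) -> List[str]:
    """Build an argv list for a single-node torchrun launch (illustrative only)."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    # torchrun starts `nproc_per_node` worker processes and exports LOCAL_RANK
    # to each of them; everything after the script path goes to the script.
    return ["torchrun", f"--nproc_per_node={nproc_per_node}", script, *script_args]


# Hypothetical script name and argument, used only for illustration.
cmd = build_torchrun_command("train_script.py", 2, ["--epochs", "1"])
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine with PyTorch installed and at least two GPUs.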