Set device in dataloaders by edknv · Pull Request #654 · NVIDIA-Merlin/Transformers4Rec

edknv · 2023-03-22T06:53:18Z

Fixes #651

Goals ⚽

Fix the multi-GPU training notebook.

Implementation Details 🚧

Depends on "Put row lengths on the same device on gpu" (dataloader#113).
device is set to the same value as local_rank.
Unless the last batch is dropped (dataloader_drop_last=True), recsys_trainer.evaluate hangs. This probably needs investigation, because it did not happen before the list column refactoring in merlin-dataloader (see ticket).
torch.distributed.launch is replaced with torchrun because the former has been deprecated.
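
As a rough illustration of what dropping the last batch changes, here is a minimal sketch in plain Python. The helper names (batches_per_epoch, last_batch_size) are hypothetical and are not part of Transformers4Rec or merlin-dataloader. The sketch only shows the batch arithmetic and does not explain the hang described above.

```python
import math


def batches_per_epoch(num_rows: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches yielded for num_rows rows (hypothetical helper)."""
    if drop_last:
        # The trailing partial batch, if any, is discarded.
        return num_rows // batch_size
    # The trailing partial batch is kept as a smaller final batch.
    return math.ceil(num_rows / batch_size)


def last_batch_size(num_rows: int, batch_size: int, drop_last: bool) -> int:
    """Size of the final batch that is actually yielded (0 if none)."""
    if batches_per_epoch(num_rows, batch_size, drop_last) == 0:
        return 0
    remainder = num_rows % batch_size
    if drop_last or remainder == 0:
        return batch_size
    return remainder


if __name__ == "__main__":
    for drop_last in (True, False):
        print(
            drop_last,
            batches_per_epoch(1000, 128, drop_last),
            last_batch_size(1000, 128, drop_last),
        )
```

With drop_last enabled, every batch has the full batch size; without it, the final batch may be smaller than the rest.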

Testing Details 🔍

Manually tested in the nvcr.io/nvidia/merlin/merlin-pytorch:23.02 container by installing the main branch of all Merlin libraries via pip install . --no-deps. Run the 01 notebook first, then the 03 notebook.

review-notebook-app · 2023-03-22T06:53:22Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

github-actions · 2023-03-22T07:05:09Z

Documentation preview

https://nvidia-merlin.github.io/Transformers4Rec/review/pr-654

rnyak · 2023-03-22T13:50:52Z

@edknv thanks for the quick fix. I have just one comment. This doc says that if you use torchrun, you should change your training script to read from the LOCAL_RANK environment variable, as demonstrated by the following code snippet:

import os
local_rank = int(os.environ["LOCAL_RANK"])

What do you think? Does it make a big difference or not?

edknv · 2023-03-22T15:36:18Z

In our case, it doesn't look like it makes a difference. From what I can tell, torchrun seems to make use of the local rank automatically. I think the doc is saying, if you need to use the local_rank variable in your script, use local_rank = int(os.environ["LOCAL_RANK"]). Our script does not make use of this variable, so I didn't include it.
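
For scripts that do need the value, a minimal sketch of reading it might look like the following. torchrun sets the LOCAL_RANK environment variable for each worker process; the helper names (get_local_rank, device_name), the fallback to 0 when the variable is unset, and the "cuda:N" device string are assumptions made for illustration and are not taken from this PR.

```python
import os
from typing import Mapping


def get_local_rank(environ: Mapping[str, str] = os.environ, default: int = 0) -> int:
    """Read LOCAL_RANK as an int (hypothetical helper).

    Falls back to `default` when the variable is unset, for example when
    the script is started with plain `python` instead of torchrun.
    """
    return int(environ.get("LOCAL_RANK", default))


def device_name(local_rank: int) -> str:
    """Map a local rank to a device string, one GPU per process (assumption)."""
    return f"cuda:{local_rank}"


if __name__ == "__main__":
    rank = get_local_rank()
    print(f"local_rank={rank}, device={device_name(rank)}")
```

Passing the environment in as a parameter keeps the helper easy to test without launching multiple processes.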

Set device in dataloaders

1478036

edknv self-assigned this Mar 22, 2023

edknv added the bug Something isn't working label Mar 22, 2023

edknv added this to the Merlin 23.03 milestone Mar 22, 2023

edknv requested a review from rnyak March 22, 2023 06:58

rnyak approved these changes Mar 22, 2023

karlhigley merged commit ff5d304 into NVIDIA-Merlin:main Mar 22, 2023

karlhigley merged 1 commit into NVIDIA-Merlin:main from edknv:loader/set_device
