
Put row lengths on the same device on gpu#113

Merged
oliverholworthy merged 3 commits into NVIDIA-Merlin:main from edknv:torch/multi-gpu-row-lengths
Mar 22, 2023

Conversation

@edknv
Contributor

@edknv edknv commented Mar 21, 2023

In a multi-GPU setting, tensors may be generated on different devices. This PR moves the result of torch.cumsum(row_lengths, 0) onto the same device as the zero_value tensor; if they are on different devices, torch.cat() cannot concatenate them, e.g.:

  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/torch.py", line 169, in _row_lengths_to_offsets
    return torch.cat((zero_value, torch.cumsum(row_lengths, 0)))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument tensors in method wrapper_cat)
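The fix in this PR can be sketched in isolation as follows. This is a minimal, device-agnostic illustration (run here on CPU), not the dataloader code itself; `device` stands in for the loader's `self.device` attribute, which would be e.g. `cuda:1` in a multi-GPU run.

```python
import torch

# Stand-in for the dataloader's self.device attribute.
device = torch.device("cpu")  # e.g. torch.device("cuda:1") in a multi-GPU run

row_lengths = torch.tensor([2, 3, 1])

# zero_value lives on the loader's device; moving the cumsum result onto
# the same device before torch.cat avoids the RuntimeError above.
zero_value = torch.zeros(1, dtype=row_lengths.dtype, device=device)
offsets = torch.cat((zero_value, torch.cumsum(row_lengths, 0).to(device=device)))
print(offsets.tolist())  # [0, 2, 5, 6]
```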

@edknv edknv self-assigned this Mar 21, 2023
@edknv edknv added the bug Something isn't working label Mar 21, 2023
@edknv edknv added this to the Merlin 23.03 milestone Mar 21, 2023
Comment thread: merlin/dataloader/torch.py (outdated)

      if len(row_lengths.shape) == 2:
          zero_value = zero_value.view(-1, 1)
  -   return torch.cat((zero_value, torch.cumsum(row_lengths, 0)))
  +   return torch.cat((zero_value, torch.cumsum(row_lengths, 0).to(device=self.device)))
Contributor

Assuming torch.cat preserves the device of the values being concatenated, maybe an alternative to this would be to change the line where zero_value is defined to use row_lengths.device instead of self.device?
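The suggested alternative can be sketched as a standalone helper. This is a hypothetical, self-contained version of the `_row_lengths_to_offsets` method, assuming the only loader state it needs is the device of `row_lengths` itself:

```python
import torch

def row_lengths_to_offsets(row_lengths: torch.Tensor) -> torch.Tensor:
    """Convert per-row lengths into cumulative offsets with a leading zero."""
    # Create zero_value on row_lengths.device (instead of self.device), so
    # torch.cat never mixes devices regardless of where the input lives.
    zero_value = torch.zeros(1, dtype=row_lengths.dtype, device=row_lengths.device)
    if len(row_lengths.shape) == 2:
        zero_value = zero_value.view(-1, 1)
    return torch.cat((zero_value, torch.cumsum(row_lengths, 0)))

print(row_lengths_to_offsets(torch.tensor([2, 3, 1])).tolist())  # [0, 2, 5, 6]
```

Since torch.cumsum returns a tensor on the same device as its input, both operands of torch.cat are then guaranteed to match.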

Contributor

That might make the row-lengths-to-offsets method more robust, but the fact that the error is showing up suggests that even with this fix we might end up with a mismatch between the loader.device attribute and the output tensor device?

Contributor Author

Thanks for the suggestion. Honestly I'm not sure which option is best, but your suggestion also works, and I think it's the better one. Changed in de593b2.


Labels

bug Something isn't working

Projects

None yet

Development

2 participants