This repository was archived by the owner on Nov 3, 2023. It is now read-only.

GPU Training Fixes #121

Merged

amogkam merged 23 commits into ray-project:main from amogkam:1.5-gpu on Jan 25, 2022

Conversation

@amogkam (Collaborator) commented on Jan 22, 2022

Various fixes for distributed GPU Training

  • Supports GPU training with PTL 1.5
  • Fixes NCCL errors when training with multiple GPUs on the same node by sharing the CUDA visible devices across all workers on the same node. Closes #61 (unhandled cuda error, NCCL version 2.7.8).
  • Fixes Horovod GPU Training

All GPU tests were run manually and are passing.
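The CUDA-visible-devices sharing in the second bullet can be sketched roughly as follows. This is an illustrative reading of the fix, not the merged Ray code; `shared_cuda_visible_devices` and the node-to-GPU mapping are hypothetical names. The idea is that every worker on a node sets `CUDA_VISIBLE_DEVICES` to the union of the GPU ids assigned to all co-located workers, so NCCL can open channels between any pair of local GPUs instead of each worker seeing only its own device.

```python
import os

def shared_cuda_visible_devices(node_to_gpu_ids, node_ip):
    """Return the CUDA_VISIBLE_DEVICES string for every worker on node_ip:
    the sorted, de-duplicated union of GPU ids assigned to all workers
    co-located on that node."""
    gpu_ids = sorted(set(node_to_gpu_ids[node_ip]))
    return ",".join(str(i) for i in gpu_ids)

# Example: two workers on node 10.0.0.1, assigned GPUs 0 and 1 respectively.
assignments = {"10.0.0.1": [0, 1], "10.0.0.2": [0]}
os.environ["CUDA_VISIBLE_DEVICES"] = shared_cuda_visible_devices(
    assignments, "10.0.0.1"
)
# Every worker on 10.0.0.1 now sees devices "0,1" rather than only its own.
```

With only its own device visible, a worker's NCCL communicator cannot reach its neighbor's GPU, which is consistent with the unhandled CUDA error reported in #61.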

@amogkam amogkam changed the title 1.5 gpu GPU Training Fixes Jan 22, 2022
@matthewdeng (Contributor) left a comment


Changes look reasonable to me!

Comment on lines +517 to +518
# Adjust for if there are multiple GPUs per worker.
device_id = self.local_rank + (self.num_gpus_per_worker - 1)
Contributor

What's the intended logic here?

If num_gpus_per_worker=n this would give [n-1, n, n+1, ...]?

Collaborator Author

yeah idk what I was thinking here 😅. Fixed with correct math and added a test for this.
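For context, one way the corrected math could look (a sketch of my reading of the fix, not the merged code; `worker_device_ids` is a hypothetical helper): with `local_rank` r and `num_gpus_per_worker` n, each worker should own the contiguous block of device ids [r*n, r*n + n - 1]. The original `self.local_rank + (self.num_gpus_per_worker - 1)` instead yields the overlapping sequence [n-1, n, n+1, ...] that the reviewer pointed out.

```python
def worker_device_ids(local_rank, num_gpus_per_worker):
    """Assign each worker a contiguous, non-overlapping block of GPU ids:
    worker r with n GPUs per worker owns devices r*n .. r*n + n - 1."""
    start = local_rank * num_gpus_per_worker
    return list(range(start, start + num_gpus_per_worker))

# With 2 GPUs per worker: rank 0 -> [0, 1], rank 1 -> [2, 3]; no overlap.
```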

@amogkam amogkam merged commit 4d86700 into ray-project:main Jan 25, 2022

Development

Successfully merging this pull request may close these issues.

unhandled cuda error, NCCL version 2.7.8