Add NCCL support for pytorch_distributed_simple.py #150

Closed
toshihikoyanase wants to merge 6 commits into optuna:main from toshihikoyanase:add-distributed-support-for-pytorch-distributed-simple

Conversation

@toshihikoyanase
Member

Motivation

Alternative approach for #145.

Description of the changes

  • Use the device argument of optuna.integration.TorchDistributedTrial
  • Check the local rank to select the CUDA device within a node
  • Use the MASTER_ADDR and MASTER_PORT environment variables if they are set
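
As a rough illustration of the last two bullets, here is a minimal sketch of reading the rendezvous settings and the local rank from the environment. It is not the code from this PR: the helper name `resolve_rendezvous`, the `Rendezvous` tuple, and the fallback values `localhost` and `20000` are assumptions made for the example. `OMPI_COMM_WORLD_LOCAL_RANK` is the per-node rank that Open MPI exports; other launchers use different variable names.

```python
import os
from typing import Mapping, NamedTuple


class Rendezvous(NamedTuple):
    """Hypothetical container for the settings a process group needs."""

    master_addr: str
    master_port: str
    local_rank: int


def resolve_rendezvous(env: Mapping[str, str] = os.environ) -> Rendezvous:
    # Prefer values supplied by the launcher (e.g. via `mpiexec -x ...`).
    # The fallback values below are assumptions for single-node runs.
    master_addr = env.get("MASTER_ADDR", "localhost")
    master_port = env.get("MASTER_PORT", "20000")
    # Open MPI exports the rank of the process within its node under this name.
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    return Rendezvous(master_addr, master_port, local_rank)
```

The resulting `local_rank` could then be used to pick a GPU on the node, for example with `torch.device("cuda", local_rank)`, assuming one process per GPU.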

I confirmed that this script worked with 2 nodes.

$ mpiexec -x OMP_NUM_THREADS=1 --bind-to none \
    -x MASTER_ADDR="your master address" \
    -x MASTER_PORT="your master port" \
    python pytorch/pytorch_distributed_simple.py

See
https://pytorch.org/docs/stable/distributed.html

@HideakiImamura
Member

@not522 Could you review this PR?

@toshihikoyanase
Member Author

toshihikoyanase commented Dec 1, 2022

I found that this code does not work on a PC that has NVIDIA GPUs but no NCCL support.
If you have any ideas on how to check NCCL availability, please let me know.
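
One possible way to approach this check, offered as a sketch rather than as what the PR ended up doing: `torch.distributed.is_nccl_available()` reports whether the installed PyTorch build includes the NCCL backend. The helper names `choose_backend` and `detect_backend` are hypothetical, and falling back to `gloo` is an assumption about the desired behavior.

```python
def choose_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Return "nccl" only when both CUDA and NCCL are usable, else "gloo"."""
    return "nccl" if cuda_available and nccl_available else "gloo"


def detect_backend() -> str:
    # Imported lazily so that choose_backend can be used without PyTorch.
    import torch
    import torch.distributed as dist

    # is_nccl_available() is only queried when the distributed package
    # itself was built into this PyTorch installation.
    nccl_ok = dist.is_available() and dist.is_nccl_available()
    return choose_backend(torch.cuda.is_available(), nccl_ok)
```

The returned string could then be passed as the `backend` argument of `torch.distributed.init_process_group`. Note that `gloo` also requires a PyTorch build with distributed support, so this sketch does not cover builds without it.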

@github-actions

github-actions bot commented Dec 8, 2022

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label Dec 8, 2022
@not522 not522 removed the stale label Dec 15, 2022
@github-actions

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label Dec 22, 2022
@not522 not522 removed the stale label Dec 23, 2022
@github-actions

github-actions bot commented Jan 1, 2023

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label Jan 1, 2023
@not522 not522 removed the stale label Jan 10, 2023
@github-actions

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label Jan 17, 2023
@not522 not522 removed the stale label Jan 31, 2023
@github-actions

github-actions bot commented Feb 7, 2023

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label Feb 7, 2023
@not522
Member

not522 commented Feb 8, 2023

Sorry for the late response. I have been busy lately and don't have time to review this PR, so could you please reassign the reviewer?

@github-actions github-actions bot removed the stale label Feb 9, 2023
@github-actions

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label Feb 19, 2023
@github-actions

github-actions bot commented Mar 6, 2023

This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.

@github-actions github-actions bot closed this Mar 6, 2023
@c-bata c-bata reopened this Mar 13, 2023
@c-bata c-bata removed the stale label Mar 13, 2023
@c-bata c-bata assigned c-bata and unassigned not522 Mar 13, 2023
@c-bata
Member

c-bata commented Mar 13, 2023

I will review this PR along with optuna/optuna#4268. @toshihikoyanase Could you fix CI errors?

@toshihikoyanase
Member Author

Let me close this PR since it uses the old API. I'll create a new one.

@toshihikoyanase toshihikoyanase deleted the add-distributed-support-for-pytorch-distributed-simple branch March 13, 2023 04:41

4 participants