
Add doc for multinode GPU training.#5704

Merged
vanbasten23 merged 4 commits intomasterfrom
addMultiHostGPUDoc
Oct 27, 2023

Conversation

@vanbasten23
Collaborator

@vanbasten23 vanbasten23 commented Oct 16, 2023

Will do some testing first. Once the feature is more stable, I'll merge this PR.

Comment thread docs/pjrt.md
Collaborator

@jonb377 jonb377 left a comment


LGTM

Comment thread docs/pjrt.md
```
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet_torchrun.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```
Collaborator

test_train_mp_imagenet_torchrun.py -> pytorch/xla/test/test_train_mp_imagenet.py

Collaborator Author

done

Collaborator

Reopening - I still see test_train_mp_imagenet_torchrun

Comment thread docs/pjrt.md
`GPU_NUM_DEVICES` to the number of devices on the host. For example:

```
PJRT_DEVICE=GPU GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```

Collaborator

Is this still accurate with GPU_NUM_DEVICES?

Collaborator Author

Hi @will-cromar and @jonb377, I know in our previous discussion we said we should replace GPU_NUM_DEVICES with LOCAL_WORLD_SIZE, but I don't think we can do that.

The reason is that if we did so, then running

PJRT_DEVICE=GPU LOCAL_WORLD_SIZE=2 python -c 'xm.xla_device()'

would hang, because the distributed runtime service expects 2 clients here but we only have 1 process/client.

How do you feel about keeping GPU_NUM_DEVICES in the single-host-multi-GPU case?
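The hang described here can be illustrated with a toy barrier (this is not PyTorch/XLA code, just a sketch of the mechanism): a rendezvous that waits for an expected number of clients never releases when the environment claims 2 clients but only 1 process actually connects.

```python
import threading

def client(barrier: threading.Barrier, results: list, idx: int) -> None:
    # Each client blocks at the barrier until all expected clients arrive.
    try:
        barrier.wait(timeout=1)
        results[idx] = "ready"
    except threading.BrokenBarrierError:
        results[idx] = "timed out"

expected_clients = 2  # what an env var like LOCAL_WORLD_SIZE=2 promises
barrier = threading.Barrier(expected_clients)
results = ["unset"] * expected_clients

# Launch only ONE client thread: mirrors running a single process against
# a rendezvous configured for two.
t = threading.Thread(target=client, args=(barrier, results, 0))
t.start()
t.join()
print(results[0])  # the lone client times out instead of completing
```

In the real runtime there is no timeout on the initial rendezvous, so the process simply hangs rather than failing fast.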

Collaborator

That's a good point. Let's leave GPU_NUM_DEVICES for single-host-multi-GPU and try to think of a better solution.

Collaborator

Let's leave it in; I don't see a straightforward way around the issue. Do we need to modify the runtime initialization logic in the computation client to account for this?

Collaborator

@wbmc wbmc left a comment

LGTM

@vanbasten23 vanbasten23 requested a review from JackCaoG October 25, 2023 23:48
@vanbasten23 vanbasten23 marked this pull request as ready for review October 25, 2023 23:48
@yeounoh
Contributor

yeounoh commented Oct 26, 2023

cc @zpcore @ManfeiBai for reference.

Collaborator

@will-cromar will-cromar left a comment

Overall LGTM

Comment thread docs/pjrt.md Outdated
Comment thread docs/pjrt.md Outdated
@vanbasten23 vanbasten23 merged commit b0dca12 into master Oct 27, 2023
@vanbasten23 vanbasten23 deleted the addMultiHostGPUDoc branch October 27, 2023 17:39
Comment thread docs/pjrt.md
You can also use `torchrun` to initiate single-node multi-GPU training. For example:

```
PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node ${NUM_GPU_DEVICES} xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```

Collaborator

Does torchrun set all the appropriate environment variables even for single-host? MASTER_ADDR is the one I'm curious about.

Collaborator Author

Here if I don't specify --rdzv_endpoint, MASTER_ADDR will not be set by torchrun. In our code, if it's not set, we default to localhost.
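The fallback described here can be sketched as follows. `resolve_master_addr` is a hypothetical helper, not the actual PyTorch/XLA code, but it mirrors the stated behavior: `torchrun` leaves `MASTER_ADDR` unset when no `--rdzv_endpoint` is given, and the code then defaults to localhost.

```python
def resolve_master_addr(env: dict) -> str:
    # torchrun sets MASTER_ADDR when a rendezvous endpoint is configured;
    # otherwise fall back to localhost for single-node runs.
    return env.get("MASTER_ADDR", "localhost")

print(resolve_master_addr({}))                           # localhost
print(resolve_master_addr({"MASTER_ADDR": "10.0.0.5"}))  # 10.0.0.5
```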

Comment thread docs/pjrt.md

### Multi-node GPU training

**Note that this feature only works for CUDA 12+.** Similar to how PyTorch launches multi-node training, you can run the command as below:
Collaborator

Does the cuda 12 constraint also apply to the single-node case?

Collaborator Author

AFAIK, the CUDA 12 constraint only applies to the multi-node case.

Comment thread docs/pjrt.md
Comment on lines +226 to +229
- `--nnodes`: the number of GPU machines to use.
- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1.
- `--nproc_per_node`: the number of GPU devices to use on the current machine.
- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `<host>:<port>`. The host is the internal IP address. The port can be any available port on the machine.
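The flags above combine into one launch command per machine, identical on every node except for `--node_rank`. A small sketch (the `build_torchrun_cmd` helper is hypothetical, for illustration only):

```python
def build_torchrun_cmd(nnodes: int, node_rank: int, nproc_per_node: int,
                       rdzv_endpoint: str, script: str, *script_args: str) -> list:
    # Assemble the torchrun argument list for one machine in the cluster.
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
        *script_args,
    ]

# Machines 0 and 1 run the same command, differing only in --node_rank.
for rank in range(2):
    cmd = build_torchrun_cmd(2, rank, 4, "<MACHINE_0_IP_ADDRESS>:12355",
                             "xla/test/test_train_mp_imagenet.py",
                             "--fake_data", "--pjrt_distributed")
    print(" ".join(cmd))
```

In practice the command would also be prefixed with the `PJRT_DEVICE=GPU` environment variable, as in the doc's examples.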
Collaborator

Maybe we can link to the torchrun docs here as well.

Collaborator Author

done

Comment thread docs/pjrt.md
```
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```
Collaborator

cc @will-cromar, I saw https://github.com/pytorch/xla/pull/5732/files. Should the flag be --ddp now instead?

Collaborator Author

Thanks for the catch Jon!

jonb377 pushed a commit that referenced this pull request Oct 31, 2023
* add doc for multinode training.

* reworded a bit

* fix comments

* emphasize that cuda12+ is needed.
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024