Add doc for multinode GPU training. #5704
Conversation
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet_torchrun.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
test_train_mp_imagenet_torchrun.py -> pytorch/xla/test/test_train_mp_imagenet.py
Reopening - I still see test_train_mp_imagenet_torchrun
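For clarity, a sketch of the node-1 command with that path fix applied (the `PJRT_DEVICE=GPU torchrun \` prefix is assumed from the surrounding doc, since the quote above is truncated):

```
# Node 1 of 2, with the script path corrected to test_train_mp_imagenet.py as
# requested in the review above.
PJRT_DEVICE=GPU torchrun \
  --nnodes=2 \
  --node_rank=1 \
  --nproc_per_node=4 \
  --rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" \
  pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```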
`GPU_NUM_DEVICES` to the number of devices on the host. For example:

```
PJRT_DEVICE=GPU GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```
Is this still accurate with GPU_NUM_DEVICES?
Hi @will-cromar and @jonb377, I know in our previous discussion we said we should replace GPU_NUM_DEVICES with LOCAL_WORLD_SIZE, but I don't think we can do that.
The reason is that if we did so, then running
PJRT_DEVICE=GPU LOCAL_WORLD_SIZE=2 python -c 'import torch_xla.core.xla_model as xm; xm.xla_device()'
would hang, because the distributed runtime service expects 2 clients here but we only have 1 process/client.
How do you feel about keeping GPU_NUM_DEVICES in the single-host-multi-GPU case?
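For comparison, the single-host form that does work today; this is the same `GPU_NUM_DEVICES` example as in the diff above, repeated here to contrast with the hanging `LOCAL_WORLD_SIZE` invocation:

```
# Single-host multi-GPU: keep GPU_NUM_DEVICES; a lone launcher process with
# LOCAL_WORLD_SIZE set would block waiting for peers that never join.
PJRT_DEVICE=GPU GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```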
That's a good point. Let's leave GPU_NUM_DEVICES for single-host-multi-GPU and try to think of a better solution.
Let's leave it in, I don't see a straightforward way around the issue. Do we need to modify the runtime initialization logic in the computation client to account for this?
cc @zpcore @ManfeiBai for reference.
You can also use `torchrun` to initiate the single-node multi-GPU training. For example,

```
PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node ${NUM_GPU_DEVICES} xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```
Does torchrun set all the appropriate environment variables even for single-host? MASTER_ADDR is the one I'm curious about.
Here, if I don't specify --rdzv_endpoint, MASTER_ADDR will not be set by torchrun. In our code, if it's not set, we default to localhost.
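As an aside, a sketch of a single-node launch that does pass --rdzv_endpoint explicitly, so torchrun itself sets MASTER_ADDR rather than relying on the localhost fallback (port 12355 is just an illustrative value):

```
# Single node, 4 local GPUs; the explicit rendezvous endpoint means torchrun
# populates MASTER_ADDR/MASTER_PORT instead of the fallback logic kicking in.
PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node 4 \
  --rdzv_endpoint="localhost:12355" \
  xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```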
### Multi-node GPU training

**Note that this feature only works for cuda 12+**. Similar to how PyTorch uses multi-node training, you can run the command as below:
Does the cuda 12 constraint also apply to the single-node case?
AFAIK, the cuda 12 constraint only applies to the multi-node case.
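As a side note (not part of the doc text), one quick way to check which CUDA version the driver on a machine reports before attempting the multi-node path:

```
# Prints the driver-reported CUDA version; the multi-node path above is
# documented to require CUDA 12 or newer.
nvidia-smi | grep "CUDA Version"
```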
- `--nnodes`: how many GPU machines to be used.
- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1.
- `--nproc_per_node`: the number of GPU devices to be used on the current machine.
- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `<host>:<port>`. The `host` will be the internal IP address. The port can be any available port on the machine.
Maybe we can link to the torchrun docs here as well.
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
cc @will-cromar, I saw https://github.com/pytorch/xla/pull/5732/files. Should the flag be --ddp now instead?
Thanks for the catch, Jon!
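Assuming #5732 did replace `--pjrt_distributed` with `--ddp`, the node-0 command would presumably become something like:

```
# Sketch only: --ddp substituted for --pjrt_distributed per the review comment
# above; the PJRT_DEVICE=GPU torchrun prefix is assumed from the full doc.
PJRT_DEVICE=GPU torchrun \
  --nnodes=2 \
  --node_rank=0 \
  --nproc_per_node=4 \
  --rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" \
  pytorch/xla/test/test_train_mp_imagenet.py --fake_data --ddp --batch_size=128 --num_epochs=1
```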
* add doc for multinode training.
* reworded a bit
* fix comments
* emphasize that cuda12+ is needed.
Will do some testing first. Once the feature is more stable, I'll merge this PR.