
[SPMD] auto-sharding PoC#6719

Merged
yeounoh merged 39 commits intomasterfrom
spmd_auto_alpa
Mar 14, 2024
Conversation

@yeounoh
Contributor

@yeounoh yeounoh commented Mar 12, 2024

This implements a PoC prototype on XLA:TPU, as described in #6322

PyTorch/XLA auto-sharding can be enabled by one of the following:

  • Setting envvar XLA_SPMD_AUTO=1
  • Calling the SPMD API at the beginning of your code:
import torch_xla.runtime as xr
xr.use_spmd(auto=True)
  • Calling torch.distributed._tensor.distribute_module with an auto-policy on xla:
import torch_xla.runtime as xr
from torch.distributed._tensor import DeviceMesh, distribute_module
from torch_xla.distributed.spmd import auto_policy

device_count = xr.global_runtime_device_count()
device_mesh = DeviceMesh("xla", list(range(device_count)))

# Currently, the model should be loaded onto the xla device via distribute_module.
model = MyModule()  # nn.module
sharded_model = distribute_module(model, device_mesh, auto_policy)
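For the first option, a minimal sketch of a training-script preamble (assumption: XLA_SPMD_AUTO is read when torch_xla initializes, so it would need to be set before torch_xla is imported; the actual torch_xla import is omitted so the sketch runs without a TPU):

```python
import os

# Assumption: the env var must be set before torch_xla is imported,
# since the runtime reads it at initialization.
os.environ["XLA_SPMD_AUTO"] = "1"

# import torch_xla  # would follow here in a real script

print(os.environ["XLA_SPMD_AUTO"])  # → 1
```

Equivalently, the variable can be set on the command line, e.g. `XLA_SPMD_AUTO=1 python train.py`.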

Some notable limitations that we will address in follow-ups:

  • XLA:GPU is not supported
  • TPU pod is not supported

cc @baoleai

@yeounoh yeounoh added the distributed SPMD and other distributed things. label Mar 12, 2024
@yeounoh yeounoh requested a review from JackCaoG March 12, 2024 00:21
@yeounoh yeounoh self-assigned this Mar 12, 2024
@yeounoh yeounoh marked this pull request as draft March 12, 2024 00:22
@yeounoh yeounoh force-pushed the spmd_auto_alpa branch 2 times, most recently from 126ceee to 4d568ef on March 12, 2024 00:25
Comment thread WORKSPACE Outdated
Comment thread setup.py Outdated
@yeounoh yeounoh force-pushed the spmd_auto_alpa branch 2 times, most recently from 6ca8f97 to d6dc442 on March 12, 2024 00:38
Comment thread test/spmd/test_dynamo_spmd.py
Comment thread test/spmd/test_spmd_graph_dump.py
Comment thread torch_xla/csrc/init_python_bindings.cpp Outdated
Comment thread torch_xla/csrc/runtime/profiler.cc
Comment thread torch_xla/csrc/init_python_bindings.cpp
@yeounoh yeounoh force-pushed the spmd_auto_alpa branch 12 times, most recently from 303b239 to d3c1d70 on March 12, 2024 07:34
yeounoh added 28 commits March 14, 2024 00:49
* Assume REPLICATED for UNKNOWN during parameter resharding
…patch

* Ungroup resharding ops

* Replace device data after resharding
Delete quantization openxla patch

Debugging probes
* Disable parameter wrapping with auto-sharding
Comment thread test/run_tests.sh
run_test "$CDIR/spmd/test_xla_distributed_checkpoint.py"
run_test "$CDIR/spmd/test_xla_spmd_python_api_interaction.py"
run_test "$CDIR/spmd/test_dtensor_integration.py"
run_test "$CDIR/spmd/test_dtensor_integration2.py"
Collaborator

do we need this on TPU CI as well or it is ok to leave out?

Contributor Author

Ohhh I think it's ok to leave out. Want to run this sanity check on TPU!

Collaborator

@JackCaoG JackCaoG left a comment

Feel free to address the remaining comments in a follow-up.

Labels

backport_2.3 distributed SPMD and other distributed things.


2 participants