[SPMD] Hybrid Device mesh creation #5147
Conversation
cc @alanwaketan
jonb377
left a comment
Looking great, Mohit! Could we also add some basic unit tests in https://github.com/pytorch/xla/blob/master/test/spmd/test_xla_sharding.py?
    out[coords[0], coords[1], coords[2]] = d
    return out

def _create_device_mesh_for_nd_torus(
Can you explain how this function optimizes performance according to the TPU physical topology? What's the algorithm? Is it that the inner ring has the highest performance, so we should assign the back of mesh_shape to it?
Spoke with Mohit offline. The rule is that the TPU topology is always 3D, and the inner 2D tori have a faster ICI than the links that connect across them. Therefore, we should map the most communication-demanding rank, i.e., the highest rank of the mesh, to the inner 2D tori.
Now that I've read more of the code, this algorithm seems quite restrictive:
- It only works for mapping a 2D or 3D logical mesh onto the 3D physical mesh.
- For a 3D logical mesh, I think it needs to be a transpose of the physical mesh.
- For a 2D logical mesh, it just tries to map a combination of the physical axes onto each dimension of the logical mesh.
Given these simple rules, it then makes sure that devices that are physically close to each other are also assigned close to each other in the logical mesh. For example, assuming the logical mesh is 2D, the devices in mesh[0] will always be a 2D slice of the 3D physical mesh.
If my understanding is correct, @khatwanimohit can you polish my comments and turn them into the docstring of this helper?
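To make the discussion concrete, here is a hedged, simplified sketch of the greedy axis-assignment idea described above (the function name and tie-breaking order are illustrative assumptions, not the actual PyTorch/XLA or JAX implementation): walk the logical mesh from its highest rank (assumed the most communication-intensive) down, and give each logical axis the smallest combination of physical torus axes whose sizes multiply to the logical size.

```python
import itertools

import numpy as np


def nd_torus_mesh_sketch(physical_mesh: np.ndarray, mesh_shape: tuple) -> np.ndarray:
  # Sketch only: assign physical torus axes to logical mesh axes, starting
  # from the highest logical rank, preferring the fewest physical axes
  # whose sizes multiply to the target logical size.
  assignment = [() for _ in mesh_shape]
  remaining = list(range(physical_mesh.ndim))
  for logical_idx in reversed(range(len(mesh_shape))):
    size = mesh_shape[logical_idx]
    if size == 1:
      continue  # a trivial logical axis needs no physical axes
    found = None
    for n in range(1, len(remaining) + 1):
      for combo in itertools.combinations(remaining, n):
        if np.prod([physical_mesh.shape[a] for a in combo]) == size:
          found = combo
          break
      if found:
        break
    if found is None:
      raise ValueError(f"no combination of physical axes multiplies to {size}")
    assignment[logical_idx] = found
    for a in found:
      remaining.remove(a)
  # Group each logical axis's physical axes together, then collapse them.
  transpose_order = [a for axes in assignment for a in axes]
  return physical_mesh.transpose(transpose_order).reshape(mesh_shape)


# A 2x2x2 torus of 8 device ordinals mapped onto a (4, 2) logical mesh:
# the last logical axis gets a single physical axis, so neighbors along it
# stay physically adjacent.
physical = np.arange(8).reshape(2, 2, 2)
logical = nd_torus_mesh_sketch(physical, (4, 2))
```

Under these assumptions, devices in each logical row are a contiguous slice of the physical torus, which is the locality property described above.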
You can add:
This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L64.
Force-pushed from ce4f052 to 9c6d8ab (Compare)
    hybrid_mesh = xs.HybridMesh(
        ici_mesh_shape=(1, 4), dcn_mesh_shape=(num_slices, 1))
    print(hybrid_mesh.get_logical_mesh())
    self.assertEqual(hybrid_mesh.get_logical_mesh().tolist(),
Does this result respect the _create_device_mesh_for_nd_torus algorithm?
Yes, I have confirmed this against JAX's mesh.
Can you make ici_mesh_shape=(2, 2)? I think that would better show how the algorithm works.
Changed ici_mesh_shape.
alanwaketan
left a comment
I just noticed that most of the helpers you introduced, @khatwanimohit, are inspired by https://github.com/google/jax/blob/bfe8acb31e04a540daad3f568239ec0e5c3f0d0f/jax/experimental/mesh_utils.py. In fact, all of those helpers have a very nice docstring explaining what they do.
I recommend that next time you import some JAX utils into PyTorch/XLA, you:
- List the source on each util you import.
- Import their docstrings as well. Those are really critical for the readability of the code.
Also, have you checked the licenses to make sure you can copy code from JAX into PyTorch/XLA? If not, I can do the research for you.
Force-pushed from 79336d3 to 572548b (Compare)
alanwaketan
left a comment
Mostly looking good to me. Thanks, @khatwanimohit.
Please address the comments on readability.
    return self.device_ids.reshape(self.mesh_shape)


# HybridDevice class has been inspired from jax's mesh_utils: https://github.com/google/jax/blob/fc5960f2b8b7a0ef74dbae4e27c5c08ff1564cff/jax/experimental/mesh_utils.py#L4
Can you make it per helper that you imported?
    super().__init__(device_ids, mesh_shape, axis_names)

  def _get_physical_tpu_mesh(self, devices: Sequence[Any]) -> np.ndarray:
    r"""Rearrange TPU devices in a slice into a physical mesh."""
Can you add:
1. This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L172
2. The following description of the function:
r"""Rearrange TPU devices in a slice into a physical mesh.
Args:
devices: A list of device logical ordinals in a TPU slice.
Returns:
A np.ndarray of device logical ordinals with shape [global_x, global_y, global_z]. On
v2 and v3, global_z is instead cores_per_chip (i.e., 2).
"""
        physical_mesh, mesh_shape)
    return device_mesh

  def _create_hybrid_device_mesh(self, ici_mesh_shape: Sequence[int],
Can you add:
1. This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L288.
2. The following function description:
"""Creates a device mesh for hybrid (e.g., ICI and DCN) parallelism.
Args:
ici_mesh_shape: shape of the logical mesh for the faster/inner network, ordered
by increasing network intensity, e.g. [replica, data, mdl] where mdl has
the most network communication requirements.
dcn_mesh_shape: shape of the logical mesh for the slower/outer network,
in the same order as mesh_shape.
Returns:
A np.ndarray of device logical ordinal with ici_mesh_shape * dcn_mesh_shape as its shape
that can be fed into HybridMesh for hybrid parallelism.
"""
    return physical_mesh.transpose(transpose).reshape(mesh_shape), assignment


  # This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L231
  def _create_device_mesh(self,
I didn't mention this one since your logic is quite different. I suggest you undo it.
Fixed the comment.
alanwaketan
left a comment
LGTM. Thanks, Mohit.
The TPU CI broke after this PR was merged. Is this related?
Let's have a follow-up to disable the test for TPU. You can do that by following: https://github.com/pytorch/xla/blob/master/test/test_zero1.py#L13
No description provided.