[SPMD] Support SPMDFullToShardShape#6922
Conversation
return xtensors;
}

bool IsIr(const at::Tensor& tensor) {
# It looks like XLA doesn't like only having manual sharding in the HLO.
# It needs to be paired with SPMDFullToShardShape/SPMDShardToFullShape.
# The following exception cannot be caught somehow.
# xx.cpu()
do you intend to keep this xx.cpu?
Yeah, it's more of a note that this won't work... I was trying to use self.assertRaises, but that doesn't capture the exception... I've noticed this before too: when libtpu crashes, it's hard to catch at the Python level. Not sure why. Maybe you have some better ideas?
Oh, I think I ran into a similar issue before... The way I handled it was ugly, through
xla/test/spmd/test_dynamo_spmd.py
Lines 172 to 181 in a7a1357
A C++ crash at the PyTorch level can be caught with self.assertRaises, but not one at the libtpu level... I'm not sure why... yeah, not even with this hack...
cc @will-cromar Do you know how to catch a libtpu exception in Python? I'd appreciate your insights.
I don't think you can. To make a proper runtime error, you have to raise an exception, and Google internal binaries don't generally do that. I wrote about a similar case in #6700 (comment)
Thanks, Will. That makes a lot of sense now.
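To make the "can't catch it" point concrete: a native abort (like a libtpu crash) kills the whole interpreter before any Python exception is raised, so `self.assertRaises` never fires. A hedged workaround sketch, assuming the crashing snippet can be run in isolation, is to execute it in a subprocess and assert on the exit status instead. Here `os.abort()` is a stand-in for the libtpu crash, not the real `xx.cpu()` call:

```python
import subprocess
import sys

# Run the crashing snippet in a child process; a native abort kills the
# child before any Python exception exists, but the parent can still
# observe the non-zero exit status. os.abort() simulates the crash.
result = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    capture_output=True,
)

# A clean run would return 0; an abort yields a non-zero status
# (on POSIX, a negative value encoding the killing signal).
assert result.returncode != 0
print("child crashed as expected")
```

The same pattern works inside a unittest method: spawn the child, then assert on `result.returncode` rather than wrapping the call in `assertRaises`.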
  tensor_methods::custom_sharding_(output_tensor,
-                                  input_tensor->sharding_spec());
+                                  input_tensor->sharding_spec(),
+                                  CustomSharding::Type::kSharding);
So you assume only tensors with kSharding will be called with in-place ops?
That's the original design, which aligns with the original design of SPMD... So yeah, for kSharding...
Can we make kSharding the default then? That way most people reading this code won't need to figure out what kSharding actually means.
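For illustration, a hedged Python sketch of that suggestion (the real API is the C++ `tensor_methods::custom_sharding_`; the names below are hypothetical mirrors): defaulting the type to the kSharding case means the common call sites stay unchanged, and only the new behavior is an explicit opt-in.

```python
from enum import Enum, auto

class CustomShardingType(Enum):
    # Plain sharding-annotation custom call (the common case).
    K_SHARDING = auto()
    K_SPMD_FULL_TO_SHARD_SHAPE = auto()
    K_SPMD_SHARD_TO_FULL_SHAPE = auto()

# Hypothetical mirror of tensor_methods::custom_sharding_: a default
# value means existing call sites never have to name K_SHARDING.
def custom_sharding(tensor, sharding_spec,
                    sharding_type=CustomShardingType.K_SHARDING):
    return (tensor, sharding_spec, sharding_type)

# Existing call site, unchanged by the new enum:
_, _, t = custom_sharding("tensor", "spec")
assert t is CustomShardingType.K_SHARDING

# New behavior is requested explicitly:
_, _, t = custom_sharding("tensor", "spec",
                          CustomShardingType.K_SPMD_FULL_TO_SHARD_SHAPE)
print(t.name)
```

The same idea carries over to C++ as a default argument on the `CustomSharding::Type` parameter.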
enum class Type {
  kSharding,
  kSPMDFullToShardShape,
  kSPMDShardToFullShape,
};
This enum is really confusing; can you add some comments about what these actually do? I was reading the SPMD code again: this op itself only means we want to shard the underlying value, and the actual sharding resides in the XlaTensor or backing XLA IR object?
Right, this is just the name of the custom call. The sharding annotation is in XlaTensor as normal. I can add more explanations.
Maybe we can annotate explicitly that this is a sharding type for a custom call, in the enum class name or something.
I guess the current approach sort of does it already? Can you be more specific? @yeounoh
I agree, Type is already defined under CustomSharding
Thanks, Yeounoh!
Summary:
This pull request supports SPMDFullToShardShape, a custom op that opens a region for a non-partitioned graph in an SPMD program. It stops SPMD auto-sharding and partitioning in that region, and therefore allows manual sharding, e.g. for collective communication (cc) ops.
To implement it, this pull request expands the CustomSharding node to accept a new type. Note that the output shape of the op needs to be the shard shape of the input, and the node needs to have a manual sharding annotation.
Test Plan:
PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py -v -k test_spmd_full_to_shard_shape
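The shard-shape requirement from the summary can be illustrated with a small sketch. This is pure Python, not the torch_xla implementation, and the helper name is hypothetical; it assumes a simple tiled sharding, where each dimension of the full shape is divided by the mesh size along that dimension:

```python
# Hypothetical helper illustrating the shard-shape rule for tiled
# sharding: the output of SPMDFullToShardShape has the per-device shard
# shape, i.e. each dim of the full shape divided (ceiling division) by
# the mesh dim size.
def shard_shape(full_shape, mesh_shape):
    assert len(full_shape) == len(mesh_shape)
    return tuple(-(-dim // m) for dim, m in zip(full_shape, mesh_shape))

# An (8, 16) tensor on a (4, 1) mesh is split into (2, 16) shards.
print(shard_shape((8, 16), (4, 1)))  # → (2, 16)
```

In the PR, this per-device shape is what the CustomSharding node must report as its output shape, alongside the manual sharding annotation.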