Enable SPMD + dynamo for inference by JackCaoG · Pull Request #5002 · pytorch/xla

JackCaoG · 2023-05-12T01:29:08Z

This work was done by @yeounoh and I am trying to land this pr in his behalf. The last attempt was made for @steventk-g in #4862.

Currently test failed with an Check failed: handle->HasValue(), so still WIP.

JackCaoG · 2023-05-13T00:28:32Z

OK There are 2 issues

Dynamo async function currently silently fail if there is an exception happened. we need to add a rethrow logic
dynamo's PjRtComputationClient::PjRtData::Assign(const Data& data) failed with bad_cast error

void PjRtComputationClient::PjRtData::Assign(const Data& data) {
  TF_VLOG(3) << "enter assign\n";
  const PjRtData& pjrt_data = dynamic_cast<const PjRtData&>(data);
  if (&pjrt_data != this) {
    buffer = pjrt_data.buffer;
  }
  TF_VLOG(3) << "left assign\n";
}

2023-05-13 00:26:38.471263: I third_party/xla_client/pjrt_computation_client.cc:135] enter assign

E
RuntimeError: std::bad_cast

JackCaoG · 2023-05-13T00:32:10Z

Ah Ok.. I think I know what's the problem, the result of the dynamo graph is a PjRtShardedData, and we tried to cast it to PjRtData. This might has to do with @jonb377 's recent pr that make most things implicitly replicated. This should be a easy fix, I can work on it next week.

FYI @yeounoh

yeounoh · 2023-05-15T17:32:49Z

Ah Ok.. I think I know what's the problem, the result of the dynamo graph is a PjRtShardedData, and we tried to cast it to PjRtData. This might has to do with @jonb377 's recent pr that make most things implicitly replicated. This should be a easy fix, I can work on it next week.

FYI @yeounoh

Thanks @JackCaoG , I am going to merge a output param sharding patch, which might change the code path a bit. Let's chat offline, I can explain further.

…aceholder if SPMD is enabled

JackCaoG · 2023-05-16T01:45:41Z

+          // Device will be Virtual device if SPMD is enabled.
+          torch::lazy::BackendDevice device =
+              ShardingUtil::UseVirtualDevice() ? ParseDeviceString("SPMD:0")
+                                               : torch_xla::GetCurrentDevice();


@yeounoh I am not sure if we should just update GetCurrentDevice, any thought? We need to sit down and think about how to surface this virtual device to user soon..

I voted for GetCurrentDevice as there might be other scenario where the caller will also need to distinguish SPMD:0 with XLA:0.

GetCurrentDevice is being used over 30 places in our code base now, mostly during tracing and caller trying to figure out the hw type. I think it should be fine as long as SPMD:0 can be resolved into correct hardware type. I would leave that in a separate pr since it touches too many codes and might introduce noise.

JackCaoG · 2023-05-16T18:21:01Z

I think this one is ready for review, I will add more test cases(input data sharding, which I am not sure if it works or not) and features in the next pr.

alanwaketan

Generally, LGTM.

alanwaketan · 2023-05-16T19:37:30Z

        WrapXlaData(xla::ComputationClient::Get()->CreateDataPlaceholder(
            device.toString(), std::move(shape)));
+    // if SPMD is enabled, we assume all output will be replicated
+    if (ShardingUtil::UseVirtualDevice()) {


Why we now start adding this for the dynamo path? We don't need this for the LTC path?

Looks like this patch is dynamo exclusive... Should we hint this somewhere?

the lazy code path already have this logic, in fact I copt this logic from lazy code path lol

I smell an opportunity to merge two code paths more. But let's do it in a follow up.

alanwaketan · 2023-05-16T19:41:45Z

      if (auto xla_tensor_ptr = bridge::TryGetXlaTensor(ivalue.toTensor())) {
        dataptr = xla_tensor_ptr->GetXlaData();
      } else {
+        XLA_CHECK(device.type() != (int8_t)XlaDeviceType::SPMD)


What's this XLA_CHECK for?

hmm, not sure, I copy this from @yeounoh 's diff. @yeounoh any idea?

It's not needed, but more for a sanity check I probably added to ensure that this doesn't happen. Basically, we want to make sure that the SPMD device type is always on the backend (device data).

jonb377

LGTM, thanks Jack

jonb377 · 2023-05-17T18:23:47Z

+    # Add an additional 1x1 layer at the end to ensure the final layer
+    # is not sharded.
+    self.fc3 = nn.Linear(1, 1)


Is this due to the lack of output sharding propagation?

yea, in this pr I tried to keep it that output is replicated. We can expand this after output sharding pr is ready.

yeounoh · 2023-05-17T20:31:06Z

I think this one is ready for review, I will add more test cases(input data sharding, which I am not sure if it works or not) and features in the next pr.

Input sharding should (used to) work if the sharded input is used for the torch compilation. Let me know. Will take a pass on the chages now as well, thanks.

yeounoh · 2023-05-17T20:35:10Z

@@ -590,6 +593,15 @@ XLAGraphExecutor::ExecuteComputationWithBarrier(
    torch::lazy::BackendDataPtr handle =
        WrapXlaData(xla::ComputationClient::Get()->CreateDataPlaceholder(


If it's SPMD virtual device, then we should always use PjRtShardedData handle.

hmm, is the logic below to call WrapDataShards not enough? This code path is shared between spmd and non-spmd code path.

JackCaoG added 2 commits May 12, 2023 00:00

initial code change

2417643

Add simple test, which currently failed

8185065

JackCaoG added the distributed SPMD and other distributed things. label May 12, 2023

JackCaoG mentioned this pull request May 12, 2023

[WIP] Integrate Dynamo + SPMD for Inference #4862

Closed

JackCaoG added the dynamo label May 13, 2023

yeounoh self-requested a review May 15, 2023 17:31

Add try catch to dynamo, use SPMD device in dynamo, create sharded pl…

89113ca

…aceholder if SPMD is enabled

JackCaoG commented May 16, 2023

View reviewed changes

Comment thread test/spmd/test_dynamo_spmd.py

JackCaoG marked this pull request as ready for review May 16, 2023 18:20

JackCaoG changed the title ~~[WIP] Enable SPMD + dynamo for inference~~ Enable SPMD + dynamo for inference May 16, 2023

JackCaoG requested review from alanwaketan, jonb377 and wonjoo-wj May 16, 2023 18:20

JackCaoG commented May 16, 2023

View reviewed changes

Comment thread test/spmd/test_dynamo_spmd.py

alanwaketan approved these changes May 16, 2023

View reviewed changes

fix review comments

e47acfa

JackCaoG requested a review from alanwaketan May 17, 2023 00:05

jonb377 approved these changes May 17, 2023

View reviewed changes

alanwaketan approved these changes May 17, 2023

View reviewed changes

yeounoh reviewed May 17, 2023

View reviewed changes

Comment thread torch_xla/csrc/xla_graph_executor.cpp

change comments

2a8cc73

JackCaoG merged commit dc057dc into master May 18, 2023

wonjoo-wj mentioned this pull request Jul 14, 2023

[WIP] Materialize SPMD tensors during Dynamo graph compilation #4859

Closed

		@@ -590,6 +593,15 @@ XLAGraphExecutor::ExecuteComputationWithBarrier(
		torch::lazy::BackendDataPtr handle =
		WrapXlaData(xla::ComputationClient::Get()->CreateDataPlaceholder(

Conversation

JackCaoG commented May 12, 2023

Uh oh!

JackCaoG commented May 13, 2023

Uh oh!

JackCaoG commented May 13, 2023

Uh oh!

yeounoh commented May 15, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JackCaoG commented May 16, 2023

Uh oh!

Uh oh!

alanwaketan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alanwaketan May 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonb377 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yeounoh commented May 17, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alanwaketan May 16, 2023 •

edited

Loading