
[SPMD][PoC] compile & execute with PjRt #3684

Merged
JackCaoG merged 48 commits into master from xla_spmd_pjrt_integration on Oct 17, 2022

Conversation

@yeounoh (Contributor) commented on Jul 6, 2022:

This is a follow-up to #3476 and contributes to #3871. The changes include:

  • Compile partitioned HLO computation graph with sharding annotations.
  • PjRtComputationClient integration to support SPMD sharded operations.
  • PjRtShardedData struct to represent sharded Data.
  • InputHandler for parameter sharding and sharded data transfer.
  • Remove duplicate copies of sharding annotations.
  • ExecuteReplicated for partitioned computation.

The PoC implementation supports replicated and tiled sharding annotations and the single-host xla:tpu backend. This enables a simple sharded computation on a v3-8, like:

    t1 = torch.randn(1, 128, device='cpu')
    t2 = torch.randn(1, 128, device='cpu')
    expected = t1 @ t2.T

    xt1 = t1.to(xm.xla_device())
    xt2 = t2.to(xm.xla_device())
    xs.mark_sharding(xt1, (1, 8), (0, 1))
    self.assertEqual('{devices=[1,8]0,1,2,3,4,5,6,7}',
                     torch_xla._XLAC._get_xla_sharding_spec(xt1))

    actual = (xt1 @ xt2.T).cpu()
    self.assertTrue(torch.allclose(expected, actual))

@yeounoh added the DO_NOT_MERGE (Not for merging.) and distributed (SPMD and other distributed things.) labels on Jul 6, 2022
@yeounoh self-assigned this on Jul 6, 2022
@yeounoh marked this pull request as a draft on July 6, 2022 01:14
@yeounoh force-pushed the xla_spmd_pjrt_integration branch 5 times, most recently from d38bd7d to 09f4640 on July 11, 2022 07:18
@yeounoh force-pushed the xla_spmd_pjrt_integration branch from 09f4640 to 5e07428 on July 13, 2022 20:13
@yeounoh force-pushed the xla_spmd_pjrt_integration branch 4 times, most recently from c9399ac to 91262bc on July 23, 2022 00:14
@yeounoh force-pushed the xla_spmd_pjrt_integration branch 15 times, most recently from 0ca964c to c26f94b on July 26, 2022 04:28
@yeounoh (Contributor, Author) commented on Oct 13, 2022:

The CPU test passes, but the GPU test fails with the following, seemingly unrelated (at least at first glance) error:

*** Begin stack trace ***
	tsl::CurrentStackTrace[abi:cxx11]()
	
	
	gsignal
	abort
	
	xla::XrtLocalService::XrtLocalService(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)
	xla::XrtComputationClient::MaybeCreateLocalService(xla::XrtComputationClient::Options const&)
	xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
	xla::ComputationClient::Create()
	
	
	xla::ComputationClient::Get()
	
	
	_PyMethodDef_RawFastCallKeywords
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	
	_PyObject_GenericGetAttrWithDict
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	
	_Py_UnixMain
	__libc_start_main
	
*** End stack trace ***
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_torch_distributed_multi_all_reduce_xla_backend.py", line 38, in <module>
    xmp.spawn(_mp_fn, args=())
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.14-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 399, in spawn
    start_method=start_method)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 146, in join
    signal_name=name
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

The master branch is green, though. cc @JackCaoG

@JackCaoG (Collaborator) commented:
Seems irrelevant, let me just restart the GPU CI.

@JackCaoG (Collaborator) commented:
I will take another pass and try to merge it.

@yeounoh (Contributor, Author) commented on Oct 13, 2022:

> Seems irrelevant, let me just restart the GPU CI.

Yea, this one succeeded. Thanks @JackCaoG

Comment thread test/test_xla_sharding.py
expected = t + t

xt = t.to(xm.xla_device())
n_devices = xm.xrt_world_size()
Collaborator:

Does CI run this test, or do we only run it on TPU?

Contributor (Author):

We only run the C++ tests -- they cover the internal changes that affect the non-SPMD code paths -- and the Python API tests are disabled (link). I will re-enable them after debugging and adding the API unit tests.
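For reference, a minimal sketch of what such an API unit test could look like, assuming the xs.mark_sharding API shown in the PR description (the torch_xla.experimental.xla_sharding import path is an assumption here, not confirmed by this PR):

    import unittest

    import torch
    import torch_xla
    import torch_xla.core.xla_model as xm
    import torch_xla.experimental.xla_sharding as xs  # assumed module path


    class BasicShardingAPITest(unittest.TestCase):

        def test_simple_sharded_matmul(self):
            n_devices = xm.xrt_world_size()
            t1 = torch.randn(1, 128)
            t2 = torch.randn(1, 128)
            expected = t1 @ t2.T

            xt1 = t1.to(xm.xla_device())
            xt2 = t2.to(xm.xla_device())
            # Tile xt1's second dimension across all available devices.
            xs.mark_sharding(xt1, (1, n_devices), (0, 1))
            # A sharding spec should now be attached to the annotated tensor.
            self.assertNotEqual('', torch_xla._XLAC._get_xla_sharding_spec(xt1))

            # The sharded computation should still match the CPU result.
            actual = (xt1 @ xt2.T).cpu()
            self.assertTrue(torch.allclose(expected, actual))


    if __name__ == '__main__':
        unittest.main()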

virtual void TransferToServer(absl::Span<const TensorSource> tensors,
absl::Span<const DataPtr> datas) = 0;

// Transfers local sharded tensor values to the TPU servers and returns a
Collaborator:

I would use "TPU device" instead of "TPU server"; there is no server in the PjRt context.

Contributor (Author):

Done

Comment thread torch_xla/csrc/tensor.cpp
void XLATensor::SetShardingSpec(const ShardingSpec& sharding_spec) {
XLA_CHECK(GetIrValue().node != nullptr) << "Tyring to access a null cursor";
dynamic_cast<XlaNode*>(data()->ir_value.node.get())
dynamic_cast<XlaNode*>(GetIrValue().node.get())
Collaborator:

Hmm, we should add an XlaNodeCast helper to replace dynamic_cast<XlaNode*> so it is cleaner.

Contributor (Author):

I see. I normally prefer more explicit type identifiers, especially for casting (similar to avoiding overuse of auto).

Comment thread torch_xla/csrc/tensor.cpp
// TODO(yeounoh): Sharding annotation must be removed by explicit call to
// ClearSharding.
ShardingSpecPtr sharding = sharding_spec();
if (sharding != nullptr) {
Collaborator:

We need a test for this. For example, when we deep copy a tensor with sharding, the resulting tensor should also carry the sharding. Something similar to:

y = copy.deepcopy(x)

@steventk-g can you add a test case?
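A minimal sketch of such a test, assuming the xs.mark_sharding API from the description (the torch_xla.experimental.xla_sharding import path is assumed, and the exact assertion is illustrative):

    import copy

    import torch
    import torch_xla
    import torch_xla.core.xla_model as xm
    import torch_xla.experimental.xla_sharding as xs  # assumed module path

    xt = torch.randn(1, 128).to(xm.xla_device())
    xs.mark_sharding(xt, (1, xm.xrt_world_size()), (0, 1))

    # The deep copy should carry the same sharding annotation as the original.
    xt_copy = copy.deepcopy(xt)
    assert (torch_xla._XLAC._get_xla_sharding_spec(xt_copy) ==
            torch_xla._XLAC._get_xla_sharding_spec(xt))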

@steventk-g (Collaborator) commented on Oct 14, 2022:

Yep, I've created an issue to track it #4095

Contributor (Author):

Good point. @steventk-g, let me handle this if you haven't already started.

Comment thread torch_xla/csrc/tensor.cpp

auto cached_computation = std::make_shared<CachedComputation>(
std::move(compile_result.computation));
std::move(compile_result.computation), compile_result.is_sharded);
Collaborator:

Why do we need is_sharded separately in CachedComputation?

Contributor (Author):

We could either pass is_sharded around between APIs or wrap it inside the CachedComputation. is_sharded is needed later for execution (and is associated only with the cached computation), and the latter option doesn't require changing function APIs here and there.

@JackCaoG (Collaborator) left a review:

Mostly LGTM. I had a question regarding ExecuteReplicated in #3684 (comment). If we can align on that, this PR is ready to merge.

@JackCaoG (Collaborator) left a review:

Thanks @yeounoh! I will merge this PR to unblock @steventk-g.


Labels: distributed (SPMD and other distributed things.)

5 participants