IFRT prototype #5677
Conversation
At this point, I am able to run our SPMD ResNet50 example successfully, but it's extremely slow. Anywhere I resharded an array, I chose "copy" semantics for safety. I'll have to take another pass and carefully think through the ownership of the underlying data, since the excessive copies are likely contributing to poor performance. Sharded execution does not appear to be implemented at all within the IFRT/PJRT wrapper, so I marked that method unimplemented for now. Likewise, dynamic shape is currently unsupported. Until we have feature parity, I will keep IFRT as a separate implementation.
Coming back to this PR (finally) after merging supporting changes in separate PRs. Performance is significantly better after rebasing -- it only lags PJRT by ~10% on ResNet50 now, compared to 80% in my first draft. There's still room for optimization, particularly around reducing the number of copies used when transforming IFRT arrays. IFRT is still highly experimental in this state, and there are known outstanding issues beyond performance.

I'll clean up this PR and send it for review as an optional/experimental setting.
Performance on Llama 7B is not bad! It's somewhere between PJRT now and PJRT before I started working on some optimizations this month.
If you don't intend to merge this for the 2.2 release, I will hold on the review until the branch cut.
Merging after the cut sounds good to me. This won't be useful in the 2.2 release.
I will take a look today |
```cpp
// Builds a map from the device's global ordinal to its index in the `devices`
// array.
std::unordered_map<int, int> build_index_map(
```
For these utils, can we share between PJRT and IFRT? Or are they actually different?
This was intentional. I wanted to minimize changes to common code, and ideally we would be able to remove the PJRT computation client at some point. Would you prefer I try to factor out all common functionality in this PR?
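For context, a standalone sketch of the helper under discussion might look like the following. The `Device` struct here is a hypothetical stand-in (the real client holds PJRT/IFRT device objects); this is not the PR's exact implementation, just an illustration of what the duplicated utility does.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical minimal device type; only the global ordinal matters here.
struct Device {
  int global_ordinal;
};

// Builds a map from a device's global ordinal to its index in the `devices`
// array (sketch of the utility quoted in the diff above).
std::unordered_map<int, int> build_index_map(
    const std::vector<Device>& devices) {
  std::unordered_map<int, int> ordinal_to_index;
  ordinal_to_index.reserve(devices.size());
  for (int i = 0; i < static_cast<int>(devices.size()); ++i) {
    ordinal_to_index[devices[i].global_ordinal] = i;
  }
  return ordinal_to_index;
}
```

Since the function depends only on the device list, it could in principle be shared between the two clients, which is the reviewer's point above.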
```cpp
IfrtComputationClient::IfrtComputationClient() {
  std::string device_type = sys_util::GetEnvString(env::kEnvPjRtDevice, "");
```
Let's do a grep for pjrt in this file and replace them with ifrt. Though I am curious: did you intend to query `kEnvPjRtDevice` here?
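For reference, `sys_util::GetEnvString` just reads an environment variable with a fallback default. A standalone equivalent (my sketch, not the torch_xla source) is:

```cpp
#include <cstdlib>
#include <string>

// Standalone equivalent of sys_util::GetEnvString (sketch): returns the
// value of the environment variable `name`, or `defval` if it is unset.
std::string GetEnvString(const char* name, const std::string& defval) {
  const char* value = std::getenv(name);
  return value != nullptr ? std::string(value) : defval;
}
```

Whether the constructor should read the PJRT device variable or an IFRT-specific one is exactly the naming question raised above; the lookup mechanism itself is the same either way.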
```cpp
    const std::vector<DataPtr>& shards, std::string device, xla::Shape shape,
    xla::OpSharding sharding) {
  // TODO: implement CreateDataPlaceholder for sharded data
  if (shards.size() == 0) {
```
Is there a legit use case for `shards.size() == 0`?
This is how sharded data placeholders get created right now: `xla/torch_xla/csrc/xla_sharding_util.cpp`, lines 590 to 593 (at a2f80e4).

I'd rather update `CreateDataPlaceholder` to take a sharding and make it more explicit.
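That suggestion could look roughly like the sketch below, which passes an optional sharding instead of signaling "sharded placeholder" with an empty `shards` vector. The types here are simplified stand-ins for illustration; the real signatures use `xla::Shape`, `xla::OpSharding`, and the client's own `Data` type.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Simplified stand-ins for the client's types (hypothetical).
struct Shape {};
struct OpSharding {};
struct Data {
  std::string device;
  Shape shape;
  bool is_sharded = false;
};
using DataPtr = std::shared_ptr<Data>;

// Sketch: an explicit optional sharding replaces the `shards.size() == 0`
// convention for creating sharded placeholders.
DataPtr CreateDataPlaceholder(
    std::string device, Shape shape,
    std::optional<OpSharding> sharding = std::nullopt) {
  auto data = std::make_shared<Data>();
  data->device = std::move(device);
  data->shape = shape;
  data->is_sharded = sharding.has_value();
  return data;
}
```

The upside is that callers state their intent directly, and an accidentally empty shard list can then be treated as an error rather than silently meaning "placeholder".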
This reverts commit 7d52f67.
- Implemented `ComputationClient` with IFRT, which currently just wraps PJRT. Enabled with `XLA_USE_IFRT=1`.
- Factored out `initialize_pjrt.cc/h`, since IFRT wraps the same `PjRtClient`.
- `pjrt_computation_client`: `spmd_device_str`
- Changed the `const PjRtClient` to a `unique_ptr`. Only SE:TPU required us to use `shared_ptr`, which is now removed.

There's still some opportunity to refactor common functionality up to `ComputationClient`, but I'm trying to minimize changes to the high-level API for this PR.

IFRT is still highly experimental. Use at your own risk. See my comments below for caveats and limitations.
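Opting in follows the usual environment-variable pattern. The flag name `XLA_USE_IFRT` comes from the description above; the launch command is a placeholder.

```shell
# Enable the experimental IFRT-backed client for this shell session.
export XLA_USE_IFRT=1
# Then launch as usual, e.g.: python train.py  (placeholder script name)
echo "IFRT enabled: ${XLA_USE_IFRT}"
```

Leaving the variable unset keeps the existing PJRT computation client, so the default behavior is unchanged.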