Enable eager spmd #7341

Merged
JackCaoG merged 4 commits into master from JackCaoG/eager_spmd
Jun 26, 2024
Conversation

@JackCaoG (Collaborator)
No description provided.

@JackCaoG added the labels "eager: PyTorch/XLA eager-mode" and "distributed: SPMD and other distributed things." on Jun 24, 2024
@JackCaoG force-pushed the JackCaoG/eager_spmd branch from 3ec9f92 to 4244dcb on June 25, 2024 22:55
@JackCaoG marked this pull request as ready for review June 26, 2024 18:30
@alanwaketan (Collaborator)

How could SPMD possibly work for eager mode?

@JackCaoG (Collaborator, Author)

> How could SPMD possibly work for eager mode?

Consider eager mode as calling mark_step after every PyTorch op.
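To make that analogy concrete, here is a minimal pure-Python sketch (hypothetical names, not the actual torch_xla implementation): in the default lazy mode, traced ops accumulate until mark_step compiles them as one graph, while eager mode flushes after every single op.

```python
class LazyTensorRuntime:
    """Toy model of a lazy-tensor runtime with an optional eager mode."""

    def __init__(self, eager_mode=False):
        self.eager_mode = eager_mode
        self.pending_ops = []      # ops traced but not yet executed
        self.executed_graphs = []  # each entry: the ops of one compiled graph

    def trace_op(self, name):
        self.pending_ops.append(name)
        if self.eager_mode:
            # Eager mode behaves like calling mark_step after every op.
            self.mark_step()

    def mark_step(self):
        # Compile and execute everything traced so far as one graph.
        if self.pending_ops:
            self.executed_graphs.append(tuple(self.pending_ops))
            self.pending_ops = []

# Lazy (default) mode: one graph containing all three ops.
lazy = LazyTensorRuntime(eager_mode=False)
for op in ("cos", "add", "mul"):
    lazy.trace_op(op)
lazy.mark_step()
assert lazy.executed_graphs == [("cos", "add", "mul")]

# Eager mode: three single-op graphs, executed immediately.
eager = LazyTensorRuntime(eager_mode=True)
for op in ("cos", "add", "mul"):
    eager.trace_op(op)
assert eager.executed_graphs == [("cos",), ("add",), ("mul",)]
```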

@alanwaketan (Collaborator)

>> How could SPMD possibly work for eager mode?
>
> consider eager mode as calling mark_step after every pytorch op.

Then how do sharding propagation and auto-partitioning work? I assume they don't carry state over from the last graph?

@JackCaoG (Collaborator, Author)

>>> How could SPMD possibly work for eager mode?
>>
>> consider eager mode as calling mark_step after every pytorch op.
>
> Then how sharding propagation and auto partition work? I assume they don't carry states from last graph?

Sharding propagation and auto-partitioning still happen within each subgraph we compile. For example:

t3 = t1.cos()
t3 += t2

We will compile a graph for the cos, which calculates the output sharding for t3 and assigns t3 a sharded PJRT buffer that is not yet ready. We then proceed with another graph for the add; at that point we already know the input sharding for t3, so we just propagate it to the output.
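The two-graph flow above can be sketched in plain Python (hypothetical names such as ShardedTensor and compile_and_launch are illustrative, not torch_xla APIs): each one-op subgraph computes an output sharding from its inputs, so sharding information flows between graphs through the tensors themselves rather than through any state carried over from the previous graph.

```python
class ShardedTensor:
    """Toy stand-in for an XLA tensor backed by a sharded PJRT buffer."""

    def __init__(self, name, sharding=None):
        self.name = name
        self.sharding = sharding  # e.g. a partition spec like ("x", None)
        self.ready = False        # buffer can be assigned before data exists

def compile_and_launch(op, inputs, output_name):
    # The SPMD partitioner runs on this one-op subgraph. As a stand-in
    # for real sharding propagation, forward the first input's sharding
    # to the output and hand back a not-yet-ready sharded buffer.
    out = ShardedTensor(output_name, sharding=inputs[0].sharding)
    out.ready = False
    return out

t1 = ShardedTensor("t1", sharding=("x", None))
t2 = ShardedTensor("t2", sharding=("x", None))

# Graph 1: cos. The output sharding of t3 is decided here, and t3 gets
# a sharded buffer even though the computation may not have finished.
t3 = compile_and_launch("cos", [t1], "t3")
assert t3.sharding == ("x", None)

# Graph 2: add. t3's input sharding is already known from graph 1 and
# simply propagates to the new output.
t4 = compile_and_launch("add", [t3, t2], "t4")
assert t4.sharding == ("x", None)
```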

@alanwaketan (Collaborator)

Okay, that's fair.

  XLATensor::ShardingSpecPtr sharding = input_tensor->sharding_spec();
- if (sharding && sharding->sharding.type() != xla::OpSharding::UNKNOWN) {
+ // don't propagate sharding in eager mode.
+ if (!XLAGraphExecutor::Get()->UseEagerMode() && sharding &&
Collaborator
May I ask why?

Collaborator Author

It complained that the output tensor already has a sharding and we can't propagate to it. This happens in the backward pass. I didn't spend enough time debugging it, but I don't expect users to actually run eager mode with the step fn (forward and backward); I only expect them to run it for some data preprocessing on device, so I just quickly unblocked myself.
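The guard in the diff above boils down to one extra condition. Here is a pure-Python model of that predicate (not the real C++, and should_propagate_sharding is a name invented for this sketch): in eager mode, propagation is skipped even when the input carries a known sharding.

```python
# Stand-in for xla::OpSharding::UNKNOWN from the diff above.
UNKNOWN = "UNKNOWN"

def should_propagate_sharding(sharding, eager_mode):
    # Mirrors: if (!XLAGraphExecutor::Get()->UseEagerMode() && sharding &&
    #              sharding->sharding.type() != xla::OpSharding::UNKNOWN)
    return (not eager_mode) and sharding is not None and sharding != UNKNOWN

# Lazy mode with a known sharding: propagate.
assert should_propagate_sharding("tiled", eager_mode=False) is True
# Eager mode: never propagate, avoiding the "output already has a
# sharding" complaint seen in the backward pass.
assert should_propagate_sharding("tiled", eager_mode=True) is False
# No sharding, or an UNKNOWN one: nothing to propagate either way.
assert should_propagate_sharding(None, eager_mode=False) is False
assert should_propagate_sharding(UNKNOWN, eager_mode=False) is False
```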

@alanwaketan (Collaborator) left a comment

LGTM. Thanks, Jack!

@JackCaoG JackCaoG merged commit d5e5713 into master Jun 26, 2024
JackCaoG added a commit that referenced this pull request Jul 12, 2024
bhavya01 pushed a commit that referenced this pull request Jul 15, 2024
