[WIP] Integrate Dynamo + SPMD for Inference#4862

Closed
steventk-g wants to merge 13 commits into pytorch:master from steventk-g:steven_spmd_inference

Conversation

Collaborator

@steventk-g steventk-g commented Apr 7, 2023

Includes @yeounoh's changes to use SPMD with torch.compile

@steventk-g steventk-g changed the title Integrate Dynamo + SPMD for Inference [WIP] Integrate Dynamo + SPMD for Inference Apr 7, 2023
@steventk-g steventk-g requested a review from yeounoh April 7, 2023 21:51
Comment thread test/spmd/test_dynamo_spmd_inference_latency.py Outdated
if FLAGS.fake_data:
assert FLAGS.test_set_batch_size == 1
test_loader = xu.SampleGenerator(
data=(torch.zeros(FLAGS.test_set_batch_size, 3, img_dim, img_dim).to(device),
Contributor

@yeounoh yeounoh Apr 10, 2023

Maybe we should add a comment explaining that we needed .to(device) here for spmd+dynamo?

Contributor

@yeounoh yeounoh left a comment

Mostly LGTM, added a few minor comments. Also we need to address linter checks.

@steventk-g steventk-g force-pushed the steven_spmd_inference branch 2 times, most recently from fdcf5f0 to a1e0c14 Compare April 11, 2023 23:42
Comment thread test/spmd/test_inference_spmd_dynamo_imagenet.py Outdated
torch.manual_seed(42)

model = torchvision.models.resnet50().to(
device) # get_model_property('model_fn')().to(device)
Contributor

Need to clean this up, either use get_model_property('model_fn')().to(device) or stick to resnet50(), in which case we don't need some of the model props code above.

Comment thread test/dynamo/test_bridge.py Outdated

import torch_xla.core.dynamo_bridge as bridge
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs
Contributor

Is this needed?

Collaborator Author

Removed

Comment thread test/dynamo/test_bridge.py Outdated
python3 pytorch/xla/test/spmd/test_xla_virtual_device.py
python3 pytorch/xla/test/spmd/test_train_spmd_linear_model.py
XLA_USE_SPMD=1 python3 pytorch/xla/test/spmd/test_train_spmd_linear_model.py --sharding batch
XLA_USE_SPMD=1 python3 pytorch/xla/test/spmd/test_inference_spmd_dynamo_imagenet.py --fake_data --use_dynamo --sharding batch
Contributor

awesome, thank you!

Collaborator

How long will this test take? It is inference so I guess it won't be too bad, but I am trying to avoid adding too much burden to the already very long CI.

Comment thread torch_xla/core/dynamo_bridge.py Outdated


def is_xla_tensor(tensor: torch.Tensor) -> bool:
# TODO(yeounoh) check if tensor sharding annotation can be accessed here
Contributor

I think we can remove my comment here now.

Contributor

@yeounoh yeounoh left a comment

Added more comments

Comment on lines +189 to +191
xm.mark_step()
xm.wait_device_ops()
met.clear_all()
Collaborator

why are these needed? I had these in my dynamo test because the test data is fake, so I needed a mark_step to materialize the fake input; if that's not the case here we don't need to do a mark_step.

Collaborator Author

We need to materialize the input here as well. @yeounoh may be able to add more context

Collaborator

ok, maybe add a comment above to mention that this mark_step is to materialize the fake input then.

Collaborator Author

Done
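The mark_step discussion above hinges on lazy-tensor semantics: operations on the fake input are only recorded, not executed, until a barrier flushes them. A rough plain-Python analogy of that behavior (the LazyTensor class here is illustrative only, not the torch_xla API):

```python
# Conceptual sketch (plain Python, not torch_xla): why a mark_step()-style
# barrier is needed to materialize lazily recorded tensor operations
# before timing or inspecting anything.

class LazyTensor:
    """Records operations instead of executing them immediately."""

    def __init__(self, value):
        self.value = value
        self.pending_ops = []   # deferred ops, executed only at the barrier

    def add_(self, n):
        # Lazily record the op; nothing is computed yet.
        self.pending_ops.append(lambda v: v + n)
        return self

    def materialize(self):
        # The barrier: flush all recorded ops, like xm.mark_step().
        for op in self.pending_ops:
            self.value = op(self.value)
        self.pending_ops.clear()
        return self.value

t = LazyTensor(0)
t.add_(1).add_(2)
assert t.pending_ops            # ops are still pending before the barrier
result = t.materialize()        # "mark_step": the fake input is now materialized
assert result == 3 and not t.pending_ops
```

In the real test, xm.wait_device_ops() additionally blocks until the device finishes, and met.clear_all() resets metrics so only the measured run is counted.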

Comment thread test/spmd/test_inference_spmd_dynamo_imagenet.py
&coll, std::move(arguments), placeholders, std::move(cachedComputation));

auto syncfn = [async, hash]() {
auto syncfn = [async, hash, sharding_specs]() {
Collaborator

why does sharding_specs need to be passed from the original function? It is initialized with a default value, right? We might as well just initialize it within this function.

Collaborator

ok, I guess the point here is that we need to supply a proper sharding spec for the output, but that is currently not supported yet. As a follow-up, what's the right way to determine the sharding spec for the output?

Collaborator Author

Good question, I'm not sure of the right way to determine the sharding spec for the output. @yeounoh might be able to answer or provide more context.

Collaborator

That's fine, we will leave that as a follow-up.
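One naive shape the deferred follow-up could take: propagate a sharding spec from the inputs when they agree, and fall back to replication otherwise, since replicated output is always correct (if wasteful). A hedged sketch in plain Python; the spec values and the propagation rule are hypothetical, not the actual torch_xla behavior:

```python
# Hypothetical sketch of output sharding-spec propagation. The string
# specs ("batch", "replicated") are illustrative stand-ins, not real
# torch_xla sharding types.

REPLICATED = "replicated"

def output_sharding_spec(input_specs):
    """If every annotated input agrees on one spec, reuse it for the
    output; otherwise fall back to replicated."""
    specs = {s for s in input_specs if s is not None}
    if len(specs) == 1:
        return specs.pop()
    return REPLICATED

assert output_sharding_spec(["batch", "batch"]) == "batch"
assert output_sharding_spec(["batch", "model"]) == REPLICATED
assert output_sharding_spec([None, None]) == REPLICATED
```

A real implementation would instead run XLA's sharding propagation pass over the computation, which this sketch deliberately ignores.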

Collaborator

@JackCaoG JackCaoG left a comment

mostly LGTM.

FYI @wonjoolee95: in this PR we break a really important assumption of dynamo, namely that the same fx graph will result in the same IR/HLO graph. In the SPMD case, even if the fx graph is the same, we can annotate the input XLATensor and make it generate a different IR/HLO graph.

There is one thing that is still unclear to me, and I hope @yeounoh and maybe @alanwaketan can give me an answer. AFAIK, during the initial phase of dynamo's extract_compile_graph, dynamo will pass us the fx graph along with fake XLATensors with the same shapes as the real model. Does dynamo somehow correctly move the sharding annotation from the real XLATensor to the fake XLATensor? Otherwise I would expect we hit a cache miss when we execute the real SPMD computation.
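The cache-miss concern above can be illustrated with a toy compilation cache in plain Python (the cache and compile function are hypothetical, not the dynamo bridge's actual code): if the key were the traced graph alone, two runs of the same graph with different input sharding annotations would collide, so the annotations must be part of the key.

```python
# Illustrative sketch of a compilation cache whose key includes the
# input sharding annotations, so identical fx graphs with different
# sharding still compile separately.

compiled_cache = {}
compile_count = 0

def compile_graph(fx_graph_src, input_shardings):
    global compile_count
    # Key on both the graph and the sharding annotations.
    key = (fx_graph_src, tuple(input_shardings))
    if key not in compiled_cache:
        compile_count += 1      # a real backend would lower to HLO here
        compiled_cache[key] = f"hlo<{fx_graph_src}|{input_shardings}>"
    return compiled_cache[key]

graph = "def f(x): return x + 1"
compile_graph(graph, ["replicated"])
compile_graph(graph, ["replicated"])   # cache hit: same graph, same sharding
compile_graph(graph, ["batch"])        # recompile: same graph, new sharding
assert compile_count == 2
```

If the fake tensors dynamo traces with did not carry the real tensors' sharding annotations, the trace-time key and the run-time key would differ, which is exactly the cache miss the comment worries about.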

@steventk-g steventk-g force-pushed the steven_spmd_inference branch from d774de0 to cf1f0f4 Compare April 24, 2023 16:26
@JackCaoG
Collaborator

@steventk-g let me know when you want me to take another look

@JackCaoG
Collaborator

JackCaoG commented May 2, 2023

I am taking over this commit; since it is on @steventk-g's fork and already has conflicts with master, I will just open a new PR and patch from master.

@alanwaketan
Collaborator

> I am taking over this commit; since it is on @steventk-g's fork and already has conflicts with master, I will just open a new PR and patch from master.

Let me know when it's ready. I can review it.

@JackCaoG
Collaborator

Close this one in favor of #5002

@JackCaoG JackCaoG closed this May 12, 2023