
Optimize execution for ops that have multiple outputs in eager mode #7680

Merged
JackCaoG merged 2 commits into master from JackCaoG/eager_faster
Jul 16, 2024

Conversation

@JackCaoG
Collaborator

In eager mode, execution happens when we create an XLATensor with an IR; we use that IR as the root to build and execute the graph.

This is mostly fine, but for ops that have multiple outputs (like native_batch_norm), the outputs share a good amount of common HLOs, so it is much faster to execute all of them in a single graph. Eager mode in PyTorch/XLA can't really execute HLOs one by one, so the goal is to execute once (ideally) per PyTorch op.
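To make the shared-HLO point concrete, here is a toy, pure-Python sketch (not the actual PyTorch/XLA lowering) of a batch-norm-like op: its three outputs are all rooted in the same mean/variance subcomputation, which is why lowering them into one graph avoids redundant work.

```python
# Toy sketch (plain Python, not the actual HLO lowering): a batch-norm-like
# op with three outputs that share the mean/variance computation.
def batch_norm_1d(xs, eps=1e-5):
    n = len(xs)
    mean = sum(xs) / n                            # shared intermediate
    var = sum((x - mean) ** 2 for x in xs) / n    # shared intermediate
    inv_std = (var + eps) ** -0.5
    normalized = [(x - mean) * inv_std for x in xs]
    # Three outputs, all built on the same mean/var subgraph; executing
    # them one by one would recompute mean and var for each output.
    return normalized, mean, inv_std

out, mean, inv_std = batch_norm_1d([1.0, 2.0, 3.0, 4.0])
print(mean)  # 2.5
```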

The change in this PR will:

  1. delay the eager execution for some ops when they create new XLATensors with IRs, and
  2. execute the HLO for all of those XLATensors after they are created.

I will take another pass to make sure I didn't mess anything up, but I would appreciate it if someone could look closely at my change inside tensor_methods.cpp.
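The two steps can be sketched with a toy lazy-tensor model (all names here are hypothetical, not the actual torch_xla internals): tensor creation only records the IR, and a single execution is issued once all of an op's outputs exist.

```python
# Toy model of the delayed-execution pattern (hypothetical names, not the
# torch_xla implementation).
executions = []  # records each graph execution and which IRs it covered

class LazyTensor:
    def __init__(self, ir):
        # Step 1: creating a tensor with an IR no longer triggers execution.
        self.ir = ir

def execute_graph(tensors):
    # One graph execution whose roots are all of the outputs' IRs.
    executions.append([t.ir for t in tensors])

def multi_output_op():
    # Build every output tensor first...
    outs = [LazyTensor(f"out{i}") for i in range(3)]
    # ...then (step 2) execute once for the whole op.
    execute_graph(outs)
    return outs

multi_output_op()
print(len(executions))  # 1 execution instead of 3
```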

@JackCaoG JackCaoG added the eager PyTorch/XLA eager-mode label Jul 12, 2024
@JackCaoG
Collaborator Author

I also intentionally didn't handle the collectives. Collectives return an all_reduce token, which we actually don't want to execute in the eager case. I will handle that in a separate PR.

@aws-rhsoln
Contributor

Curious how much perf boost do we expect when we fuse them into a single graph?

@JackCaoG
Collaborator Author

JackCaoG commented Jul 15, 2024

Curious how much perf boost do we expect when we fuse them into a single graph?

For the following test code:

import time

import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

torch_xla.experimental.eager_mode(True)

device = torch_xla.device()
m = nn.BatchNorm2d(16).to(device)
m.train()
input = torch.randn(16, 16, 1024, 1024, device=device)

start = time.time()
for _ in range(20):
  input = m(input)
xm.wait_device_ops()
end = time.time()
duration = end - start
print(f"total time = {duration}")

With my change, total time = 0.46190381050109863; without it, total time = 14.28174352645874. I actually don't know why it is ~30x faster, but I did verify that in the HLO without my change, BatchNorm2d computes the results one by one.

@JackCaoG JackCaoG marked this pull request as ready for review July 15, 2024 18:37
@JackCaoG
Collaborator Author

@alanwaketan @wonjoolee95 This one is ready for review.

@JackCaoG JackCaoG merged commit b2c7f65 into master Jul 16, 2024
