Conversation
Collaborator
Author
Test should pass now. I want to take another pass to review some of the cpp implementation, and then I will request review.
JackCaoG
commented
Oct 25, 2022
```cpp
std::vector<at::IValue> ivalues;
std::vector<torch::lazy::Node*> roots;
for (auto& tensor : tensors) {
  auto xtensor = bridge::TryGetXlaTensor(tensor);
```
Collaborator
Author
We should check this value before appending to the list; this might segfault if the tensor is not an XlaTensor.
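The fix the review asks for can be sketched as follows. This is illustrative Python, not the real `bridge` API: `try_get_xla_tensor` here stands in for `bridge::TryGetXlaTensor`, which may return a null result (modeled as `None`) when the input is not an XLA tensor.

```python
# Hypothetical sketch: skip tensors that do not resolve to an XLA tensor
# instead of appending a null result, which is what could later segfault.
def collect_xla_tensors(tensors, try_get_xla_tensor):
    """Return only the tensors that resolve to an XLA tensor."""
    xla_tensors = []
    for tensor in tensors:
        xtensor = try_get_xla_tensor(tensor)  # may return None (nullptr in C++)
        if xtensor is None:
            continue  # silently skip non-XLA tensors rather than crash later
        xla_tensors.append(xtensor)
    return xla_tensors
```

The same shape applies in the C++ loop above: check the result of `TryGetXlaTensor` before pushing it onto `ivalues`/`roots`.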
This reverts commit 9181bed.
Force-pushed from 488b9bf to a29d6d3.
JackCaoG
commented
Oct 25, 2022
```cpp
// setup the parameters_data
std::vector<xla::ComputationClient::DataPtr> parameters_data;
auto device = torch_xla::GetCurrentDevice();
```
Collaborator
Author
This is OK for now, but I feel like it should be something passed to this function from Python. @shunting314 What do you think?
JackCaoG
commented
Oct 25, 2022
Collaborator
Author
I will fix the comments I raised, and then I think this PR is good to go. FYI @wconstab
JackCaoG
commented
Oct 25, 2022
Collaborator
Author
OK, I think this one is ready for review now.
shunting314
reviewed
Oct 26, 2022
Collaborator
shunting314
left a comment
Thank you very much for helping merge this!
LGTM, but I'll let other people on your team stamp in case they have other feedback.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request on Oct 29, 2022
# Motivation

- torchdynamo and torchxla use different strategies to achieve sound graph capture: the former relies on guards; the latter relies on retracing.
- The guard system has quite low overhead, but torchxla's tracing overhead is quite high.

The main idea is to leverage the guard system in torchdynamo to avoid retracing in torchxla, so that:

- we can integrate torchdynamo with XLA
- we reduce, or even completely avoid, the tracing overhead of torchxla

# Technique details

## XLA baseline

We found that different frameworks do not generate numerically identical results for the SAME model with the SAME input. By default, torchdynamo uses eager as the baseline, so the model runs with PyTorch. It would be tricky to compare a model running on XLA with this baseline: it's hard to check correctness. To make the comparison easier, we add a flag `--use-xla-baseline`. When it's enabled, the baseline is run on XLA.

## New dynamo backends added

We add two new dynamo backends, torchxla_trivial and torchxla_trace_once, to control the optimization targets.

torchxla_trivial simply moves inputs/model parameters to XLA and runs the model on XLA. There is tracing overhead for each run. We expect the result to be mostly neutral compared to the XLA baseline.

torchxla_trace_once only traces once, at AOT compile time. Here are the steps:

1. dynamo captures the guards and the subgraph
2. the torchxla_trace_once backend traces the graph with torchxla, lowers the graph, and records a hash of the graph for later lookup
3. at inference time, the hash is used directly to look up the optimized graph and run it

# Limitations

We cannot handle LTC/torchxla fallback right now. If an op is missing an LTC kernel, we raise an exception, which results in a dynamo fallback (or trying another compiler). People have brainstormed the idea of breaking the graph and stitching the subgraphs together, but it may be easier to add the missing LTC kernels for those models.

# Results

The models we tested are those that do not cause LTC fallback. We ran the tests on **GPU**. We see a **1.38x** geomean speedup for torchxla_trace_once, and torchxla_trivial is mostly neutral as expected.

```
| Model                   | XLA (trace once)   | XLA (trace everytime)   |
+=========================+====================+=========================+
| resnet18                | 1.346              | 1.045                   |
+-------------------------+--------------------+-------------------------+
| resnet50                | 1.153              | 1.007                   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         | 1.381              | 1.039                   |
+-------------------------+--------------------+-------------------------+
| alexnet                 | 1.045              | 1.018                   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            | 1.562              | 1.021                   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              | 1.303              | 1.069                   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           | 1.278              | 1.025                   |
+-------------------------+--------------------+-------------------------+
| vgg16                   | 1.076              | 1.008                   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            | 2.224              | 0.978                   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 1.81               | 1.025                   |
+-------------------------+--------------------+-------------------------+
| geomean                 | 1.38101            | 1.02324                 |
+-------------------------+--------------------+-------------------------+
```

The speedup is similar to what we saw in previous work on LTC's TorchScript backend (1.40 geomean speedup there): https://docs.google.com/presentation/d/1G09X8v41u_cLKLtSdf7v6R8G19-iZTPcW_VAdOnvYBI/edit#slide=id.g11bf989cb6b_1_5

# Next steps

- Use AOT autograd to enable training
- Share results on XLA devices
- Do more extensive tests on torchbench models

Example command:

```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --use-xla-baseline --only resnet18 --backend=torchxla_trace_once
```

Thanks @JackCaoG from the torchxla team for helping debug various perf issues and merging the torchxla PR! That was super critical for getting the results above. torchxla side PR: pytorch/xla#4119

topic: not user facing

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel

Pull Request resolved: #87741
Approved by: https://github.com/wconstab
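The three torchxla_trace_once steps above amount to a hash-keyed cache of compiled graphs. The following is a minimal illustrative sketch of that idea, not the real torchxla implementation: `trace_and_compile` stands in for the expensive torchxla trace plus XLA lowering, and the graph is modeled as a source string for simplicity.

```python
# Sketch of the torchxla_trace_once idea: trace/compile once per graph,
# key the compiled artifact by a hash, and on later calls skip tracing
# entirely by looking the artifact up directly.
import hashlib

_compiled_cache = {}  # graph hash -> compiled executable


def trace_and_compile(graph_src):
    """Stand-in for the expensive torchxla trace + XLA lowering step."""
    return eval(compile(graph_src, "<graph>", "eval"))


def run_graph(graph_src):
    key = hashlib.sha256(graph_src.encode()).hexdigest()
    if key not in _compiled_cache:            # AOT path: trace exactly once
        _compiled_cache[key] = trace_and_compile(graph_src)
    return _compiled_cache[key]               # steady state: hash lookup only
```

In the real integration, dynamo's guards decide whether the cached entry is still valid for the current inputs; the cache lookup replaces a full retrace on every call, which is where the speedup comes from.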
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request on Nov 5, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request on Dec 10, 2022
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request
This PR is based on @shunting314's branch at https://github.com/shunting314/torchdynamo/tree/dynamo-torchxla-integration. I am helping him add tests and land this PR in pytorch/xla.
We have tested an old version of it and were able to get inference running.
Upstream pytorch PR: pytorch/pytorch#87741