dynamo/torchxla integration: trace on xla rather than eager #88904
shunting314 wants to merge 1 commit into master
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88904. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 371e50f. This comment was automatically generated by Dr. CI and updates every 15 minutes.
I think we should be doing the mark_step on the input (and output) boundary of the dynamo-xla backend, in order to ensure our graph tracing starts from fresh XLATensors, not XLATensors that have already accumulated some trace. I think you're also working on that change in parallel. Can we just get that working first and then not need this?
@wconstab hmm, I think we still need it.
E.g., if we skip this mark_step call and do it inside the optimized function, then the optimized function will incur the overhead of computing the randomized inputs. The other call we compare against does not have this overhead.
That would make the comparison unfair.
We add a mark_step or sync call in a few other places for the baseline. Since we trace on XLA, the baseline is also running on XLA, so we need to explicitly sync tensors for it.
@wconstab I added a sync when we enter the optimized function.
But I think we may not need a mandatory sync when we leave the function. Leaving it as is has the advantage that we potentially accumulate a larger graph. Let me know if there are other concerns.
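For concreteness, here is a minimal sketch of the sync-on-entry idea being discussed. The wrapper name and structure are hypothetical; `xm.mark_step()` is the real torch_xla API:

```python
import torch_xla.core.xla_model as xm

def with_entry_sync(run_cached_graph):
    """Hypothetical wrapper sketching the sync-on-entry idea from this thread."""
    def optimized_fn(*xla_args):
        # Materialize any lazy ops still pending on the inputs (e.g. on-device
        # input randomization), so they are not folded into the compiled call.
        xm.mark_step()
        return run_cached_graph(*xla_args)
    return optimized_fn
```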
Ahh.. when we exit the optimized function, isn't the last thing we do to invoke an API that executes the cached XLA graph? That API may be equivalent to mark_step in all the ways that matter. All of the tensors we return from the optimized function should be
- XLATensors
- backed by device data, not an IR value

so any operations performed in eager land after one of our optimized functions returns should be starting fresh graphs. If you returned an XLATensor backed by an IR value from the optimized function, I don't think the XLA runtime would behave the way you expect if you trace additional IR onto the graph and then execute it.
At least for the first design, I think it is best to break the XLA graph on entry/exit of the dynamo function.
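One way to sanity-check the "backed by device data, not IR" property is torch_xla's IR dump helper. This is a debugging sketch only; after a mark_step the dump for the returned tensors should show device data rather than pending ops:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

t = torch.ones(2, 2, device=xm.xla_device()) + 1
# Pending lazy IR: the dump shows the add that has not executed yet.
print(torch_xla._XLAC._get_xla_tensors_text([t]))
xm.mark_step()
# After the step, t is backed by device data, so subsequent eager ops
# on it start a fresh graph.
print(torch_xla._XLAC._get_xla_tensors_text([t]))
```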
mark_step is not synchronous by default; there is an arg you can use to toggle the wait behavior. You are right that we actually need to wait for the previous computation to finish. In lazy tensor code we achieved that by trying to grab the device context's lock. We are currently not doing that, so let me draft a quick PR to do something similar to https://github.com/pytorch/xla/blob/master/torch_xla/csrc/tensor.cpp#L1035 in our _run_cached_graph API.
Instead of making mark_step synchronous here, I think it is better to make _run_cached_graph grab the device lock. This way we can overlap the dynamo bridge's input preparation for _run_cached_graph with the device execution.
Also, I think there is a point in the dynamo benchmark where cuda.sync happens before/after measurements for this same reason.
Yep, I think I know which code you mean. But one difference here is that for the cuda.sync we want its execution time to be included in the benchmark execution time, while for this mark_step we want to exclude it.
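A sketch of the timing pattern being described, assuming `xm.wait_device_ops` (a real torch_xla helper) is used to drain the device; `optimized_fn` and `example_inputs` are hypothetical stand-ins for the compiled call under test:

```python
import time
import torch_xla.core.xla_model as xm

xm.mark_step()        # flush any pending trace; excluded from the measurement
xm.wait_device_ops()  # wait until the XLA device is idle before starting the clock
start = time.perf_counter()
result = optimized_fn(*example_inputs)  # hypothetical compiled call under test
xm.mark_step()
xm.wait_device_ops()  # included in the timing, analogous to torch.cuda.synchronize()
elapsed = time.perf_counter() - start
```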
Now that I think about it a bit more, there are two cases:
- When we trace the graph for the first time. This one is actually fine, because we trace using the lazy tensor path and the lock should already be taken care of there.
- When we run the cached graph. Since we already run torch_xla._XLAC._xla_sync_multi(args, []), we know that the inputs to the cached graph are materialized. However, there is a rare chance that there is still some ongoing device execution on the XLA device. We cannot execute more than one graph at any given time, so _run_cached_graph should still try to grab the device lock before proceeding to execution.
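A minimal sketch of the cached-graph path being discussed. The `_xla_sync_multi` call is the one quoted above; the `_run_cached_graph` signature shown here is an assumption based on how this thread describes the bridge:

```python
import torch_xla

def run_cached(graph_hash, xla_args):
    # Materialize the inputs first, so the cached computation starts from
    # tensors backed by device data rather than pending IR.
    torch_xla._XLAC._xla_sync_multi(xla_args, [])
    # Assumed signature: execute the previously compiled graph. Per the
    # discussion, this call should also grab the device lock internally,
    # since at most one graph can execute on the device at a time.
    return torch_xla._XLAC._run_cached_graph(graph_hash, xla_args)
```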
Force-pushed from 288c1e7 to ea1eb96
I may address comments on this PR a bit slowly since I want to make sure training works based on this PR first. Sorry for potential delays.
Force-pushed from ea1eb96 to d1657fa
Force-pushed from d1657fa to 6eb9f9b
wconstab left a comment
Minor comments, but you're fixing them. Stamp to unblock. Thanks! LGTM
JackCaoG left a comment
Thanks! Working on the XLA-side test fix.
Force-pushed from 6eb9f9b to 1419f6e
Sometimes it's really convenient to run simple models through the torchbench.py script rather than those from pytorch/benchmark. This PR adds the ability to run any model from a specified path by overloading the --only argument. This PR is split out from #88904.

Here is the usage. Specify the path and class name of the model in a format like:

--only=path:<MODEL_FILE_PATH>,class:<CLASS_NAME>

Because dynamo changes the current working directory, the path should be an absolute path. The class should have a method get_example_inputs that returns the inputs for the model. An example looks like:

```
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)

    def forward(self, x):
        return self.linear(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```

Test command:

```
# python benchmarks/dynamo/torchbench.py --performance --only=path:/pytorch/myscripts/model_collection.py,class:LinearModel --backend=eager
WARNING:common:torch.cuda.is_available() == False, using CPU
cpu  eval  LinearModel  0.824x p=0.00
```

Content of model_collection.py:

```
from torch import nn
import torch

class LinearModel(nn.Module):
    """
    AotAutogradStrategy.compile_fn ignores graphs with at most 1 call node.
    Make sure this model calls 2 linear layers to avoid being skipped.
    """
    def __init__(self, nlayer=2):
        super().__init__()
        layers = []
        for _ in range(nlayer):
            layers.append(nn.Linear(10, 10))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```

Pull Request resolved: #89028
Approved by: https://github.com/jansel
Hey Shunting, can you update https://github.com/pytorch/pytorch/blob/master/.github/ci_commit_pins/xla.txt to
Force-pushed from 1419f6e to e7e7e92
Force-pushed from e7e7e92 to d91a563
Working with Jack on #4205 to resolve the CI failure for some XLA tests. Will commit once we resolve the test failures.
Force-pushed from d91a563 to 371e50f
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
In #87741 we added the inference support for dynamo/torchxla integration. Later on in #88449 we attempted to add the training support. That attempt was not smooth because:
- we tried 2 things together: 1. let dynamo trace the model on xla rather than eager, 2. enable training
- it turns out neither of these two tasks is trivial enough.

Furthermore, item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd. AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to the xla device. That hurts performance a lot. Having a cache mapping eager parameters to XLA parameters does not solve the problem, since an update to either one will not sync automatically to the other; they easily go out of sync.

This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step for enabling training.

Also, tracing on XLA makes the data movement more efficient. We see a 1.5x geomean speedup compared to the previous 1.38x:

```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |              1.38  |                   1.008 |
+-------------------------+--------------------+-------------------------+
| resnet50                |              1.227 |                   0.998 |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |              1.544 |                   1.008 |
+-------------------------+--------------------+-------------------------+
| alexnet                 |              1.085 |                   1.045 |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |              2.028 |                   1.013 |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |              1.516 |                   0.995 |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |              0.868 |                   1.01  |
+-------------------------+--------------------+-------------------------+
| vgg16                   |              1.099 |                   1.008 |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |              3.26  |                   1.027 |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |              2.182 |                   1.015 |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+
```

Example command

```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```
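For illustration, a minimal Python sketch of the trace-on-XLA usage, assuming torchvision is installed and the torchxla_trace_once backend from this integration is registered:

```python
import torch
import torch._dynamo as dynamo
import torch_xla.core.xla_model as xm
import torchvision

# Trace on XLA: the model and inputs live on the XLA device up front, so the
# parameters/buffers that AOTAutograd lifts as graph inputs are already
# XLA tensors and no per-call eager->XLA copies are needed.
device = xm.xla_device()
model = torchvision.models.resnet18().to(device).eval()
x = torch.randn(8, 3, 224, 224, device=device)

opt_model = dynamo.optimize("torchxla_trace_once")(model)
with torch.no_grad():
    out = opt_model(x)
xm.mark_step()  # materialize the result
```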
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire