Conversation
Collaborator
Author
Test should pass now. I want to take another pass to review some of the cpp implementation, and then I will request review.
JackCaoG
commented
Oct 25, 2022
```cpp
std::vector<at::IValue> ivalues;
std::vector<torch::lazy::Node*> roots;
for (auto& tensor : tensors) {
  auto xtensor = bridge::TryGetXlaTensor(tensor);
```
Collaborator
Author
We should check this value before appending to the list; this might segfault if the tensor is not an XlaTensor.
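The fix the review asks for can be sketched as follows. This is illustrative Python, not the real `bridge` API: `try_get_xla_tensor` here stands in for `bridge::TryGetXlaTensor`, which may return a null result (modeled as `None`) when the input is not an XLA tensor.

```python
# Hypothetical sketch: skip tensors that do not resolve to an XLA tensor
# instead of appending a null result, which is what could later segfault.
def collect_xla_tensors(tensors, try_get_xla_tensor):
    """Return only the tensors that resolve to an XLA tensor."""
    xla_tensors = []
    for tensor in tensors:
        xtensor = try_get_xla_tensor(tensor)  # may return None (nullptr in C++)
        if xtensor is None:
            continue  # silently skip non-XLA tensors rather than crash later
        xla_tensors.append(xtensor)
    return xla_tensors
```

The same shape applies in the C++ loop above: check the result of `TryGetXlaTensor` before pushing it onto `ivalues`/`roots`.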
This reverts commit 9181bed.
Force-pushed from 488b9bf to a29d6d3.
JackCaoG
commented
Oct 25, 2022
```cpp
// setup the parameters_data
std::vector<xla::ComputationClient::DataPtr> parameters_data;
auto device = torch_xla::GetCurrentDevice();
```
Collaborator
Author
This is OK for now, but I feel like it should be something passed to this function from Python. @shunting314 What do you think?
JackCaoG
commented
Oct 25, 2022
Collaborator
Author
I will fix the comments I raised, and then I think this PR is good to go. FYI @wconstab
JackCaoG
commented
Oct 25, 2022
Collaborator
Author
OK, I think this one is ready for review now.
shunting314
reviewed
Oct 26, 2022
Collaborator
shunting314
left a comment
Thank you very much for helping merge this!
LGTM, but I'll let other people on your team stamp in case they have other feedback.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request on Oct 29, 2022
# Motivation

- torchdynamo and torchxla use different strategies to achieve sound graph capture: the former relies on guards; the latter relies on retracing.
- The guard system has quite low overhead, but torchxla's tracing overhead is quite high.

The main idea is to leverage the guard system in torchdynamo to avoid retracing in torchxla, so that:

- we can integrate torchdynamo with XLA
- we reduce, or even completely avoid, the tracing overhead of torchxla

# Technique details

## XLA baseline

We found that different frameworks do not generate numerically identical results for the SAME model with the SAME input. By default, torchdynamo uses eager as the baseline, so the model runs with PyTorch. It would be tricky to compare a model running on XLA with this baseline: it's hard to check correctness. To make the comparison easier, we add a flag `--use-xla-baseline`. When it's enabled, the baseline is run on XLA.

## New dynamo backends added

We add two new dynamo backends, torchxla_trivial and torchxla_trace_once, to control the optimization targets.

torchxla_trivial simply moves inputs/model parameters to XLA and runs the model on XLA. There is tracing overhead for each run. We expect the result to be mostly neutral compared to the XLA baseline.

torchxla_trace_once only traces once, at AOT compile time. Here are the steps:

1. dynamo captures the guards and the subgraph
2. the torchxla_trace_once backend traces the graph with torchxla, lowers the graph, and records a hash of the graph for later lookup
3. at inference time, the hash is used directly to look up the optimized graph and run it

# Limitations

We cannot handle LTC/torchxla fallback right now. If an op is missing an LTC kernel, we raise an exception, which results in a dynamo fallback (or trying another compiler). People have brainstormed the idea of breaking the graph and stitching the subgraphs together, but it may be easier to add the missing LTC kernels for those models.

# Results

The models we tested are those that do not cause LTC fallback. We ran the tests on **GPU**. We see a **1.38x** geomean speedup for torchxla_trace_once, and torchxla_trivial is mostly neutral as expected.

```
| Model                   | XLA (trace once)   | XLA (trace everytime)   |
+=========================+====================+=========================+
| resnet18                | 1.346              | 1.045                   |
+-------------------------+--------------------+-------------------------+
| resnet50                | 1.153              | 1.007                   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         | 1.381              | 1.039                   |
+-------------------------+--------------------+-------------------------+
| alexnet                 | 1.045              | 1.018                   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            | 1.562              | 1.021                   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              | 1.303              | 1.069                   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           | 1.278              | 1.025                   |
+-------------------------+--------------------+-------------------------+
| vgg16                   | 1.076              | 1.008                   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            | 2.224              | 0.978                   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 1.81               | 1.025                   |
+-------------------------+--------------------+-------------------------+
| geomean                 | 1.38101            | 1.02324                 |
+-------------------------+--------------------+-------------------------+
```

The speedup is similar to what we saw in previous work on LTC's TorchScript backend (1.40 geomean speedup there): https://docs.google.com/presentation/d/1G09X8v41u_cLKLtSdf7v6R8G19-iZTPcW_VAdOnvYBI/edit#slide=id.g11bf989cb6b_1_5

# Next steps

- Use AOT autograd to enable training
- Share results on XLA devices
- Do more extensive tests on torchbench models

Example command:

```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --use-xla-baseline --only resnet18 --backend=torchxla_trace_once
```

Thanks @JackCaoG from the torchxla team for helping debug various perf issues and merging the torchxla PR! That was super critical for getting the results above. torchxla side PR: pytorch/xla#4119

topic: not user facing

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel

Pull Request resolved: #87741
Approved by: https://github.com/wconstab
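The three torchxla_trace_once steps above amount to a hash-keyed cache of compiled graphs. The following is a minimal illustrative sketch of that idea, not the real torchxla implementation: `trace_and_compile` stands in for the expensive torchxla trace plus XLA lowering, and the graph is modeled as a source string for simplicity.

```python
# Sketch of the torchxla_trace_once idea: trace/compile once per graph,
# key the compiled artifact by a hash, and on later calls skip tracing
# entirely by looking the artifact up directly.
import hashlib

_compiled_cache = {}  # graph hash -> compiled executable


def trace_and_compile(graph_src):
    """Stand-in for the expensive torchxla trace + XLA lowering step."""
    return eval(compile(graph_src, "<graph>", "eval"))


def run_graph(graph_src):
    key = hashlib.sha256(graph_src.encode()).hexdigest()
    if key not in _compiled_cache:            # AOT path: trace exactly once
        _compiled_cache[key] = trace_and_compile(graph_src)
    return _compiled_cache[key]               # steady state: hash lookup only
```

In the real integration, dynamo's guards decide whether the cached entry is still valid for the current inputs; the cache lookup replaces a full retrace on every call, which is where the speedup comes from.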
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request on Nov 5, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request on Dec 10, 2022
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request
This PR is based on @shunting314's branch at https://github.com/shunting314/torchdynamo/tree/dynamo-torchxla-integration. I am helping him add tests and land this PR in pytorch/xla.
We have tested an old version of it and were able to get inference running.
Upstream pytorch PR: pytorch/pytorch#87741