dynamo/torchxla integration: trace on xla rather than eager #88904
shunting314 wants to merge 1 commit into master
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88904. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 371e50f. This comment was automatically generated by Dr. CI and updates every 15 minutes.
I think we should be doing the mark_step on the input (and output) boundary of the dynamo-xla backend, in order to ensure our graph tracing starts from fresh XLATensors, not XLATensors that have already accumulated some trace. I think you're also working on that change in parallel. Can we just get that working first and then not need this?
@wconstab hmm, I think we still need it.
E.g., if we skip this mark_step call and do it inside the optimized function, then the optimized function will incur the overhead of computing the randomized inputs. The other call we compare against does not have this overhead.
That would make the comparison unfair.
We add a mark_step or sync call in a few other places for the baseline. Since we trace on XLA, the baseline is also running on XLA, so we need to explicitly sync tensors for it.
@wconstab I added a sync when we enter the optimized function.
But I think we may not need a mandatory sync when we leave the function. Leaving it as is has the advantage that we potentially accumulate a larger graph. Let me know if there are other concerns.
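For concreteness, here is a minimal sketch of the sync-on-entry idea being discussed. The wrapper name and structure are hypothetical; `xm.mark_step()` is the real torch_xla API:

```python
import torch_xla.core.xla_model as xm

def with_entry_sync(run_cached_graph):
    """Hypothetical wrapper sketching the sync-on-entry idea from this thread."""
    def optimized_fn(*xla_args):
        # Materialize any lazy ops still pending on the inputs (e.g. on-device
        # input randomization), so they are not folded into the compiled call.
        xm.mark_step()
        return run_cached_graph(*xla_args)
    return optimized_fn
```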
Ahh.. when we exit the optimized function, isn't the last thing we do to invoke an API that executes the cached XLA graph? That API may be equivalent to mark_step in all the ways that matter. All of the tensors we return from the optimized function should be
- XLATensors
- backed by device data, not an IR value

so any operations performed in eager land after one of our optimized functions returns should be starting fresh graphs. If you returned an XLATensor backed by an IR value from the optimized function, I don't think the XLA runtime would behave the way you expect if you trace additional IR onto the graph and then execute it.
At least for the first design, I think it is best to break the XLA graph on entry/exit of the dynamo function.
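One way to sanity-check the "backed by device data, not IR" property is torch_xla's IR dump helper. This is a debugging sketch only; after a mark_step the dump for the returned tensors should show device data rather than pending ops:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

t = torch.ones(2, 2, device=xm.xla_device()) + 1
# Pending lazy IR: the dump shows the add that has not executed yet.
print(torch_xla._XLAC._get_xla_tensors_text([t]))
xm.mark_step()
# After the step, t is backed by device data, so subsequent eager ops
# on it start a fresh graph.
print(torch_xla._XLAC._get_xla_tensors_text([t]))
```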
mark_step is not synchronous by default; there is an arg you can use to toggle the wait behavior. You are right that we actually need to wait for the previous computation to finish. In lazy tensor code we achieved that by trying to grab the device context's lock. We are currently not doing that, so let me draft a quick PR to do something similar to https://github.com/pytorch/xla/blob/master/torch_xla/csrc/tensor.cpp#L1035 in our _run_cached_graph API.
Instead of making mark_step synchronous here, I think it is better to make _run_cached_graph grab the device lock. This way we can overlap the dynamo bridge's input preparation for _run_cached_graph with the device execution.
Also, I think there is a point in the dynamo benchmark where cuda.sync happens before/after measurements for this same reason.
Yep, I think I know which code you mean. But one difference here is that for the cuda.sync we want its execution time to be included in the benchmark execution time, while for this mark_step we want to exclude it.
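A sketch of the timing pattern being described, assuming `xm.wait_device_ops` (a real torch_xla helper) is used to drain the device; `optimized_fn` and `example_inputs` are hypothetical stand-ins for the compiled call under test:

```python
import time
import torch_xla.core.xla_model as xm

xm.mark_step()        # flush any pending trace; excluded from the measurement
xm.wait_device_ops()  # wait until the XLA device is idle before starting the clock
start = time.perf_counter()
result = optimized_fn(*example_inputs)  # hypothetical compiled call under test
xm.mark_step()
xm.wait_device_ops()  # included in the timing, analogous to torch.cuda.synchronize()
elapsed = time.perf_counter() - start
```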
Now that I think about it a bit more, there are two cases:
- When we trace the graph for the first time. This one is actually fine, because we trace using the lazy tensor path and the lock should already be taken care of there.
- When we run the cached graph. Since we already run torch_xla._XLAC._xla_sync_multi(args, []), we know that the inputs to the cached graph are materialized. However, there is a rare chance that there is still some ongoing device execution on the XLA device. We cannot execute more than one graph at any given time, so _run_cached_graph should still try to grab the device lock before proceeding to execution.
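A minimal sketch of the cached-graph path being discussed. The `_xla_sync_multi` call is the one quoted above; the `_run_cached_graph` signature shown here is an assumption based on how this thread describes the bridge:

```python
import torch_xla

def run_cached(graph_hash, xla_args):
    # Materialize the inputs first, so the cached computation starts from
    # tensors backed by device data rather than pending IR.
    torch_xla._XLAC._xla_sync_multi(xla_args, [])
    # Assumed signature: execute the previously compiled graph. Per the
    # discussion, this call should also grab the device lock internally,
    # since at most one graph can execute on the device at a time.
    return torch_xla._XLAC._run_cached_graph(graph_hash, xla_args)
```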
Force-pushed from 288c1e7 to ea1eb96
I may address comments on this PR a bit slowly since I want to make sure training works based on this PR first. Sorry for potential delays.
Force-pushed from ea1eb96 to d1657fa
Force-pushed from d1657fa to 6eb9f9b
wconstab left a comment
Minor comments, but you're fixing them. Stamp to unblock. Thanks! LGTM
JackCaoG left a comment
Thanks! Working on the XLA-side test fix.
Force-pushed from 6eb9f9b to 1419f6e
Sometimes it's really convenient to run simple models through the torchbench.py script rather than those from pytorch/benchmark. This PR adds the ability to run any model from a specified path by overloading the --only argument. This PR is split out from #88904.

Here is the usage. Specify the path and class name of the model in a format like:

--only=path:<MODEL_FILE_PATH>,class:<CLASS_NAME>

Because dynamo changes the current working directory, the path should be an absolute path. The class should have a method get_example_inputs that returns the inputs for the model. An example looks like:

```
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)

    def forward(self, x):
        return self.linear(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```

Test command:

```
# python benchmarks/dynamo/torchbench.py --performance --only=path:/pytorch/myscripts/model_collection.py,class:LinearModel --backend=eager
WARNING:common:torch.cuda.is_available() == False, using CPU
cpu  eval  LinearModel  0.824x p=0.00
```

Content of model_collection.py:

```
from torch import nn
import torch

class LinearModel(nn.Module):
    """
    AotAutogradStrategy.compile_fn ignores graphs with at most 1 call node.
    Make sure this model calls 2 linear layers to avoid being skipped.
    """
    def __init__(self, nlayer=2):
        super().__init__()
        layers = []
        for _ in range(nlayer):
            layers.append(nn.Linear(10, 10))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```

Pull Request resolved: #89028
Approved by: https://github.com/jansel
Hey Shunting, can you update https://github.com/pytorch/pytorch/blob/master/.github/ci_commit_pins/xla.txt to
Force-pushed from 1419f6e to e7e7e92
Force-pushed from e7e7e92 to d91a563
Working with Jack on #4205 to resolve the CI failure for some XLA tests. Will commit once we resolve the test failures.
Force-pushed from d91a563 to 371e50f
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
In #87741 we added the inference support for dynamo/torchxla integration. Later on in #88449 we attempted to add the training support. That attempt was not smooth because:
- we tried 2 things together: 1. let dynamo trace the model on xla rather than eager, 2. enable training
- it turns out neither of these two tasks is trivial enough.

Furthermore, item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd. AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to the xla device. That hurts performance a lot. Having a cache mapping eager parameters to XLA parameters does not solve the problem, since an update to either one will not sync automatically to the other; they easily go out of sync.

This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step for enabling training.

Also, tracing on XLA makes the data movement more efficient. We see a 1.5x geomean speedup compared to the previous 1.38x:

```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |              1.38  |                   1.008 |
+-------------------------+--------------------+-------------------------+
| resnet50                |              1.227 |                   0.998 |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |              1.544 |                   1.008 |
+-------------------------+--------------------+-------------------------+
| alexnet                 |              1.085 |                   1.045 |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |              2.028 |                   1.013 |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |              1.516 |                   0.995 |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |              0.868 |                   1.01  |
+-------------------------+--------------------+-------------------------+
| vgg16                   |              1.099 |                   1.008 |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |              3.26  |                   1.027 |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |              2.182 |                   1.015 |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+
```

Example command

```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```
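For illustration, a minimal Python sketch of the trace-on-XLA usage, assuming torchvision is installed and the torchxla_trace_once backend from this integration is registered:

```python
import torch
import torch._dynamo as dynamo
import torch_xla.core.xla_model as xm
import torchvision

# Trace on XLA: the model and inputs live on the XLA device up front, so the
# parameters/buffers that AOTAutograd lifts as graph inputs are already
# XLA tensors and no per-call eager->XLA copies are needed.
device = xm.xla_device()
model = torchvision.models.resnet18().to(device).eval()
x = torch.randn(8, 3, 224, 224, device=device)

opt_model = dynamo.optimize("torchxla_trace_once")(model)
with torch.no_grad():
    out = opt_model(x)
xm.mark_step()  # materialize the result
```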
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire