dynamo/torchxla integration: trace on xla rather than eager #88904

Closed
shunting314 wants to merge 1 commit into master from dynamo-tracing-on-xla

Conversation

@shunting314
Contributor

@shunting314 shunting314 commented Nov 11, 2022

In #87741 we added inference support for the dynamo/torchxla integration. Later, in #88449, we attempted to add training support. That attempt was not smooth because

  • we tried two things together:
    1. let dynamo trace the model on XLA rather than eager
    2. enable training
  • it turns out neither of these two tasks is trivial.

Furthermore, item 2 (enable training) depends on item 1 (tracing on XLA). We enable training via AOTAutograd, and AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to the XLA device on every call, which hurts performance a lot. Keeping a cache that maps each eager parameter to its XLA counterpart does not solve the problem either, since an update to one does not automatically sync to the other; the two copies easily go out of sync.
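To illustrate the data-movement difference, here is a minimal sketch (not this PR's actual code; graph_fn stands in for an AOTAutograd-produced graph whose lifted parameters are passed as ordinary inputs):

```
import torch
import torch_xla.core.xla_model as xm

def call_traced_on_eager(graph_fn, params, args):
    # Parameters live on the eager device, so every call pays a
    # host-to-XLA transfer for each lifted parameter/buffer.
    dev = xm.xla_device()
    xla_inputs = [t.to(dev) for t in (*params, *args)]
    return graph_fn(*xla_inputs)

def call_traced_on_xla(graph_fn, xla_params, args):
    # Parameters were moved to the XLA device once, up front;
    # only the real model inputs move per call.
    dev = xm.xla_device()
    xla_args = [t.to(dev) for t in args]
    return graph_fn(*xla_params, *xla_args)
```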

This PR lets dynamo trace the model on XLA rather than eager. It is a preparation step for enabling training.

Also, tracing on XLA makes data movement more efficient: the geomean speedup improves from the previous 1.38x to 1.50x.

+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |  XLA (trace every time) |
+=========================+====================+=========================+
| resnet18                |            1.38    |                 1.008   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.227   |                 0.998   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.544   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.085   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            2.028   |                 1.013   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.516   |                 0.995   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            0.868   |                 1.01    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.099   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            3.26    |                 1.027   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            2.182   |                 1.015   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+

Example command

```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

@pytorch-bot

pytorch-bot Bot commented Nov 11, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88904

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 371e50f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment thread torch/_dynamo/optimizations/torchxla_integration.py Outdated
Comment thread benchmarks/dynamo/common.py Outdated
@shunting314 shunting314 marked this pull request as ready for review November 11, 2022 20:53
Comment thread benchmarks/dynamo/common.py Outdated
Contributor

@wconstab wconstab Nov 11, 2022


I think we should be doing the mark_step on the input (and output) boundary of the dynamo-xla backend, to ensure our graph tracing starts on fresh XLATensors, not XLATensors that have already accumulated some trace. I think you're also working on that change in parallel. Can we just get that working first and then not need this?
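A sketch of what breaking the graph at the backend boundary could look like (hypothetical wrapper, not the integration's actual code):

```
import torch_xla.core.xla_model as xm

def with_graph_breaks(compiled_fn):
    def wrapper(*args):
        xm.mark_step()            # start from fresh XLATensors on entry
        out = compiled_fn(*args)
        xm.mark_step()            # cut the trace again on exit
        return out
    return wrapper
```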

Contributor Author


@wconstab hmm, I think we still need it.

E.g., if we skip this mark_step call and instead do it inside the optimized function, the optimized function will incur the overhead of computing the randomized inputs, while the baseline we compare against does not have this overhead.

The comparison would then be unfair.

Contributor Author


We add mark_step or sync calls in a few other places for the baseline as well. Since we trace on XLA, the baseline also runs on XLA, so we need to explicitly sync its tensors.
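For illustration, a minimal sketch of the kind of sync needed around a measurement (the timed helper is hypothetical, not the benchmark's actual code):

```
import time
import torch_xla.core.xla_model as xm

def timed(fn, *args):
    xm.mark_step()        # flush any trace accumulated while preparing inputs
    xm.wait_device_ops()  # let earlier device work finish before starting the clock
    t0 = time.perf_counter()
    out = fn(*args)
    xm.mark_step()        # materialize the outputs
    xm.wait_device_ops()  # count the device execution in the measured time
    return out, time.perf_counter() - t0
```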

Contributor Author


@wconstab I added a sync when we enter the optimized function.

But I think we may not need a mandatory sync when we leave the function. Leaving it as is has the advantage that we can potentially accumulate a larger graph. Let me know if there are other concerns.

Contributor


Ahh.. when we exit the optimized function, isn't the last thing we do invoke an API that executes the cached XLA graph? That API may be equivalent to mark_step in all the ways that matter. All of the tensors we return from the optimized function should be

  • XLATensors
  • backed by DeviceData, not IR values

so any operations performed in eager land after one of our optimized functions returns should be starting fresh graphs. If you returned an XLATensor backed by an IR value from the optimized function, I don't think the XLA runtime would behave the way you expect if you then trace additional IR onto the graph and execute it.

At least for the first design, I think it is best to break the XLA graph on entry/exit of the dynamo function.
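As a side note, one way to see the difference between an IR-backed and a DeviceData-backed tensor is to dump the pending IR (_get_xla_tensors_text is an internal torch_xla API, shown here only for illustration):

```
import torch
import torch_xla
import torch_xla.core.xla_model as xm

t = torch.ones(2, 2, device=xm.xla_device()) + 1   # t is backed by pending IR
print(torch_xla._XLAC._get_xla_tensors_text([t]))  # dumps the un-executed graph
xm.mark_step()  # executes the graph; t is now backed by DeviceData
```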

Collaborator


mark_step is not synchronous by default; there is an arg you can toggle for the wait behavior. You are right that we actually need to wait for the previous computation to finish. In lazy code we achieved that by trying to grab the device context's lock. We are currently not doing that, so let me draft a quick PR to do something similar to https://github.com/pytorch/xla/blob/master/torch_xla/csrc/tensor.cpp#L1035 in our _run_cached_graph API.

Collaborator


Instead of making mark_step synchronous here, I think it is better to make _run_cached_graph grab the device lock. This way we can overlap the dynamo bridge's input preparation for _run_cached_graph with the device execution.

Contributor Author


> also, i think there is a point in the dynamo benchmark where cuda.sync happens before/after measurements for this same reason.

Yep, I think I know which code you mean. But one difference here is that for the cuda.sync we want its execution time included in the benchmark's measured time, while for this mark_step we want to exclude it.

Collaborator

@JackCaoG JackCaoG Nov 11, 2022


Now that I think about it a bit more, there are two cases:

  1. The first time we trace the graph. This one is actually fine, because we trace using the lazy tensor path and the locking should already be taken care of.
  2. When we run the cached graph. Since we already run torch_xla._XLAC._xla_sync_multi(args, []), we know that the inputs to the cached graph are materialized. However, there is a small chance that some device execution is still ongoing on the XLA device. We cannot execute more than one graph at any given time, so _run_cached_graph should still try to grab the device lock before proceeding to execution.
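A rough sketch of the locking discipline in case 2 (names are hypothetical; the real logic lives in torch_xla's C++ layer):

```
import threading

_device_lock = threading.Lock()  # the device runs at most one graph at a time

def run_cached_graph(cached_graph, inputs):
    # Inputs were already materialized via _xla_sync_multi(inputs, []).
    with _device_lock:  # wait out any still-ongoing device execution
        return cached_graph.execute(inputs)
```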


@shunting314 shunting314 force-pushed the dynamo-tracing-on-xla branch 2 times, most recently from 288c1e7 to ea1eb96 on November 11, 2022 21:52
Comment thread benchmarks/dynamo/common.py
Comment thread torch/_dynamo/optimizations/torchxla_integration.py
@shunting314
Contributor Author

I may address comments on this PR a bit slowly, since I want to make sure training works on top of this PR first. Sorry for potential delays.

Comment thread benchmarks/dynamo/common.py
Comment thread benchmarks/dynamo/common.py
Comment thread benchmarks/dynamo/common.py
Comment thread torch/_dynamo/optimizations/torchxla_integration.py Outdated
Comment thread torch/_dynamo/optimizations/torchxla_integration.py Outdated
Comment thread torch/_dynamo/optimizations/torchxla_integration.py Outdated
Comment thread torch/_dynamo/optimizations/torchxla_integration.py
Comment thread torch/_dynamo/optimizations/torchxla_integration.py
Contributor

@wconstab wconstab left a comment


Minor comments, but you're fixing them. Stamping to unblock. Thanks! LGTM

Collaborator

@JackCaoG JackCaoG left a comment


Thanks! Working on the xla side test fix.

pytorchmergebot pushed a commit that referenced this pull request Nov 16, 2022
Sometimes it's really convenient to run simple models through the torchbench.py script rather than those from pytorch/benchmark. This PR adds the ability to run any model from a specified path by overloading the --only argument.

This PR is split out from #88904

Here is the usage:

        Specify the path and class name of the model in a format like:
        --only=path:<MODEL_FILE_PATH>,class:<CLASS_NAME>

        Because dynamo changes the current working directory,
        the path should be an absolute path.

        The class should have a method get_example_inputs that returns the inputs
        for the model. An example looks like
        ```
        import torch
        from torch import nn

        class LinearModel(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(10, 10)

            def forward(self, x):
                return self.linear(x)

            def get_example_inputs(self):
                return (torch.randn(2, 10),)
        ```

Test command:
```
# python benchmarks/dynamo/torchbench.py --performance --only=path:/pytorch/myscripts/model_collection.py,class:LinearModel --backend=eager
WARNING:common:torch.cuda.is_available() == False, using CPU
cpu  eval  LinearModel                        0.824x p=0.00
```

Content of model_collection.py
```
from torch import nn
import torch

class LinearModel(nn.Module):
    """
    AotAutogradStrategy.compile_fn ignores graphs with at most 1 call node.
    Make sure this model calls 2 linear layers to avoid being skipped.
    """
    def __init__(self, nlayer=2):
        super().__init__()
        layers = []
        for _ in range(nlayer):
            layers.append(nn.Linear(10, 10))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```

Pull Request resolved: #89028
Approved by: https://github.com/jansel
@JackCaoG
Collaborator

Hey Shunting, can you update https://github.com/pytorch/pytorch/blob/master/.github/ci_commit_pins/xla.txt to e9434c7504ceae4d6c317936a97932c426a582d9, which is the commit hash for https://github.com/pytorch/xla/pull/4205/commits? More details in https://github.com/pytorch/xla/tree/master/.circleci#coodinating-merges-for-breaking-pytorch-prs

@shunting314 shunting314 force-pushed the dynamo-tracing-on-xla branch from 1419f6e to e7e7e92 on November 16, 2022 00:41
@shunting314 shunting314 requested a review from a team as a code owner November 16, 2022 00:41
@shunting314 shunting314 force-pushed the dynamo-tracing-on-xla branch from e7e7e92 to d91a563 on November 16, 2022 00:45
@shunting314
Contributor Author

Working with Jack on #4205 to resolve the CI failure for some xla tests. Will commit once we resolve the test failures.

@shunting314 shunting314 force-pushed the dynamo-tracing-on-xla branch from d91a563 to 371e50f on November 21, 2022 21:36
@shunting314
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 22, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
dynamo/torchxla integration: trace on xla rather than eager (pytorch#88904)

Pull Request resolved: pytorch#88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
@github-actions github-actions Bot deleted the dynamo-tracing-on-xla branch May 30, 2024 01:54
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
