Speed up tensor.get_device(), is_cuda(), is_sparse() by avoiding dispatches #12841

Closed
zou3519 wants to merge 10 commits into pytorch:master from zou3519:get_device

Conversation

@zou3519
Contributor

@zou3519 zou3519 commented Oct 18, 2018

`tensor.get_device()` went through two dispatches: once to the native function `get_device()`, and another when `get_device` calls `_th_get_device()`. This PR avoids the dispatch by directly implementing the `get_device` function as a method on `Tensor`.

Future Work:

  • Investigate caching Device on TensorImpl. This would probably bring tensor.get_device() down to ~2 ns, but I'm not sure it's worth it.

before:

```
------------------------------------------------------------------------
Benchmark                                 Time           CPU Iterations
------------------------------------------------------------------------
BM_TensorTypeId                           0 ns          0 ns 1000000000
BM_TensorType                             8 ns          8 ns   89407911
BM_TensorIsCuda                          24 ns         24 ns   29313017
BM_TensorIsSparse                        27 ns         27 ns   26083160
BM_TensorTypeIsCuda                      11 ns         11 ns   65128120
BM_TensorNumel                           11 ns         11 ns   68314492
BM_TensorGetDevice                       71 ns         71 ns    9633125
BM_DeviceGuardCtor                      173 ns        173 ns    4067173
BM_DeviceGuard                          232 ns        232 ns    3009690
```

after:

```
------------------------------------------------------------------------
Benchmark                                 Time           CPU Iterations
------------------------------------------------------------------------
BM_TensorTypeId                           0 ns          0 ns 1000000000
BM_TensorType                            10 ns         10 ns   69803872
BM_TensorIsCuda                           2 ns          2 ns  321626683
BM_TensorIsSparse                         6 ns          6 ns  177045382
BM_TensorNumel                           12 ns         12 ns   58770533
BM_TensorGetDevice                        4 ns          4 ns  128113396
BM_DeviceGuardCtor                       52 ns         52 ns   14997278
BM_DeviceGuard                          158 ns        158 ns    5767248
```

@zou3519 zou3519 requested review from ezyang and gchanan October 18, 2018 22:06
Comment thread aten/src/ATen/core/TensorMethods.h (outdated; comments marked off-topic)

Comment thread aten/src/ATen/core/TensorMethods.h (outdated; comments marked off-topic)

Contributor

@ezyang ezyang left a comment


OK, but I do have some comments. Please read.

@zou3519
Contributor Author

zou3519 commented Oct 19, 2018

Er looks like what I did here (bypassing the dispatch) is not okay at all because it skips python bindings. Let me see if I can fix it.

@ezyang
Contributor

ezyang commented Oct 19, 2018

Hmm... I suspect all of the pre-canned functions on Tensor have hand written bindings. Not great.

@zou3519 zou3519 force-pushed the get_device branch 2 times, most recently from 4ce85f1 to 0f7ced1 on October 22, 2018 15:46
Contributor

@facebook-github-bot facebook-github-bot left a comment


zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zou3519 zou3519 changed the title from "Speed up tensor.get_device() by avoiding dispatches" to "Speed up tensor.get_device(), is_cuda(), is_sparse() by avoiding dispatches" on Oct 22, 2018
@zou3519 zou3519 force-pushed the get_device branch 2 times, most recently from c6a3290 to 92b08eb on October 22, 2018 18:19

Comment thread aten/src/ATen/core/TensorMethods.h (outdated; comments marked off-topic)

@ezyang
Contributor

ezyang commented Oct 22, 2018

@zou3519 Have you given any thought on how to prevent people from virtualizing this method again? :)

@zou3519
Contributor Author

zou3519 commented Oct 22, 2018

@ezyang Not sure. I imagine setting up a benchmark test for these functions would do the trick (virtualizing them vs. not virtualizing them is about a 10x difference), or just adding a comment to the implementation.

@ezyang
Contributor

ezyang commented Oct 23, 2018

I guess a comment will do for now.

@zou3519
Contributor Author

zou3519 commented Oct 23, 2018

https://stackoverflow.com/questions/22911112/how-to-detect-if-a-method-is-virtual suggests a way to detect if a method is virtual on gcc >= 7 and most if not all versions of clang. I'll try to see if I can do a static assert for this.

@zou3519 zou3519 mentioned this pull request Oct 24, 2018
@zou3519 zou3519 force-pushed the get_device branch 2 times, most recently from a9982ab to 47f5c2c on October 24, 2018 19:48
@zou3519
Contributor Author

zou3519 commented Oct 24, 2018

Nvm, that StackOverflow post explains how to detect if a class has virtual methods, not whether a specific method is virtual. I'll stick with the comment then -- I've moved the implementations to TensorImpl as requested.


@zou3519
Contributor Author

zou3519 commented Oct 25, 2018

Test failures look unrelated (they are all build timeouts).

When you get the chance, could you take another look at the changes, @ezyang?

Comment thread aten/src/ATen/core/TensorImpl.h (outdated; comment marked off-topic)

Comment thread aten/src/ATen/core/TensorImpl.h (outdated; comments marked off-topic)

Comment thread aten/src/ATen/core/TensorImpl.h (outdated; comment marked off-topic)

Contributor

@ezyang ezyang left a comment


This looks way better. Some minor comments.


(From a commit message:) Copied & pasted the bindings from the old generated/python_variable_methods & generated/python_torch_functions.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 26, 2018
…atches (#12841)

Summary: (same as the PR description above)
Pull Request resolved: pytorch/pytorch#12841

Differential Revision: D10489353

Pulled By: zou3519

fbshipit-source-id: a596bc77352f21d5d35433c6de02c2f65aab5f9e
facebook-github-bot pushed a commit that referenced this pull request Nov 7, 2018
Summary:
Followup to #12841

Changed these to not require type dispatch:
- tensor.type().is_cuda() -> tensor.is_cuda()
- tensor.type().is_sparse() -> tensor.is_sparse()
- isVariable(tensor.type()) -> tensor.is_variable()

This probably does not affect performance very much in most cases, but it is nice to have.
Pull Request resolved: #13590

Reviewed By: ezyang

Differential Revision: D12929301

Pulled By: zou3519

fbshipit-source-id: 8ac5c6200c579dd7a44fb4ee58fc9bb170feb1d7
@ezyang ezyang added the merged label Jun 25, 2019
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026