
[nnc] Per-operator benchmarks#51093

Closed
bertmaher wants to merge 3 commits into gh/bertmaher/56/base from gh/bertmaher/56/head

Conversation

@bertmaher
Contributor

@bertmaher bertmaher commented Jan 26, 2021

Stack from ghstack:

Operator-level benchmarks comparing eager-mode PyTorch to NNC-generated fused
kernels. We wouldn't normally see these ops in isolation, but the comparison
points out where NNC is falling short (or doing well).

I threw in a composed hardswish for fun, because it's my favorite activation
function.

Notably, it exposes a bug in our build process that's preventing vectorization
from using `sleef`, so we're using scalar calls to libm with predictably lousy
performance.  Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but
will include the overhead of dispatching the fused kernel through TorchScript.

Differential Revision: [D26069791](https://our.internmc.facebook.com/intern/diff/D26069791/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26069791/)!

[ghstack-poisoned]
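For reference, the composed hardswish mentioned above is `x * relu6(x + 3) / 6`; a minimal pure-Python sketch of just the math (the benchmark itself composes torch ops so the fuser can see and fuse them):

```python
def hardswish(x: float) -> float:
    """Composed hardswish: x * relu6(x + 3) / 6, with relu6(y) = min(max(y, 0), 6)."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0

# Saturates to 0 below -3 and becomes the identity above +3.
assert hardswish(-4.0) == 0.0
assert hardswish(4.0) == 4.0
assert abs(hardswish(1.0) - 4.0 / 6.0) < 1e-12
```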
@facebook-github-bot
Contributor

facebook-github-bot commented Jan 26, 2021

💊 CI failures summary and remediations

As of commit 18d8ecd (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Run tests"

```
Jan 26 22:03:24 AssertionError: mypy failed: torch/fx/experimental/accelerated_graph_module.py:4: error: Cannot find implementation or library stub for module named 'glow.fb.fx_glow_binding.fx_glow' [import]
Jan 26 22:02:11 ----------------------------------------------------------------------
Jan 26 22:02:18   test_doc_examples (__main__.TestTypeHints) ... ok (7.444s)
Jan 26 22:03:24   test_run_mypy (__main__.TestTypeHints) ... FAIL (65.976s)
Jan 26 22:03:24
Jan 26 22:03:24 ======================================================================
Jan 26 22:03:24 FAIL [65.976s]: test_run_mypy (__main__.TestTypeHints) [mypy.ini]
Jan 26 22:03:24 ----------------------------------------------------------------------
Jan 26 22:03:24 Traceback (most recent call last):
Jan 26 22:03:24   File "test_type_hints.py", line 171, in test_run_mypy
Jan 26 22:03:24     self.fail(f"mypy failed: {stdout} {stderr}")
Jan 26 22:03:24 AssertionError: mypy failed: torch/fx/experimental/accelerated_graph_module.py:4: error: Cannot find implementation or library stub for module named 'glow.fb.fx_glow_binding.fx_glow'  [import]
Jan 26 22:03:24 torch/fx/experimental/accelerated_graph_module.py:4: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports
Jan 26 22:03:24 torch/fx/experimental/accelerated_graph_module.py:4: error: Cannot find implementation or library stub for module named 'glow'  [import]
Jan 26 22:03:24 torch/fx/experimental/accelerated_graph_module.py:4: error: Cannot find implementation or library stub for module named 'glow.fb'  [import]
Jan 26 22:03:24 torch/fx/experimental/accelerated_graph_module.py:4: error: Cannot find implementation or library stub for module named 'glow.fb.fx_glow_binding'  [import]
Jan 26 22:03:24 Found 4 errors in 1 file (checked 1212 source files)
Jan 26 22:03:24 ----------------------------------------------------------------------
Jan 26 22:03:24 Ran 2 tests in 73.421s
```
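For what it's worth, the usual remediation for this class of mypy failure (assuming the `glow` bindings are internal-only and intentionally unstubbed, which this sketch does not confirm) is a per-package override in `mypy.ini`:

```
[mypy-glow.*]
ignore_missing_imports = True
```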


bertmaher added a commit that referenced this pull request Jan 26, 2021
ghstack-source-id: 120375140
Pull Request resolved: #51093
@bertmaher
Contributor Author

Results:

```
op                        eager        nnc    speedup
hardswish                 0.187      0.051       3.70
hardswish                 0.052      0.052       1.00
sigmoid                   0.148      1.177       0.13
reciprocal                0.049      0.050       0.98
neg                       0.038      0.037       1.02
relu                      0.037      0.036       1.03
isnan                     0.119      0.020       5.86
log                       0.082      1.330       0.06
log10                     0.148      1.848       0.08
log1p                     0.204      1.413       0.14
log2                      0.285      1.167       0.24
exp                       0.063      1.123       0.06
expm1                     0.402      1.417       0.28
erf                       0.167      0.852       0.20
erfc                      0.181      1.098       0.16
cos                       0.124      0.793       0.16
sin                       0.126      0.838       0.15
tan                       0.285      1.777       0.16
acos                      0.144      1.358       0.11
asin                      0.126      1.193       0.11
cosh                      0.384      1.761       0.22
sinh                      0.390      2.279       0.17
atan                      0.240      1.564       0.15
tanh                      0.320      2.259       0.14
sqrt                      0.043      0.069       0.63
rsqrt                     0.118      0.117       1.01
abs                       0.038      0.037       1.03
ceil                      0.038      0.038       1.01
floor                     0.039      0.039       1.00
round                     0.039      0.292       0.13
trunc                     0.040      0.036       1.12
lgamma                    2.045      2.721       0.75
```
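The speedup column is just eager time over NNC time (the printed timings are rounded, so recomputing a ratio from them can differ in the last digit); as a sketch:

```python
def speedup(eager: float, nnc: float) -> float:
    # > 1.0 means the NNC-fused kernel beat eager; < 1.0 means a slowdown.
    return eager / nnc

# sigmoid row: eager 0.148 vs nnc 1.177 -> ~0.13x, i.e. the scalar-libm
# fallback described in the summary makes the fused kernel ~8x slower here.
assert round(speedup(0.148, 1.177), 2) == 0.13
```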

Comment thread on `benchmarks/cpp/tensorexpr/bench_ops.py` (outdated):
```python
traced = torch.jit.trace(lambda x: op(x), (x))
[traced(x) for _ in range(2)]
torch.testing.assert_allclose(op(x), traced(x))
teager = timeit.timeit(stmt="op(x)", globals=globals(), number=100)
```
Contributor
Minor: maybe a warmup to be more faithful?

Also, disable multi-threading in PyTorch for the benchmark with `torch.set_num_threads(1)`?
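The suggested pattern, sketched with only the stdlib and a stand-in function (the real harness times torch ops and would also call `torch.set_num_threads(1)` before measuring):

```python
import timeit

def bench(fn, arg, warmup=10, number=100):
    # Warm-up runs absorb one-time costs (JIT compilation, cache and
    # allocator warmup) so they don't pollute the measured loop.
    for _ in range(warmup):
        fn(arg)
    return timeit.timeit(lambda: fn(arg), number=number)

elapsed = bench(lambda x: x * x, 3.0)
assert elapsed > 0.0
```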

Contributor Author

Heh, I always nuke threading from orbit with `numactl` or `taskset`, but yeah, good idea.
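On Linux, the same CPU pinning can also be done from inside the process via the stdlib (a sketch; unlike `numactl`, this controls only CPU affinity, not NUMA memory placement):

```python
import os

# Restrict this process to CPU 0, mirroring `taskset -c 0 python bench.py`.
# os.sched_setaffinity is Linux-only; pid 0 means the calling process.
os.sched_setaffinity(0, {0})
assert os.sched_getaffinity(0) == {0}
```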

Contributor Author

Oh, and warmup is the `[traced(x) for _ in range(2)]` above, but it's easy to overlook :-p

@codecov

codecov Bot commented Jan 26, 2021

Codecov Report

Merging #51093 (b9222b9) into gh/bertmaher/56/base (250c711) will decrease coverage by 0.00%.
The diff coverage is n/a.

```
@@                   Coverage Diff                    @@
##           gh/bertmaher/56/base   #51093      +/-   ##
========================================================
- Coverage                 80.92%   80.92%   -0.01%
========================================================
  Files                      1926     1926
  Lines                    210104   210104
========================================================
- Hits                     170021   170020       -1
- Misses                    40083    40084       +1
```

@facebook-github-bot
Contributor

This pull request has been merged in c402944.

@facebook-github-bot facebook-github-bot deleted the gh/bertmaher/56/head branch January 30, 2021 15:21
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#51093

ghstack-source-id: 120403675

Test Plan: the per-operator benchmark results shown above.

Reviewed By: zheng-xq

Differential Revision: D26069791

fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba
