[nnc] Per-operator benchmarks #51093
Conversation
Operator-level benchmarks comparing eager-mode PyTorch to NNC-generated fused kernels. We wouldn't normally see these in isolation, but they point out where NNC is falling short (or doing well). I threw in a composed hardswish for fun, because it's my favorite activation function.

Notably, it exposes a bug in our build process that's preventing vectorization from using `sleef`, so we're using scalar calls to libm with predictably lousy performance. Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but will include the overhead of dispatching the fused kernel through TorchScript.

Differential Revision: [D26069791](https://our.internmc.facebook.com/intern/diff/D26069791/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26069791/)!

[ghstack-poisoned]
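For reference, a "composed hardswish" means hardswish spelled out as a chain of primitive pointwise ops, which the fuser then has to stitch into a single kernel, rather than the single fused ATen kernel. A minimal sketch of one such decomposition (the PR's exact formulation isn't quoted here, so treat this particular spelling as an assumption):

```python
import torch

def composed_hardswish(x):
    # hardswish(x) = x * relu6(x + 3) / 6, written as primitive
    # pointwise ops so the fuser has a multi-op chain to fuse.
    return x * (x + 3.0).clamp(0.0, 6.0) / 6.0

x = torch.randn(1024)
# Should agree with the single fused ATen kernel.
torch.testing.assert_allclose(composed_hardswish(x),
                              torch.nn.functional.hardswish(x))
```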
```python
traced = torch.jit.trace(lambda x: op(x), (x))
[traced(x) for _ in range(2)]
torch.testing.assert_allclose(op(x), traced(x))
teager = timeit.timeit(stmt="op(x)", globals=globals(), number=100)
```
Minor: maybe add a warmup to be more faithful?
Also, disable multi-threading in PyTorch for the benchmark with `torch.set_num_threads(1)`?
Heh, I always nuke threading from orbit with `numactl` or `taskset`, but yeah, good idea.
Oh, and the warmup is the `[traced(x) for _ in range(2)]` above, but it's easy to overlook :-p
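Folding both review suggestions into the quoted snippet, a self-contained sketch of the harness (`op` and the input size here are placeholder choices, not the PR's actual file):

```python
import timeit

import torch

# Pin PyTorch to a single thread so intra-op parallelism doesn't add
# noise to per-operator timings (per the review suggestion).
torch.set_num_threads(1)

op = torch.sigmoid        # placeholder: any pointwise operator
x = torch.randn(1 << 20)  # placeholder input size

traced = torch.jit.trace(lambda t: op(t), (x,))
for _ in range(2):
    traced(x)  # warmup: lets the JIT profile and fuse before timing
torch.testing.assert_allclose(op(x), traced(x))

teager = timeit.timeit(stmt="op(x)", globals=globals(), number=100)
tjit = timeit.timeit(stmt="traced(x)", globals=globals(), number=100)
print(f"eager {teager:.3f}s  nnc {tjit:.3f}s  speedup {teager / tjit:.2f}x")
```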
Codecov Report

```
@@                  Coverage Diff                  @@
##       gh/bertmaher/56/base    #51093      +/-   ##
======================================================
- Coverage             80.92%    80.92%     -0.01%
======================================================
  Files                  1926      1926
  Lines                210104    210104
======================================================
- Hits                 170021    170020         -1
- Misses                40083     40084         +1
```
This pull request has been merged in c402944.
Summary:
Pull Request resolved: pytorch#51093

ghstack-source-id: 120403675

Test Plan:
```
op           eager   nnc     speedup
hardswish    0.187   0.051   3.70
hardswish    0.052   0.052   1.00
sigmoid      0.148   1.177   0.13
reciprocal   0.049   0.050   0.98
neg          0.038   0.037   1.02
relu         0.037   0.036   1.03
isnan        0.119   0.020   5.86
log          0.082   1.330   0.06
log10        0.148   1.848   0.08
log1p        0.204   1.413   0.14
log2         0.285   1.167   0.24
exp          0.063   1.123   0.06
expm1        0.402   1.417   0.28
erf          0.167   0.852   0.20
erfc         0.181   1.098   0.16
cos          0.124   0.793   0.16
sin          0.126   0.838   0.15
tan          0.285   1.777   0.16
acos         0.144   1.358   0.11
asin         0.126   1.193   0.11
cosh         0.384   1.761   0.22
sinh         0.390   2.279   0.17
atan         0.240   1.564   0.15
tanh         0.320   2.259   0.14
sqrt         0.043   0.069   0.63
rsqrt        0.118   0.117   1.01
abs          0.038   0.037   1.03
ceil         0.038   0.038   1.01
floor        0.039   0.039   1.00
round        0.039   0.292   0.13
trunc        0.040   0.036   1.12
lgamma       2.045   2.721   0.75
```

Reviewed By: zheng-xq

Differential Revision: D26069791

fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba
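For anyone trying to reproduce numbers like these, routing pointwise chains through NNC may require explicitly enabling the tensor-expression fuser. The toggles below are private APIs from roughly this era of PyTorch and may have moved since, so treat the exact calls as assumptions:

```python
import torch

# Private toggles commonly used in PyTorch tests of this vintage to
# enable NNC's tensor-expression fuser for CPU pointwise chains.
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_set_texpr_fuser_enabled(True)

@torch.jit.script
def f(x):
    # A fusible pointwise chain; after a few profiling runs the
    # optimized graph should contain a TensorExprGroup node.
    return torch.sigmoid(x) * x

x = torch.randn(1 << 16)
for _ in range(3):
    f(x)  # warmup runs trigger profiling and fusion
print(f.graph_for(x))  # inspect whether fusion actually happened
```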