[nnc] Per-operator benchmarks #51093
Conversation
Operator-level benchmarks comparing eager-mode PyTorch to NNC-generated fused kernels. We wouldn't normally see these in isolation, but they point out where NNC is falling short (or doing well). I threw in a composed hardswish for fun, because it's my favorite activation function.

Notably, it exposes a bug in our build process that's preventing vectorization from using `sleef`, so we're using scalar calls to libm with predictably lousy performance. Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but will include the overhead of dispatching the fused kernel through TorchScript.

Differential Revision: [D26069791](https://our.internmc.facebook.com/intern/diff/D26069791/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26069791/)!

[ghstack-poisoned]
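For reference, a "composed hardswish" means hardswish spelled out as a chain of primitive pointwise ops, which the fuser then has to stitch into a single kernel, rather than the single fused ATen kernel. A minimal sketch of one such decomposition (the PR's exact formulation isn't quoted here, so treat this particular spelling as an assumption):

```python
import torch

def composed_hardswish(x):
    # hardswish(x) = x * relu6(x + 3) / 6, written as primitive
    # pointwise ops so the fuser has a multi-op chain to fuse.
    return x * (x + 3.0).clamp(0.0, 6.0) / 6.0

x = torch.randn(1024)
# Should agree with the single fused ATen kernel.
torch.testing.assert_allclose(composed_hardswish(x),
                              torch.nn.functional.hardswish(x))
```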
```python
traced = torch.jit.trace(lambda x: op(x), (x))
[traced(x) for _ in range(2)]
torch.testing.assert_allclose(op(x), traced(x))
teager = timeit.timeit(stmt="op(x)", globals=globals(), number=100)
```
Minor: maybe add a warmup to be more faithful?
Also, disable multi-threading in PyTorch for the benchmark with `torch.set_num_threads(1)`?
Heh, I always nuke threading from orbit with `numactl` or `taskset`, but yeah, good idea.
Oh, and the warmup is the `[traced(x) for _ in range(2)]` above, but it's easy to overlook :-p
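Folding both review suggestions into the quoted snippet, a self-contained sketch of the harness (`op` and the input size here are placeholder choices, not the PR's actual file):

```python
import timeit

import torch

# Pin PyTorch to a single thread so intra-op parallelism doesn't add
# noise to per-operator timings (per the review suggestion).
torch.set_num_threads(1)

op = torch.sigmoid        # placeholder: any pointwise operator
x = torch.randn(1 << 20)  # placeholder input size

traced = torch.jit.trace(lambda t: op(t), (x,))
for _ in range(2):
    traced(x)  # warmup: lets the JIT profile and fuse before timing
torch.testing.assert_allclose(op(x), traced(x))

teager = timeit.timeit(stmt="op(x)", globals=globals(), number=100)
tjit = timeit.timeit(stmt="traced(x)", globals=globals(), number=100)
print(f"eager {teager:.3f}s  nnc {tjit:.3f}s  speedup {teager / tjit:.2f}x")
```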
Codecov Report

```
@@                  Coverage Diff                  @@
##       gh/bertmaher/56/base    #51093      +/-   ##
======================================================
- Coverage             80.92%    80.92%     -0.01%
======================================================
  Files                  1926      1926
  Lines                210104    210104
======================================================
- Hits                 170021    170020         -1
- Misses                40083     40084         +1
```
This pull request has been merged in c402944.
Summary:
Pull Request resolved: pytorch#51093

ghstack-source-id: 120403675

Test Plan:
```
op           eager   nnc     speedup
hardswish    0.187   0.051   3.70
hardswish    0.052   0.052   1.00
sigmoid      0.148   1.177   0.13
reciprocal   0.049   0.050   0.98
neg          0.038   0.037   1.02
relu         0.037   0.036   1.03
isnan        0.119   0.020   5.86
log          0.082   1.330   0.06
log10        0.148   1.848   0.08
log1p        0.204   1.413   0.14
log2         0.285   1.167   0.24
exp          0.063   1.123   0.06
expm1        0.402   1.417   0.28
erf          0.167   0.852   0.20
erfc         0.181   1.098   0.16
cos          0.124   0.793   0.16
sin          0.126   0.838   0.15
tan          0.285   1.777   0.16
acos         0.144   1.358   0.11
asin         0.126   1.193   0.11
cosh         0.384   1.761   0.22
sinh         0.390   2.279   0.17
atan         0.240   1.564   0.15
tanh         0.320   2.259   0.14
sqrt         0.043   0.069   0.63
rsqrt        0.118   0.117   1.01
abs          0.038   0.037   1.03
ceil         0.038   0.038   1.01
floor        0.039   0.039   1.00
round        0.039   0.292   0.13
trunc        0.040   0.036   1.12
lgamma       2.045   2.721   0.75
```

Reviewed By: zheng-xq

Differential Revision: D26069791

fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba
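For anyone trying to reproduce numbers like these, routing pointwise chains through NNC may require explicitly enabling the tensor-expression fuser. The toggles below are private APIs from roughly this era of PyTorch and may have moved since, so treat the exact calls as assumptions:

```python
import torch

# Private toggles commonly used in PyTorch tests of this vintage to
# enable NNC's tensor-expression fuser for CPU pointwise chains.
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_set_texpr_fuser_enabled(True)

@torch.jit.script
def f(x):
    # A fusible pointwise chain; after a few profiling runs the
    # optimized graph should contain a TensorExprGroup node.
    return torch.sigmoid(x) * x

x = torch.randn(1 << 16)
for _ in range(3):
    f(x)  # warmup runs trigger profiling and fusion
print(f.graph_for(x))  # inspect whether fusion actually happened
```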