Vectorize SmoothL1Loss forward (CPU) #37115
Conversation
Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz):
```python
import timeit

for op in ('SmoothL1Loss',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float', 'torch.bfloat16'):
        for n, t in [(10_000, 100000),
                     (100_000, 10000)]:
            print(f'torch.nn.{op}()(a, b), |a-b|>1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 3, dtype={dtype})', number=t))
            print(f'torch.nn.{op}()(a, b), |a-b|<1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 1.5, dtype={dtype})', number=t))
```
Results:
Before:
```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8427017140056705
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.823863306999556
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9239509999897564
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9014650480094133
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4530331650021253
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4551637870026752
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5716871829936281
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5748704470024677
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
9.777982015002635
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
12.627838339001755
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
7.810075458997744
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
10.73597132100258
```
After:
```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.7517106710001826
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.734853860005387
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9275081039959332
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.8772386749915313
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.3512494120077463
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.3484660190006252
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5543672500061803
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5557958419958595
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.1945357249933295
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.186975386997801
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.300515823008027
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3039640819915803
```
[ghstack-poisoned]
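For reference, the benchmark's two regimes correspond to the two branches of the Smooth L1 definition. A plain-Python sketch of that definition (beta = 1, mean reduction; this is only a reference model, not the PR's vectorized C++ kernel):

```python
def smooth_l1_loss(a, b):
    """Mean-reduced Smooth L1 loss over two equal-length sequences.

    Per element z = x - y: 0.5 * z**2 if |z| < 1, else |z| - 0.5.
    """
    def elem(z):
        az = abs(z)
        return 0.5 * az * az if az < 1.0 else az - 0.5
    return sum(elem(x - y) for x, y in zip(a, b)) / len(a)

# The benchmark's two regimes:
print(smooth_l1_loss([1.0] * 4, [3.0] * 4))  # |a-b| = 2 > 1 -> linear branch, 1.5
print(smooth_l1_loss([1.0] * 4, [1.5] * 4))  # |a-b| = 0.5 < 1 -> quadratic branch, 0.125
```

Because every element falls on the same side of the threshold in each run, the benchmark isolates the cost of each branch of the vectorized select.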
Code looks good, please fix the clang errors.
@VitalyFedyunin Done! (The rest of the clang errors are not related to this PR.)
Looks good to me. It is a bit irritating that no unit test caught the inverted `z < one_vec` conditional in the first check-ins. Something like a `test_l1_loss_correct()` seems to be missing for `smooth_l1_loss`.
@andreaskoepf
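The check the reviewer asks for could look something like this (test name and structure hypothetical; written as a plain-Python model of the kernel's select rather than against the real torch API): evaluate points on both sides of the |z| == 1 cutover, so an inverted select condition cannot pass silently.

```python
def smooth_l1(z, invert_bug=False):
    """Smooth L1 for one element, modeling the vectorized compute-both-
    branches-then-select (blendv-style) formulation."""
    az = abs(z)
    quad, lin = 0.5 * az * az, az - 0.5
    cond = az < 1.0
    if invert_bug:
        # Model of the inverted `z < one_vec`-style condition: the wrong
        # branch is selected on every element.
        cond = not cond
    return quad if cond else lin

# One point per branch; a test that straddles the threshold catches
# the inversion, since the two branches disagree away from |z| == 1.
for z in (0.5, 2.0):
    assert smooth_l1(z) != smooth_l1(z, invert_bug=True)
assert smooth_l1(0.5) == 0.125 and smooth_l1(2.0) == 1.5
```

Note that a test using only |z| == 1 would not catch the bug, because both branches evaluate to 0.5 exactly at the threshold.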
Can you rebase to master?
Benchmark (same machine and script as above; the Before numbers are unchanged from the previous run).
After:
```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8420191049808636
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.8814279660000466
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9491433810035232
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9144560259883292
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4458729829930235
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4474395569995977
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5676976410031784
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5793530470109545
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.32380092900712
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.332892568985699
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3354615129937883
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3352111729909666
```
Differential Revision: [D21351860](https://our.internmc.facebook.com/intern/diff/D21351860)
[ghstack-poisoned]
@VitalyFedyunin Done
Please rebase, fails internal merge with |
Done |
Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz):
```python
import timeit
for op in ('SmoothL1Loss',):
print('Forward')
for dtype in ('torch.double', 'torch.float', 'torch.bfloat16'):
for n, t in [(10_000, 100000),
(100_000, 10000)]:
print(f'torch.nn.{op}()(a, b), |a-b|>1, numel() == {n} for {t} times, dtype={dtype}')
print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 3, dtype={dtype})', number=t))
print(f'torch.nn.{op}()(a, b), |a-b|<1, numel() == {n} for {t} times, dtype={dtype}')
print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 1.5, dtype={dtype})', number=t))
```
Results:
Before:
```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8427017140056705
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.823863306999556
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9239509999897564
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9014650480094133
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4530331650021253
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4551637870026752
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5716871829936281
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5748704470024677
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
9.777982015002635
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
12.627838339001755
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
7.810075458997744
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
10.73597132100258
```
After:
```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8420191049808636
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.8814279660000466
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9491433810035232
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9144560259883292
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4458729829930235
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4474395569995977
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5676976410031784
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5793530470109545
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.32380092900712
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.332892568985699
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3354615129937883
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3352111729909666
```
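For context, the two input regimes in the benchmark (`|a-b|>1` and `|a-b|<1`) each exercise one branch of the loss: Smooth L1 is quadratic for element differences below the threshold and linear above it. A minimal pure-Python sketch of the elementwise definition (assuming the fixed threshold of 1 used by `torch.nn.SmoothL1Loss` here; this is an illustrative helper, not the PR's kernel code):

```python
def smooth_l1(x, y):
    # Elementwise Smooth L1 loss with threshold 1:
    #   d < 1:  0.5 * d**2   (quadratic near zero)
    #   d >= 1: d - 0.5      (linear in the tails)
    d = abs(x - y)
    return 0.5 * d * d if d < 1.0 else d - 0.5

# The benchmark's two regimes hit one branch each:
print(smooth_l1(1.0, 3.0))  # |a-b| = 2   -> linear branch, 1.5
print(smooth_l1(1.0, 1.5))  # |a-b| = 0.5 -> quadratic branch, 0.125
```

Benchmarking both regimes checks that vectorizing the branchy definition does not regress either path.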
Differential Revision: [D21351860](https://our.internmc.facebook.com/intern/diff/D21351860)
@VitalyFedyunin merged this pull request in eac54f1.
Summary: Pull Request resolved: pytorch#37115 (benchmark and results as above)
Test Plan: Imported from OSS
Differential Revision: D21351860
Pulled By: VitalyFedyunin
fbshipit-source-id: b19ca1e58586d964972e5c495aba10c8808cd747