
Vectorize SmoothL1Loss forward (CPU) #37115

Closed

xuhdev wants to merge 6 commits into gh/xuhdev/72/base from gh/xuhdev/72/head

Conversation

@xuhdev (Collaborator) commented Apr 22, 2020

Stack from ghstack:

Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz):

```python
import timeit
for op in ('SmoothL1Loss',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float', 'torch.bfloat16'):
        for n, t in [(10_000, 100000),
                     (100_000, 10000)]:
            print(f'torch.nn.{op}()(a, b), |a-b|>1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 3, dtype={dtype})', number=t))
            print(f'torch.nn.{op}()(a, b), |a-b|<1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 1.5, dtype={dtype})', number=t))
```

Results:

Before:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8427017140056705
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.823863306999556
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9239509999897564
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9014650480094133
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4530331650021253
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4551637870026752
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5716871829936281
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5748704470024677
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
9.777982015002635
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
12.627838339001755
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
7.810075458997744
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
10.73597132100258
```

After:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8420191049808636
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.8814279660000466
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9491433810035232
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9144560259883292
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4458729829930235
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4474395569995977
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5676976410031784
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5793530470109545
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.32380092900712
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.332892568985699
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3354615129937883
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3352111729909666
```
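The bfloat16 rows are where the vectorization pays off. A quick sanity check of the ratios (timings copied from the tables above, rounded; this snippet is my own illustration, not part of the PR):

```python
# Before/after bfloat16 timings (seconds) from the four bfloat16 rows above, rounded.
before = [9.778, 12.628, 7.810, 10.736]
after = [4.324, 4.333, 2.335, 2.335]
# Elementwise speedup ratio for each benchmark configuration.
speedups = [b / a for b, a in zip(before, after)]
print([f'{s:.2f}x' for s in speedups])  # → ['2.26x', '2.91x', '3.34x', '4.60x']
```

Double and float are essentially unchanged, which is expected: those paths were already reasonably fast, while the bfloat16 path previously fell back to a much slower scalar loop.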

Differential Revision: D21351860
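For context, SmoothL1Loss computes a Huber-style piecewise loss: 0.5 * z**2 when z = |a - b| < 1, and z - 0.5 otherwise. A minimal NumPy sketch of the branchless form that this kind of vectorization relies on (my illustration, not the actual ATen kernel; the name `smooth_l1` is hypothetical):

```python
import numpy as np

def smooth_l1(a, b):
    # Elementwise smooth L1 (beta = 1): 0.5*z^2 if |z| < 1, else |z| - 0.5.
    z = np.abs(a - b)
    # Branchless select over the whole array, analogous to a SIMD
    # compare-and-blend on the condition |z| < 1.
    return np.where(z < 1.0, 0.5 * z * z, z - 0.5)

print(smooth_l1(np.full(4, 1.0), np.full(4, 3.0)))  # |a-b| = 2 > 1 branch
print(smooth_l1(np.full(4, 1.0), np.full(4, 1.5)))  # |a-b| = 0.5 < 1 branch
```

The two benchmark scenarios (|a-b| > 1 and |a-b| < 1) exercise the two sides of this select.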

@xuhdev xuhdev requested a review from andreaskoepf April 22, 2020 23:42
@xuhdev xuhdev changed the title from "Vectorize SmoothL1Loss forward." to "Vectorize SmoothL1Loss forward (CPU)" Apr 22, 2020
@xuhdev xuhdev requested a review from VitalyFedyunin April 22, 2020 23:45
@dr-ci (Bot) commented Apr 22, 2020

💊 Build failures summary and remediations

As of commit 399402e (more details on the Dr. CI page):


None of the build failures appear to be your fault 💚



🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is newer than viable/strict, you can try basing on an older, stable commit:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)
```

If your commit is older than viable/strict:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```

Check out the recency history of this "viable master" tracking branch.



@VitalyFedyunin (Contributor) commented

Code looks good, please fix clang errors

@xuhdev (Collaborator, Author) commented Apr 23, 2020

@VitalyFedyunin Done! (The remaining clang errors are unrelated to this PR.)

xuhdev added a commit that referenced this pull request Apr 23, 2020
Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz):

```python
import timeit
for op in ('SmoothL1Loss',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float', 'torch.bfloat16'):
        for n, t in [(10_000, 100000),
                    (100_000, 10000)]:
            print(f'torch.nn.{op}()(a, b), |a-b|>1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 3, dtype={dtype})', number=t))
            print(f'torch.nn.{op}()(a, b), |a-b|<1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 1.5, dtype={dtype})', number=t))
```

Results:

Before:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8427017140056705
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.823863306999556
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9239509999897564
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9014650480094133
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4530331650021253
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4551637870026752
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5716871829936281
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5748704470024677
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
9.777982015002635
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
12.627838339001755
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
7.810075458997744
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
10.73597132100258
```

After:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.7517106710001826
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.734853860005387
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9275081039959332
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.8772386749915313
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.3512494120077463
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.3484660190006252
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5543672500061803
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5557958419958595
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.1945357249933295
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.186975386997801
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.300515823008027
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3039640819915803
```

ghstack-source-id: 3ffdcbb
Pull Request resolved: #37115
@mruberry mruberry added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Apr 23, 2020
@andreaskoepf (Collaborator) left a comment

Looks good to me. It is a bit irritating that no unit test caught the inverted `z < one_vec` conditional in the first check-ins. Something like test_l1_loss_correct() seems to be missing for smooth_l1_loss.
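A minimal pure-Python sketch of the kind of elementwise correctness check being suggested, showing how sample points on both sides of the |z| = 1 boundary catch an inverted conditional (illustrative only; `smooth_l1_ref` and the sample values are my own, not PyTorch's test suite):

```python
def smooth_l1_ref(x, y):
    # Scalar reference: 0.5*z^2 for z < 1, z - 0.5 otherwise (beta = 1).
    z = abs(x - y)
    return 0.5 * z * z if z < 1.0 else z - 0.5

def smooth_l1_buggy(x, y):
    # Same expression with the comparison inverted -- the kind of bug
    # an elementwise correctness test should catch.
    z = abs(x - y)
    return 0.5 * z * z if z >= 1.0 else z - 0.5

# Cover both branches: |x-y| = 2 (> 1) and |x-y| = 0.5 (< 1).
cases = [(1.0, 3.0), (1.0, 1.5)]
# A correctness test comparing against the reference would fail on every case.
assert all(smooth_l1_ref(*c) != smooth_l1_buggy(*c) for c in cases)
```

A single test point would only exercise one branch; two points straddling the boundary are the minimum needed to distinguish the correct and inverted conditions.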

@xuhdev (Collaborator, Author) commented Apr 23, 2020

> Looks good to me. It is a bit irritating that no unit tests caught the inverse z < one_vec conditional of the first check-ins. Something like test_l1_loss_correct() seems to be missing for smooth_l1_loss.

@andreaskoepf
test_l1_loss_correct should be able to catch that, but I pushed another update before it reported the error. Note that all test results have been updated after correcting the wrong condition.

@VitalyFedyunin (Contributor) commented

Can you rebase to master?

xuhdev added a commit that referenced this pull request May 1, 2020
ghstack-source-id: b76a9a7
Pull Request resolved: #37115
@xuhdev (Collaborator, Author) commented May 1, 2020

@VitalyFedyunin Done

@VitalyFedyunin (Contributor) commented

Please rebase; it fails the internal merge with:

The following files could not be merged:
aten/src/ATen/native/cpu/BinaryOpsKernel.cpp

Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz):

```python
import timeit
for op in ('SmoothL1Loss',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float', 'torch.bfloat16'):
        for n, t in [(10_000, 100000),
                    (100_000, 10000)]:
            print(f'torch.nn.{op}()(a, b), |a-b|>1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 3, dtype={dtype})', number=t))
            print(f'torch.nn.{op}()(a, b), |a-b|<1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 1.5, dtype={dtype})', number=t))
```

Results:

Before:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8427017140056705
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.823863306999556
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9239509999897564
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9014650480094133
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4530331650021253
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4551637870026752
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5716871829936281
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5748704470024677
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
9.777982015002635
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
12.627838339001755
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
7.810075458997744
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
10.73597132100258
```

After:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8420191049808636
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.8814279660000466
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9491433810035232
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9144560259883292
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4458729829930235
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4474395569995977
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5676976410031784
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5793530470109545
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.32380092900712
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.332892568985699
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3354615129937883
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3352111729909666
```
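Summarizing the two tables: double and float are essentially unchanged, while bfloat16 improves by roughly 2.3–4.6× depending on the case. A quick check of the ratios, using the first timing of each dtype (the `|a-b|>1, numel() == 10000` row) copied from the tables above:

```python
# First "|a-b|>1, numel() == 10000" timing per dtype, from the tables above.
before = {'double': 2.8427017140056705, 'float': 2.4530331650021253,
          'bfloat16': 9.777982015002635}
after = {'double': 2.8420191049808636, 'float': 2.4458729829930235,
         'bfloat16': 4.32380092900712}
for dtype in before:
    print(f'{dtype}: {before[dtype] / after[dtype]:.2f}x speedup')
# double and float come out at ~1.00x; bfloat16 at ~2.26x for this row.
```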

Differential Revision: [D21351860](https://our.internmc.facebook.com/intern/diff/D21351860)

[ghstack-poisoned]
xuhdev added a commit that referenced this pull request May 5, 2020
Same benchmark script and "Before" baseline as above.
After:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.7517106710001826
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.734853860005387
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9275081039959332
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.8772386749915313
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.3512494120077463
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.3484660190006252
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5543672500061803
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5557958419958595
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.1945357249933295
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.186975386997801
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.300515823008027
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3039640819915803
```

ghstack-source-id: 0e4436e
Pull Request resolved: #37115
@xuhdev (Collaborator, Author) commented May 5, 2020

Done

xuhdev added a commit that referenced this pull request May 6, 2020
Same benchmark script and results as the May 5 commit above.

ghstack-source-id: de6ada1
Pull Request resolved: #37115
@facebook-github-bot (Contributor):

@VitalyFedyunin merged this pull request in eac54f1.

@facebook-github-bot facebook-github-bot deleted the gh/xuhdev/72/head branch May 17, 2020 14:18
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#37115

Benchmark script and results are identical to the pull request description above.

Test Plan: Imported from OSS

Differential Revision: D21351860

Pulled By: VitalyFedyunin

fbshipit-source-id: b19ca1e58586d964972e5c495aba10c8808cd747