Add in-place operator rsub_. #25115

Closed
xuhdev wants to merge 14 commits into gh/xuhdev/30/base from gh/xuhdev/30/head

Conversation

@xuhdev (Collaborator) commented Aug 23, 2019

Stack from ghstack:

rsub_ offers much better performance when one wants to perform `a = b - a`.
Benchmark shows a significant runtime improvement for this scenario when
rsub_ is available:

Script:

```python
import timeit

for device in ('cpu', 'cuda'):
    print(device)
    for n, t in [(1000, 40_000),
                 (100_000, 4000)]:
        print(f'a = 1 - a (a.numel() == {n}) for {t} times')
        print(timeit.timeit('a = 1 - a; torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.rand({n}, device="{device}")',
                            number=t))
        print(f'a.rsub_(1) (a.numel() == {n}) for {t} times')
        print(timeit.timeit('a.rsub_(1); torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.rand({n}, device="{device}")',
                            number=t))
        print(f'torch.sub(1, a, out=a) (a.numel() == {n}) for {t} times')
        print(timeit.timeit('torch.sub(1, a, out=a); torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.rand({n}, device="{device}")',
                            number=t))
```

Output:

```
cpu
a = 1 - a (a.numel() == 1000) for 40000 times
9.482540775090456
a.rsub_(1) (a.numel() == 1000) for 40000 times
4.980003070086241
torch.sub(1, a, out=a) (a.numel() == 1000) for 40000 times
6.157954938709736
a = 1 - a (a.numel() == 100000) for 4000 times
8.676277630031109
a.rsub_(1) (a.numel() == 100000) for 4000 times
8.622812658548355
torch.sub(1, a, out=a) (a.numel() == 100000) for 4000 times
8.741503324359655
cuda
a = 1 - a (a.numel() == 1000) for 40000 times
5.594695191830397
a.rsub_(1) (a.numel() == 1000) for 40000 times
3.868969976902008
torch.sub(1, a, out=a) (a.numel() == 1000) for 40000 times
4.848179902881384
a = 1 - a (a.numel() == 100000) for 4000 times
0.5615557432174683
a.rsub_(1) (a.numel() == 100000) for 4000 times
0.3879886604845524
torch.sub(1, a, out=a) (a.numel() == 100000) for 4000 times
0.48638539388775826
```

Additionally:

- rsub.out is intentionally left out, because sub.out should be sufficient.
- torch.rsub (a function variant) is also added to match torch.sub.

Pull Request resolved: #25115
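For intuition, the `rsub_` proposed here is in-place reverse subtraction: `a.rsub_(b)` computes `a = b - a` while reusing `a`'s storage, whereas the expression `a = 1 - a` binds `a` to a freshly allocated result. A minimal stdlib sketch of that distinction, using a plain Python list as a hypothetical stand-in for the tensor buffer (not the real torch API):

```python
def rsub_inplace(a, scalar):
    """Hypothetical analogue of the proposed a.rsub_(scalar):
    overwrite each element with scalar - a[i], reusing a's buffer."""
    for i, v in enumerate(a):
        a[i] = scalar - v
    return a

a = [0.25, 0.5, 0.75]
buffer_id = id(a)
rsub_inplace(a, 1)
assert a == [0.75, 0.5, 0.25]  # each element became scalar - old value
assert id(a) == buffer_id      # same buffer; a = 1 - a would allocate a new one
```

The benchmark above measures exactly this difference at tensor scale: the in-place path skips one allocation (and one extra pass over memory) per iteration.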

@pytorchbot added the module: docs, module: internals, and module: operators labels Aug 23, 2019
@ssnl (Collaborator) commented Aug 23, 2019

This is a bit unintuitive... Is torch.sub(b, a, out=a) slow?

xuhdev added a commit that referenced this pull request Aug 23, 2019

ghstack-source-id: 2f283dd
Pull Request resolved: #25115
xuhdev added a commit that referenced this pull request Aug 26, 2019

ghstack-source-id: 8952934
Pull Request resolved: #25115
@xuhdev (Collaborator, Author) commented Aug 27, 2019

> This is a bit unintuitive... Is torch.sub(b, a, out=a) slow?

@ssnl I updated the benchmark to include your suggestion -- yes, it is often slower. But I'm also puzzled by this.
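The thread leaves the cause open, but one plausible factor is fixed per-call overhead: `torch.sub(1, a, out=a)` does more argument and `out=` handling per call than the method `a.rsub_(1)`, and on tiny tensors that fixed cost is a large fraction of the total. A stdlib-only sketch of the effect, where `rsub_out` and `rsub_` are hypothetical list analogues (not the real torch ops) differing only in per-call validation:

```python
import timeit

def rsub_out(src, out, scalar=1.0):
    # analogue of torch.sub(scalar, src, out=out): extra per-call checks
    if len(out) != len(src):
        raise ValueError("shape mismatch")
    for i, v in enumerate(src):
        out[i] = scalar - v
    return out

def rsub_(a, scalar=1.0):
    # analogue of a.rsub_(scalar): writes in place with no output handling
    for i, v in enumerate(a):
        a[i] = scalar - v
    return a

a = [0.5] * 8  # tiny "tensor": per-call overhead dominates per-element work
t_out = timeit.timeit(lambda: rsub_out(a, a), number=10_000)
t_inplace = timeit.timeit(lambda: rsub_(a), number=10_000)
print(t_out, t_inplace)
```

On large inputs the per-element loop dominates and the gap between the two paths closes, which is consistent with the near-identical 100000-element CPU timings in the benchmark above.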

@zdevito removed their request for review on August 27, 2019 20:15
xuhdev added a commit that referenced this pull request Aug 27, 2019

ghstack-source-id: 20a7704
Pull Request resolved: #25115
xuhdev added a commit that referenced this pull request Aug 27, 2019

ghstack-source-id: 3c7312f
Pull Request resolved: #25115
rsub_ offers a much better performance when one wants to perform `a = b
- a`.
Benchmark shows a significant runtime improvement for this
scenario when rsub_ is available:

Script:

```python
import timeit

for device in ('cpu', 'cuda'):
        print(device)
        for n, t in [(1000, 40_000),
                     (100_000, 4000)]:
                print(f'a = 1 - a (a.numel() == {n}) for {t} times')
                print(timeit.timeit(f'a=1-a; torch.cuda.synchronize()',
                                    setup=f'import torch; a =
torch.rand({n}, device="{device}");',
                                    number=t))
                print(f'a.rsub_(1) (a.numel() == {n}) for {t} times')
                print(timeit.timeit(f'a.rsub_(1);
torch.cuda.synchronize()',
                                    setup=f'import torch; a =
torch.rand({n}, device="{device}");',
                                    number=t))
                print(f'torch.sub(1, a, out=a) (a.numel() == {n}) for
{t} times')
                print(timeit.timeit(f'torch.sub(1, a, out=a);
torch.cuda.synchronize()',
                                    setup=f'import torch; a =
torch.rand({n}, device="{device}");',
                                    number=t))
```

Output:

```
cpu
a = 1 - a (a.numel() == 1000) for 40000 times
9.482540775090456
a.rsub_(1) (a.numel() == 1000) for 40000 times
4.980003070086241
torch.sub(1, a, out=a) (a.numel() == 1000) for 40000 times
6.157954938709736
a = 1 - a (a.numel() == 100000) for 4000 times
8.676277630031109
a.rsub_(1) (a.numel() == 100000) for 4000 times
8.622812658548355
torch.sub(1, a, out=a) (a.numel() == 100000) for 4000 times
8.741503324359655
cuda
a = 1 - a (a.numel() == 1000) for 40000 times
5.594695191830397
a.rsub_(1) (a.numel() == 1000) for 40000 times
3.868969976902008
torch.sub(1, a, out=a) (a.numel() == 1000) for 40000 times
4.848179902881384
a = 1 - a (a.numel() == 100000) for 4000 times
0.5615557432174683
a.rsub_(1) (a.numel() == 100000) for 4000 times
0.3879886604845524
torch.sub(1, a, out=a) (a.numel() == 100000) for 4000 times
0.48638539388775826
```
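To make the totals above easier to read, they can be converted into per-call times and a relative speedup. This is a small sketch using the CPU numbers for `n == 1000` from the output above (the helper `per_call_us` is just for illustration, not part of the PR):

```python
# Convert the timeit totals above into per-call microseconds and a speedup ratio.
def per_call_us(total_s, n_calls):
    return total_s / n_calls * 1e6

baseline = per_call_us(9.482540775090456, 40_000)  # a = 1 - a (CPU, n=1000)
inplace = per_call_us(4.980003070086241, 40_000)   # a.rsub_(1) (CPU, n=1000)
speedup = baseline / inplace                       # roughly 1.9x for small tensors
```

For large tensors the kernels are memory-bound and the gap narrows, which matches the `n == 100000` rows above.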

Additionally:

- rsub.out is intentionally left out, since sub.out should be sufficient.
- torch.rsub (the function variant) is also added to match torch.sub.

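For clarity, the intended semantics of the new operator can be sketched in plain Python (this is only an illustration of the element-wise behavior, not the PyTorch implementation, and `rsub_` here is a hypothetical stand-in operating on a list):

```python
# Semantics sketch: a.rsub_(b) replaces each element x of `a` with b - x,
# in place, mirroring `a = b - a` without allocating a new buffer.
def rsub_(a, b):
    for i, x in enumerate(a):
        a[i] = b - x
    return a

a = [0.25, 0.5, 0.75]
rsub_(a, 1)  # a is now [0.75, 0.5, 0.25]
```

Avoiding the temporary allocation of `b - a` is where the benchmarked speedup comes from.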
Pull Request resolved: #25115
xuhdev added a commit that referenced this pull request Aug 28, 2019
xuhdev added a commit that referenced this pull request Aug 29, 2019
xuhdev added a commit that referenced this pull request Aug 29, 2019
xuhdev added a commit that referenced this pull request Aug 31, 2019
xuhdev added a commit that referenced this pull request Oct 7, 2019
@cpuhrsch cpuhrsch added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 11, 2019
@gchanan gchanan removed their request for review June 3, 2020 15:52
@facebook-github-bot facebook-github-bot deleted the gh/xuhdev/30/head branch August 13, 2020 14:15