Add in-place operator rsub_. #25115

Closed
xuhdev wants to merge 14 commits into gh/xuhdev/30/base from gh/xuhdev/30/head

Conversation

@xuhdev (Collaborator) commented Aug 23, 2019

Stack from ghstack:

rsub_ offers much better performance when one wants to perform `a = b - a`.
Benchmark shows a significant runtime improvement for this scenario when
rsub_ is available:

Script:

```python
import timeit

for device in ('cpu', 'cuda'):
    print(device)
    for n, t in [(1000, 40_000),
                 (100_000, 4000)]:
        print(f'a = 1 - a (a.numel() == {n}) for {t} times')
        print(timeit.timeit('a = 1 - a; torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.rand({n}, device="{device}")',
                            number=t))
        print(f'a.rsub_(1) (a.numel() == {n}) for {t} times')
        print(timeit.timeit('a.rsub_(1); torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.rand({n}, device="{device}")',
                            number=t))
        print(f'torch.sub(1, a, out=a) (a.numel() == {n}) for {t} times')
        print(timeit.timeit('torch.sub(1, a, out=a); torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.rand({n}, device="{device}")',
                            number=t))
```

Output:

```
cpu
a = 1 - a (a.numel() == 1000) for 40000 times
9.482540775090456
a.rsub_(1) (a.numel() == 1000) for 40000 times
4.980003070086241
torch.sub(1, a, out=a) (a.numel() == 1000) for 40000 times
6.157954938709736
a = 1 - a (a.numel() == 100000) for 4000 times
8.676277630031109
a.rsub_(1) (a.numel() == 100000) for 4000 times
8.622812658548355
torch.sub(1, a, out=a) (a.numel() == 100000) for 4000 times
8.741503324359655
cuda
a = 1 - a (a.numel() == 1000) for 40000 times
5.594695191830397
a.rsub_(1) (a.numel() == 1000) for 40000 times
3.868969976902008
torch.sub(1, a, out=a) (a.numel() == 1000) for 40000 times
4.848179902881384
a = 1 - a (a.numel() == 100000) for 4000 times
0.5615557432174683
a.rsub_(1) (a.numel() == 100000) for 4000 times
0.3879886604845524
torch.sub(1, a, out=a) (a.numel() == 100000) for 4000 times
0.48638539388775826
```

Additionally:

- rsub.out is intentionally left out, because sub.out should be sufficient.
- torch.rsub (a function variant) is also added to match torch.sub.

Pull Request resolved: #25115
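For intuition, the `rsub_` proposed here is in-place reverse subtraction: `a.rsub_(b)` computes `a = b - a` while reusing `a`'s storage, whereas the expression `a = 1 - a` binds `a` to a freshly allocated result. A minimal stdlib sketch of that distinction, using a plain Python list as a hypothetical stand-in for the tensor buffer (not the real torch API):

```python
def rsub_inplace(a, scalar):
    """Hypothetical analogue of the proposed a.rsub_(scalar):
    overwrite each element with scalar - a[i], reusing a's buffer."""
    for i, v in enumerate(a):
        a[i] = scalar - v
    return a

a = [0.25, 0.5, 0.75]
buffer_id = id(a)
rsub_inplace(a, 1)
assert a == [0.75, 0.5, 0.25]  # each element became scalar - old value
assert id(a) == buffer_id      # same buffer; a = 1 - a would allocate a new one
```

The benchmark above measures exactly this difference at tensor scale: the in-place path skips one allocation (and one extra pass over memory) per iteration.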

@pytorchbot added the module: docs, module: internals, and module: operators labels Aug 23, 2019
@ssnl (Collaborator) commented Aug 23, 2019

This is a bit unintuitive... Is torch.sub(b, a, out=a) slow?

xuhdev added a commit that referenced this pull request Aug 23, 2019

ghstack-source-id: 2f283dd
Pull Request resolved: #25115
xuhdev added a commit that referenced this pull request Aug 26, 2019

ghstack-source-id: 8952934
Pull Request resolved: #25115
@xuhdev (Collaborator, Author) commented Aug 27, 2019

> This is a bit unintuitive... Is torch.sub(b, a, out=a) slow?

@ssnl I updated the benchmark to include your suggestion -- yes, it is often slower. But I'm also puzzled by this.
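The thread leaves the cause open, but one plausible factor is fixed per-call overhead: `torch.sub(1, a, out=a)` does more argument and `out=` handling per call than the method `a.rsub_(1)`, and on tiny tensors that fixed cost is a large fraction of the total. A stdlib-only sketch of the effect, where `rsub_out` and `rsub_` are hypothetical list analogues (not the real torch ops) differing only in per-call validation:

```python
import timeit

def rsub_out(src, out, scalar=1.0):
    # analogue of torch.sub(scalar, src, out=out): extra per-call checks
    if len(out) != len(src):
        raise ValueError("shape mismatch")
    for i, v in enumerate(src):
        out[i] = scalar - v
    return out

def rsub_(a, scalar=1.0):
    # analogue of a.rsub_(scalar): writes in place with no output handling
    for i, v in enumerate(a):
        a[i] = scalar - v
    return a

a = [0.5] * 8  # tiny "tensor": per-call overhead dominates per-element work
t_out = timeit.timeit(lambda: rsub_out(a, a), number=10_000)
t_inplace = timeit.timeit(lambda: rsub_(a), number=10_000)
print(t_out, t_inplace)
```

On large inputs the per-element loop dominates and the gap between the two paths closes, which is consistent with the near-identical 100000-element CPU timings in the benchmark above.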

@zdevito removed their request for review on August 27, 2019 20:15
xuhdev added a commit that referenced this pull request Aug 27, 2019

ghstack-source-id: 20a7704
Pull Request resolved: #25115
xuhdev added a commit that referenced this pull request Aug 27, 2019

ghstack-source-id: 3c7312f
Pull Request resolved: #25115
rsub_ offers a much better performance when one wants to perform `a = b
- a`.
Benchmark shows a significant runtime improvement for this
scenario when rsub_ is available:

Script:

```python
import timeit

for device in ('cpu', 'cuda'):
        print(device)
        for n, t in [(1000, 40_000),
                     (100_000, 4000)]:
                print(f'a = 1 - a (a.numel() == {n}) for {t} times')
                print(timeit.timeit(f'a=1-a; torch.cuda.synchronize()',
                                    setup=f'import torch; a =
torch.rand({n}, device="{device}");',
                                    number=t))
                print(f'a.rsub_(1) (a.numel() == {n}) for {t} times')
                print(timeit.timeit(f'a.rsub_(1);
torch.cuda.synchronize()',
                                    setup=f'import torch; a =
torch.rand({n}, device="{device}");',
                                    number=t))
                print(f'torch.sub(1, a, out=a) (a.numel() == {n}) for
{t} times')
                print(timeit.timeit(f'torch.sub(1, a, out=a);
torch.cuda.synchronize()',
                                    setup=f'import torch; a =
torch.rand({n}, device="{device}");',
                                    number=t))
```

Output:

```
cpu
a = 1 - a (a.numel() == 1000) for 40000 times
9.482540775090456
a.rsub_(1) (a.numel() == 1000) for 40000 times
4.980003070086241
torch.sub(1, a, out=a) (a.numel() == 1000) for 40000 times
6.157954938709736
a = 1 - a (a.numel() == 100000) for 4000 times
8.676277630031109
a.rsub_(1) (a.numel() == 100000) for 4000 times
8.622812658548355
torch.sub(1, a, out=a) (a.numel() == 100000) for 4000 times
8.741503324359655
cuda
a = 1 - a (a.numel() == 1000) for 40000 times
5.594695191830397
a.rsub_(1) (a.numel() == 1000) for 40000 times
3.868969976902008
torch.sub(1, a, out=a) (a.numel() == 1000) for 40000 times
4.848179902881384
a = 1 - a (a.numel() == 100000) for 4000 times
0.5615557432174683
a.rsub_(1) (a.numel() == 100000) for 4000 times
0.3879886604845524
torch.sub(1, a, out=a) (a.numel() == 100000) for 4000 times
0.48638539388775826
```
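To make the totals above easier to read, they can be converted into per-call times and a relative speedup. This is a small sketch using the CPU numbers for `n == 1000` from the output above (the helper `per_call_us` is just for illustration, not part of the PR):

```python
# Convert the timeit totals above into per-call microseconds and a speedup ratio.
def per_call_us(total_s, n_calls):
    return total_s / n_calls * 1e6

baseline = per_call_us(9.482540775090456, 40_000)  # a = 1 - a (CPU, n=1000)
inplace = per_call_us(4.980003070086241, 40_000)   # a.rsub_(1) (CPU, n=1000)
speedup = baseline / inplace                       # roughly 1.9x for small tensors
```

For large tensors the kernels are memory-bound and the gap narrows, which matches the `n == 100000` rows above.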

Additionally:

- rsub.out is intentionally left out, since sub.out should be sufficient.
- torch.rsub (the function variant) is also added to match torch.sub.

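For clarity, the intended semantics of the new operator can be sketched in plain Python (this is only an illustration of the element-wise behavior, not the PyTorch implementation, and `rsub_` here is a hypothetical stand-in operating on a list):

```python
# Semantics sketch: a.rsub_(b) replaces each element x of `a` with b - x,
# in place, mirroring `a = b - a` without allocating a new buffer.
def rsub_(a, b):
    for i, x in enumerate(a):
        a[i] = b - x
    return a

a = [0.25, 0.5, 0.75]
rsub_(a, 1)  # a is now [0.75, 0.5, 0.25]
```

Avoiding the temporary allocation of `b - a` is where the benchmarked speedup comes from.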
Pull Request resolved: #25115
xuhdev added a commit that referenced this pull request Aug 28, 2019
xuhdev added a commit that referenced this pull request Aug 29, 2019
xuhdev added a commit that referenced this pull request Aug 29, 2019
xuhdev added a commit that referenced this pull request Aug 31, 2019
xuhdev added a commit that referenced this pull request Oct 7, 2019
@cpuhrsch cpuhrsch added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 11, 2019
@gchanan gchanan removed their request for review June 3, 2020 15:52
@facebook-github-bot facebook-github-bot deleted the gh/xuhdev/30/head branch August 13, 2020 14:15