
Migrate frac from TH to ATen (CUDA) #28953

Closed
xuhdev wants to merge 5 commits into gh/xuhdev/47/base from gh/xuhdev/47/head

Conversation

@xuhdev
Collaborator

@xuhdev xuhdev commented Oct 31, 2019

Stack from ghstack:

Close #24566
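For context, `torch.frac` returns the fractional part of each input element, i.e. `x - trunc(x)`, so the result keeps the sign of the input. A minimal pure-Python sketch of that semantics (illustrative only; the `frac` helper below is a hypothetical stand-in, not the migrated kernel):

```python
import math

def frac(x: float) -> float:
    # Fractional part in the torch.frac sense: x - trunc(x).
    # Truncation (not floor) means the result keeps the input's sign.
    return x - math.trunc(x)

print(frac(2.75))   # 0.75
print(frac(-2.75))  # -0.75
print(frac(3.0))    # 0.0
```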

Benchmark (Debian Buster, CUDA 9.2, Quadro P400, turbo off, Release, gcc
7.4):

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.frac(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.frac(a); torch.cuda.synchronize()', setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")', number=t))
```
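For readers without a CUDA device, the same timing pattern (a fixed repetition count per problem size via `timeit`) can be exercised against a CPU-only stand-in; this is a hedged sketch, with `workload` as a hypothetical pure-Python substitute for the GPU kernel being benchmarked:

```python
import timeit

def workload(n):
    # Pure-Python stand-in computing a fractional part per element,
    # roughly analogous to what torch.frac does on the GPU.
    return [x - int(x) for x in (i * 0.5 for i in range(n))]

for n, t in [(1_000, 100), (10_000, 100)]:
    elapsed = timeit.timeit(lambda: workload(n), number=t)
    print(f'workload n={n}, {t} reps: {elapsed:.4f}s')
```

Note that the GPU benchmark above calls `torch.cuda.synchronize()` inside the timed statement, since CUDA kernels launch asynchronously; a CPU workload needs no such synchronization.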

Before:

```
torch.frac(a) a.numel() == 10000 for 20000 times torch.half
0.3608182370007853
torch.frac(a) a.numel() == 10000 for 20000 times torch.float
0.3647012189976522
torch.frac(a) a.numel() == 10000 for 20000 times torch.double
0.3889585220022127
torch.frac(a) a.numel() == 100000 for 20000 times torch.half
0.622635444997286
torch.frac(a) a.numel() == 100000 for 20000 times torch.float
0.9595754649999435
torch.frac(a) a.numel() == 100000 for 20000 times torch.double
1.5590267750012572
```

After:

```
torch.frac(a) a.numel() == 10000 for 20000 times torch.half
0.3675256470014574
torch.frac(a) a.numel() == 10000 for 20000 times torch.float
0.3703597319981782
torch.frac(a) a.numel() == 10000 for 20000 times torch.double
0.372184894993552
torch.frac(a) a.numel() == 100000 for 20000 times torch.half
0.60767333900003
torch.frac(a) a.numel() == 100000 for 20000 times torch.float
0.9645607889979146
torch.frac(a) a.numel() == 100000 for 20000 times torch.double
1.5542530329985311
```

Differential Revision: [D18302768](https://our.internmc.facebook.com/intern/diff/D18302768)

xuhdev added a commit that referenced this pull request Oct 31, 2019
ghstack-source-id: 5f9c0f9
Pull Request resolved: #28953
@xuhdev xuhdev requested a review from VitalyFedyunin October 31, 2019 06:00
@xuhdev xuhdev added the module: operators and module: cuda labels Oct 31, 2019
@kostmo
Member

kostmo commented Oct 31, 2019

CircleCI build failures summary

As of commit 4a5421e:

  • 0/4 flaky


xuhdev added a commit that referenced this pull request Oct 31, 2019
ghstack-source-id: a8cea62
Pull Request resolved: #28953
@xuhdev
Collaborator Author

xuhdev commented Nov 6, 2019

@VitalyFedyunin Are you merging this? I'm going to rebase.

@VitalyFedyunin
Contributor

Go ahead with rebase

@xuhdev
Collaborator Author

xuhdev commented Nov 6, 2019

Done

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 9, 2019
Summary:
Pull Request resolved: pytorch/pytorch#28953

Test Plan: Imported from OSS

Differential Revision: D18302768

Pulled By: VitalyFedyunin

fbshipit-source-id: 24198838dc903d455155f0819d0c7d58974aaecd
@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 4606deb.

@facebook-github-bot facebook-github-bot deleted the gh/xuhdev/47/head branch November 13, 2019 15:17

Labels

Merged, module: cuda (Related to torch.cuda, and CUDA support in general)

5 participants