
Migrate acos from TH to ATen (CUDA) #29323

Closed

xuhdev wants to merge 3 commits into gh/xuhdev/51/base from gh/xuhdev/51/head

Conversation

@xuhdev (Collaborator) commented on Nov 6, 2019

Stack from ghstack:

Benchmark (Debian Buster, gcc 7.4, Release build, P400, turbo off):

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.acos(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.acos(a); torch.cuda.synchronize()', setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")', number=t))
```

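Note that `torch.cuda.synchronize()` is part of the timed statement: CUDA kernel launches return before the kernel finishes, so timing only the launch measures queueing, not execution. The same pitfall can be sketched without a GPU, using a background thread as a stand-in for the device (`slow_kernel` and `launch` are hypothetical names for this illustration, not PyTorch API):

```python
# Timing an asynchronous "launch" with and without waiting for completion.
# A worker thread plays the role of the CUDA stream; .result() plays the
# role of torch.cuda.synchronize().
import time
import timeit
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)

def slow_kernel():
    time.sleep(0.01)  # stand-in for ~10 ms of device work

def launch():
    # submit() returns a Future immediately, like an async kernel launch
    return pool.submit(slow_kernel)

# Timing only the launch vastly undercounts the real cost:
wrong = timeit.timeit('launch()', globals=globals(), number=5)
# Waiting for completion inside the timed statement counts the work:
right = timeit.timeit('launch().result()', globals=globals(), number=5)
print(f'launch only: {wrong:.4f}s, launch + wait: {right:.4f}s')
```

This is why the benchmark above keeps the synchronize call inside the string passed to `timeit`, rather than synchronizing once at the end.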
Before:

```
torch.acos(a) a.numel() == 10000 for 20000 times torch.half
0.3783099120009865
torch.acos(a) a.numel() == 10000 for 20000 times torch.float
0.37258279799971206
torch.acos(a) a.numel() == 10000 for 20000 times torch.double
0.5627449999992677
torch.acos(a) a.numel() == 100000 for 20000 times torch.half
0.8581132070012245
torch.acos(a) a.numel() == 100000 for 20000 times torch.float
1.0164795860000595
torch.acos(a) a.numel() == 100000 for 20000 times torch.double
2.644646360999104
```

After:

```
torch.acos(a) a.numel() == 10000 for 20000 times torch.half
0.3873771430007764
torch.acos(a) a.numel() == 10000 for 20000 times torch.float
0.38498222500038537
torch.acos(a) a.numel() == 10000 for 20000 times torch.double
0.5826049269999203
torch.acos(a) a.numel() == 100000 for 20000 times torch.half
0.8118497010000283
torch.acos(a) a.numel() == 100000 for 20000 times torch.float
1.0175845949997893
torch.acos(a) a.numel() == 100000 for 20000 times torch.double
2.658536324999659
```

Close #24532

Differential Revision: D18406806
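The raw `timeit` totals above are easier to compare as per-call times and relative change. A small pure-Python helper (the numbers are copied verbatim from the Before/After tables; `summarize` is a throwaway name for this post-processing, not part of the PR):

```python
# Reduce the timeit totals to per-call microseconds and the relative
# change after the TH -> ATen migration. Each total covers 20000 calls.
CALLS = 20000

results = {
    # (numel, dtype): (before_seconds, after_seconds)
    (10_000, 'half'):    (0.3783099120009865, 0.3873771430007764),
    (10_000, 'float'):   (0.37258279799971206, 0.38498222500038537),
    (10_000, 'double'):  (0.5627449999992677, 0.5826049269999203),
    (100_000, 'half'):   (0.8581132070012245, 0.8118497010000283),
    (100_000, 'float'):  (1.0164795860000595, 1.0175845949997893),
    (100_000, 'double'): (2.644646360999104, 2.658536324999659),
}

def summarize(results, calls=CALLS):
    rows = []
    for (numel, dtype), (before, after) in sorted(results.items()):
        rows.append({
            'numel': numel,
            'dtype': dtype,
            'before_us': before / calls * 1e6,  # per-call microseconds
            'after_us': after / calls * 1e6,
            'delta_pct': (after - before) / before * 100.0,
        })
    return rows

for r in summarize(results):
    print(f"{r['numel']:>7} {r['dtype']:<7} "
          f"{r['before_us']:7.2f}us -> {r['after_us']:7.2f}us "
          f"({r['delta_pct']:+.1f}%)")
```

This makes the pattern visible at a glance: the migration is within a few percent of TH across all six configurations, with the largest movement being the ~5% improvement for half precision at 100k elements.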

xuhdev added a commit that referenced this pull request Nov 6, 2019
ghstack-source-id: cb04416
Pull Request resolved: #29323
xuhdev requested a review from VitalyFedyunin on November 6, 2019 20:53
xuhdev added a commit that referenced this pull request Nov 6, 2019
ghstack-source-id: 1993d43
Pull Request resolved: #29323
xuhdev added a commit that referenced this pull request Nov 7, 2019
ghstack-source-id: b28c3d4
Pull Request resolved: #29323
zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 9, 2019
Summary:
Pull Request resolved: pytorch/pytorch#29323

Test Plan: Imported from OSS

Differential Revision: D18406806

Pulled By: VitalyFedyunin

fbshipit-source-id: 2d012485f4747fae0ddbcf2e08b1d75ef5274a19
@facebook-github-bot (Contributor) commented:
@VitalyFedyunin merged this pull request in 6c02067.

facebook-github-bot deleted the gh/xuhdev/51/head branch on November 13, 2019 15:17