
Migrate var & std to ATen #39967

Closed
ShawnZhong wants to merge 1 commit into pytorch:master from ShawnZhong:std_var

Conversation

@ShawnZhong
Contributor

@ShawnZhong ShawnZhong commented Jun 12, 2020

Not sure why there are so many issues for std & var, but this PR should close them all:
std: Fix #24771, Fix #24676, Fix #24639, Fix #24529
var: Fix #24782, Fix #24677, Fix #24652, Fix #24530

import time
import torch

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

for device in (torch.device("cpu"), torch.device("cuda")):
    for size in (
        [100000000],
        [10000, 10000],
        [1000, 1000, 100],
        [100, 100, 100, 100],
    ):
        t = torch.randn(*size, device=device)
        total_time = 0
        for i in range(10):
            t1 = _time()
            t.std()
            t2 = _time()
            total_time += t2 - t1
        print(f"Tensor of size {size} on {device}: {total_time / 10}")
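
The loop above also times the first call for each tensor, so any one-time setup cost is averaged into the result. A minimal sketch of a harness that runs a few untimed warm-up calls first is shown here. The names `bench`, `warmup`, `iters`, and `sync` are hypothetical and not part of this PR, and none of the numbers reported in this thread were produced with it. It uses `time.perf_counter()`, which is monotonic and intended for measuring short durations.

```python
import time
from typing import Callable, Optional


def bench(fn: Callable[[], object],
          warmup: int = 3,
          iters: int = 10,
          sync: Optional[Callable[[], None]] = None) -> float:
    """Return the mean wall-clock seconds per call of fn over iters timed runs."""

    def now() -> float:
        # Optionally wait for outstanding asynchronous work before reading the clock.
        if sync is not None:
            sync()
        return time.perf_counter()

    # Untimed warm-up calls absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn()

    total = 0.0
    for _ in range(iters):
        start = now()
        fn()
        total += now() - start
    return total / iters
```

With PyTorch this might be called as `bench(t.std, sync=torch.cuda.synchronize)` for a CUDA tensor `t`, and as `bench(t.std)` for a CPU tensor.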

Before:

Tensor of size [100000000] on cpu: 0.36041643619537356
Tensor of size [10000, 10000] on cpu: 0.37235140800476074
Tensor of size [1000, 1000, 100] on cpu: 0.386572527885437
Tensor of size [100, 100, 100, 100] on cpu: 0.37404844760894773
Tensor of size [100000000] on cuda: 0.0021645784378051757
Tensor of size [10000, 10000] on cuda: 0.002090191841125488
Tensor of size [1000, 1000, 100] on cuda: 0.00208127498626709
Tensor of size [100, 100, 100, 100] on cuda: 0.0020844221115112306

After:

Tensor of size [100000000] on cpu: 0.1339871883392334
Tensor of size [10000, 10000] on cpu: 0.1343991994857788
Tensor of size [1000, 1000, 100] on cpu: 0.1346735954284668
Tensor of size [100, 100, 100, 100] on cpu: 0.11906447410583496
Tensor of size [100000000] on cuda: 0.0013531208038330077
Tensor of size [10000, 10000] on cuda: 0.0012922048568725585
Tensor of size [1000, 1000, 100] on cuda: 0.001285886764526367
Tensor of size [100, 100, 100, 100] on cuda: 0.0012899160385131836
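
To make the before/after comparison easier to read, the printed lines can be parsed and turned into speedup ratios. The sketch below is an illustration only: `parse` and the `LINE` pattern are hypothetical helpers that assume the exact output format of the script above, and only the 1-D rows are pasted in for brevity. For those rows the ratio comes out at roughly 2.7x on CPU and 1.6x on CUDA.

```python
import re

# Matches lines such as "Tensor of size [10000, 10000] on cpu: 0.1343991994857788".
LINE = re.compile(r"Tensor of size (\[.*?\]) on (\w+): ([0-9.eE+-]+)")


def parse(report: str) -> dict:
    """Map (size, device) to the mean seconds printed by the benchmark script."""
    out = {}
    for m in LINE.finditer(report):
        out[(m.group(1), m.group(2))] = float(m.group(3))
    return out


before = parse("""
Tensor of size [100000000] on cpu: 0.36041643619537356
Tensor of size [100000000] on cuda: 0.0021645784378051757
""")
after = parse("""
Tensor of size [100000000] on cpu: 0.1339871883392334
Tensor of size [100000000] on cuda: 0.0013531208038330077
""")

# A ratio above 1 means the "after" build is faster for that row.
speedup = {key: before[key] / after[key] for key in before}
for (size, device), ratio in speedup.items():
    print(f"{size} on {device}: {ratio:.2f}x faster")
```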

cc: @VitalyFedyunin

@dr-ci

dr-ci Bot commented Jun 12, 2020

💊 CI failures summary and remediations

As of commit 552f160 (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (1/4)

Step: "Test"

RuntimeError: test_nn failed!
  File "C:\Jenkins\Miniconda3\lib\site-packages\scipy\_distributor_init.py", line 61, in <module> 
    WinDLL(os.path.abspath(filename)) 
  File "C:\Jenkins\Miniconda3\lib\ctypes\__init__.py", line 348, in __init__ 
    self._handle = _dlopen(self._name, mode) 
OSError: [WinError 126] The specified module could not be found 
Traceback (most recent call last): 
  File "run_test.py", line 726, in <module> 
    main() 
  File "run_test.py", line 719, in main 
    raise RuntimeError(message) 
RuntimeError: test_nn failed! 
 
(base) circleci@PACKER-5ECD3242 C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1  
+ cleanup
+ retcode=1
+ set +x

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (2/4)

Step: "Test"

RuntimeError: test_cuda failed!
  File "C:\Jenkins\Miniconda3\lib\site-packages\scipy\_distributor_init.py", line 61, in <module> 
    WinDLL(os.path.abspath(filename)) 
  File "C:\Jenkins\Miniconda3\lib\ctypes\__init__.py", line 348, in __init__ 
    self._handle = _dlopen(self._name, mode) 
OSError: [WinError 126] The specified module could not be found 
Traceback (most recent call last): 
  File "run_test.py", line 726, in <module> 
    main() 
  File "run_test.py", line 719, in main 
    raise RuntimeError(message) 
RuntimeError: test_cuda failed! 
 
(base) circleci@PACKER-5EE89583 C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1  
+ cleanup
+ retcode=1
+ set +x

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_on_cpu_test1 (3/4)

Step: "Test"

RuntimeError: test_nn failed!
  File "C:\Jenkins\Miniconda3\lib\site-packages\scipy\_distributor_init.py", line 61, in <module> 
    WinDLL(os.path.abspath(filename)) 
  File "C:\Jenkins\Miniconda3\lib\ctypes\__init__.py", line 348, in __init__ 
    self._handle = _dlopen(self._name, mode) 
OSError: [WinError 126] The specified module could not be found 
Traceback (most recent call last): 
  File "run_test.py", line 726, in <module> 
    main() 
  File "run_test.py", line 719, in main 
    raise RuntimeError(message) 
RuntimeError: test_nn failed! 
 
(base) circleci@PACKER-5EE89590 C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1  
+ cleanup
+ retcode=1
+ set +x

See CircleCI build pytorch_windows_vs2019_py36_cpu_test1 (4/4)

Step: "Test"

RuntimeError: test_nn failed!
  File "C:\Jenkins\Miniconda3\lib\site-packages\scipy\_distributor_init.py", line 61, in <module> 
    WinDLL(os.path.abspath(filename)) 
  File "C:\Jenkins\Miniconda3\lib\ctypes\__init__.py", line 348, in __init__ 
    self._handle = _dlopen(self._name, mode) 
OSError: [WinError 126] The specified module could not be found 
Traceback (most recent call last): 
  File "run_test.py", line 726, in <module> 
    main() 
  File "run_test.py", line 719, in main 
    raise RuntimeError(message) 
RuntimeError: test_nn failed! 
 
(base) circleci@PACKER-5EE89590 C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1  
+ cleanup
+ retcode=1
+ set +x


This comment has been revised 31 times.

@ShawnZhong ShawnZhong force-pushed the std_var branch 2 times, most recently from eaacf05 to ac79193 Compare June 12, 2020 23:20
@ShawnZhong ShawnZhong changed the title [WIP] Migrate var & std to ATen [WIP][DO NOT REVIEW] Migrate var & std to ATen Jun 12, 2020
@ShawnZhong ShawnZhong changed the title [WIP][DO NOT REVIEW] Migrate var & std to ATen Migrate var & std to ATen Jun 13, 2020
@ShawnZhong ShawnZhong marked this pull request as ready for review June 13, 2020 03:10
@ShawnZhong ShawnZhong marked this pull request as draft June 13, 2020 03:27
@ShawnZhong ShawnZhong marked this pull request as ready for review June 13, 2020 06:29
@VitalyFedyunin VitalyFedyunin self-requested a review June 22, 2020 00:27
Contributor

@VitalyFedyunin VitalyFedyunin left a comment


Please rebase

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@robieta
Contributor

robieta commented Jun 23, 2020

@ShawnZhong @VitalyFedyunin @ngimel
This PR significantly regresses single-threaded CPU performance. (I do see speedups when multi-threading is enabled.) Results from Shawn's script with torch.set_num_threads(1) added at the start:

c4fc278

Tensor of size [100000000] on cpu: 0.18599200248718262
Tensor of size [10000, 10000] on cpu: 0.17864339351654052
Tensor of size [1000, 1000, 100] on cpu: 0.1743138313293457
Tensor of size [100, 100, 100, 100] on cpu: 0.1824030637741089
Tensor of size [100000000] on cuda: 0.0017017841339111329
Tensor of size [10000, 10000] on cuda: 0.001610112190246582
Tensor of size [1000, 1000, 100] on cuda: 0.00161285400390625
Tensor of size [100, 100, 100, 100] on cuda: 0.0016181468963623047

7a3c223 (This PR)

Tensor of size [100000000] on cpu: 0.8823971509933471
Tensor of size [10000, 10000] on cpu: 0.8826733112335206
Tensor of size [1000, 1000, 100] on cpu: 0.8823000669479371
Tensor of size [100, 100, 100, 100] on cpu: 0.882302713394165
Tensor of size [100000000] on cuda: 0.0011995553970336914
Tensor of size [10000, 10000] on cuda: 0.001114821434020996
Tensor of size [1000, 1000, 100] on cuda: 0.0011154413223266602
Tensor of size [100, 100, 100, 100] on cuda: 0.001114058494567871

My testing on GPU agrees that this is generally an improvement, though there are some cases with regressions. (#38338 will soon be updated with the script that I used to benchmark this PR.)

Unfortunately, we may need to revert this PR since the impact on single-threaded CPU speed is quite severe.
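
The size of the regression can be summarized with a geometric mean of the per-shape time ratios. The sketch below copies the eight timings from each run above; `geomean_ratio` is a hypothetical helper, not something used in this PR. On these numbers the single-threaded CPU time is roughly 4.7x to 5.1x the baseline depending on shape, while the CUDA time drops to about 0.7x.

```python
import math

# Mean seconds per call, copied from the two runs above (torch.set_num_threads(1)).
baseline = {  # c4fc278
    "cpu": [0.18599200248718262, 0.17864339351654052,
            0.1743138313293457, 0.1824030637741089],
    "cuda": [0.0017017841339111329, 0.001610112190246582,
             0.00161285400390625, 0.0016181468963623047],
}
this_pr = {  # 7a3c223
    "cpu": [0.8823971509933471, 0.8826733112335206,
            0.8823000669479371, 0.882302713394165],
    "cuda": [0.0011995553970336914, 0.001114821434020996,
             0.0011154413223266602, 0.001114058494567871],
}


def geomean_ratio(new, old):
    """Geometric mean of new/old; a value above 1 means the new build is slower."""
    logs = [math.log(n / o) for n, o in zip(new, old)]
    return math.exp(sum(logs) / len(logs))


for device in baseline:
    ratio = geomean_ratio(this_pr[device], baseline[device])
    print(f"{device}: {ratio:.2f}x the baseline time")
```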

@VitalyFedyunin
Contributor

@robieta sounds reasonable. Let me revert it first; afterwards we can either fix the single-threaded case (if that is quick) or at least land the GPU part first.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Pull Request resolved: pytorch#39967

Differential Revision: D22162469

Pulled By: VitalyFedyunin

fbshipit-source-id: 8d901c779767b00f81cd6231bc665e04f297b4c3