Bad performance with python threads #37259

@SSS135

Description

🐛 Bug

concurrent.futures.ThreadPoolExecutor with max_workers=1 loads all CPU cores (24 cores / 100% load) when torch.zeros(64, 8, 2, 128) is used inside its thread. If torch.zeros is used with a smaller size, such as (16, 8, 2, 128), there is no performance hit. This also sometimes happens with torch.logsumexp, tensor indexing, and tensor copying.
Edit: multiprocessing.dummy.Pool is also affected
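A possible workaround (this matches the "has workaround" label, but is a hedged sketch, not a confirmed fix): in some OpenMP/MKL builds the thread-count setting is per-thread, so calling torch.set_num_threads(1) in the main thread may not apply to pool workers. Setting it inside the worker thread itself, via the executor's initializer parameter (available since Python 3.7), may avoid the spin-up:

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def _init_worker():
    # Apply the intra-op thread limit inside the worker thread itself;
    # with some OpenMP/MKL builds this setting is per-thread, so setting
    # it only in the main thread may not affect pool workers.
    torch.set_num_threads(1)

executor = ThreadPoolExecutor(max_workers=1, initializer=_init_worker)

def train_async(data):
    with torch.no_grad():
        return torch.zeros(64, 8, 2, 128).shape

shape = executor.submit(train_async, []).result()
print(shape)
```

Whether this actually prevents the multi-core load on the affected build is untested here; it only guarantees the worker thread sees the limit.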

To Reproduce

Run this code:

import torch
torch.set_num_threads(1)
torch.set_num_interop_threads(1)
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def train_async(data):
    with torch.no_grad():
        torch.zeros(64, 8, 2, 128)
            
while True:
    executor.submit(train_async, []).result()
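To quantify the problem without watching a task manager, one can compare process CPU time to wall-clock time: a ratio well above 1.0 means more than one core is busy. A small diagnostic sketch (cpu_utilization is a hypothetical helper, not part of the repro):

```python
import os
import time
import torch

def cpu_utilization(fn, iters=200):
    # Ratio of process CPU time to wall time over `iters` calls.
    # ~1.0 means a single busy core; much higher suggests extra
    # threads are spinning.
    t0_wall = time.perf_counter()
    t0_cpu = time.process_time()
    for _ in range(iters):
        fn()
    wall = time.perf_counter() - t0_wall
    cpu = time.process_time() - t0_cpu
    return cpu / wall if wall > 0 else float("nan")

ratio = cpu_utilization(lambda: torch.zeros(64, 8, 2, 128))
print(f"CPU/wall ratio: {ratio:.2f} (cores available: {os.cpu_count()})")
```

On an affected build, running this from inside the pool's worker thread should show the inflated ratio; the expected behavior below corresponds to a ratio near 1.0.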

Expected behavior

Only single CPU core being utilized.

Environment

PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 442.19
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip] numpy==1.17.4
[pip] numpydoc==0.9.1
[pip] ppo-pytorch==0.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.1.243             h74a9793_0
[conda] mkl                       2019.4                      245
[conda] mkl-service               2.3.0            py37hb782905_0
[conda] mkl_fft                   1.0.15           py37h14836fe_0
[conda] mkl_random                1.1.0            py37h675688f_0
[conda] numpy                     1.17.4           py37h4320e6b_0
[conda] numpy-base                1.17.4           py37hc3f5095_0
[conda] numpydoc                  0.9.1                      py_0
[conda] ppo-pytorch               0.1                       dev_0    <develop>
[conda] pytorch                   1.5.0           py3.7_cuda101_cudnn7_0    pytorch
[conda] pytorch-qrnn              0.2.1                    pypi_0    pypi
[conda] torchvision               0.6.0                py37_cu101    pytorch

cc @ezyang @gchanan @zou3519 @VitalyFedyunin @ngimel

Metadata

Labels

has workaround
high priority
module: multithreading (Related to issues that occur when running on multiple CPU threads)
module: performance (Issues related to performance, either of kernel code or framework glue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
