Using a Subclassed Tensor Results in Significant Decrease in Training GPU Throughput

### 🐛 Describe the bug

Using a subclassed tensor significantly decreases training GPU throughput: roughly 150-400 fewer images/second with a torchvision ResNet50, depending on GPU and settings.

@tcapelle and I discovered that using a simple passthrough `torch.Tensor` subclass:

```python
import torch

class SubClassedTensor(torch.Tensor):
    pass
```
results in a ~150-200 images/second decrease in GPU throughput on both Volta and Ampere GPUs, compared to the same training step on a plain `torch.Tensor` batch. With the channels-last memory format, the throughput difference increases to ~370-400 images/second on a 3080 Ti. Both comparisons use a torchvision ResNet50, 224px image size, a batch size of 64, and mixed precision.
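For anyone who wants to try the comparison without our full training script, here is a minimal sketch of the measurement. It is not the linked script: the tiny stand-in model, the `images_per_second` helper, the batch shape, and the step count are all assumptions made for illustration, and mixed precision and channels last are left out to keep it short. It falls back to CPU when no GPU is available, where the numbers are not meaningful for this issue.

```python
import time

import torch
import torch.nn as nn


class SubClassedTensor(torch.Tensor):
    """Passthrough subclass: adds no behavior of its own."""


def images_per_second(model, batch, steps=5):
    """Time forward + backward passes and return images processed per second.

    Hypothetical helper for this sketch, not part of the linked script.
    """
    on_cuda = batch.device.type == "cuda"
    if on_cuda:
        torch.cuda.synchronize()  # make sure earlier GPU work is finished
    start = time.perf_counter()
    for _ in range(steps):
        model.zero_grad(set_to_none=True)
        loss = model(batch).float().mean()
        loss.backward()
    if on_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    elapsed = time.perf_counter() - start
    return steps * batch.shape[0] / elapsed


device = "cuda" if torch.cuda.is_available() else "cpu"

# Tiny stand-in model (assumption); the runs in this issue used a torchvision ResNet50.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).to(device)

plain = torch.randn(8, 3, 32, 32, device=device)
# Same data, viewed as the passthrough subclass.
subclassed = plain.as_subclass(SubClassedTensor)

plain_ips = images_per_second(model, plain)
sub_ips = images_per_second(model, subclassed)
print(f"torch.Tensor: {plain_ips:.1f} img/s, SubClassedTensor: {sub_ips:.1f} img/s")
```

To get closer to the reported setup, swap in ResNet50, a 64x3x224x224 batch, and more steps with a few warm-up iterations before timing.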

The performance decrease appears to happen for any subclassed tensor, including fastai's [TensorBase](https://docs.fast.ai/torch_core.html#TensorBase) derived tensors (fastai has a workaround in progress for this issue).
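One possible mitigation, sketched below, is to view the batch as a plain `torch.Tensor` right before it reaches the model. This is only an illustration and not necessarily the approach fastai is taking; `to_plain_tensor` is a hypothetical helper name. `Tensor.as_subclass` shares the underlying storage, so no data is copied, but any attributes stored on the subclass instance are not carried over to the plain view.

```python
import torch


class SubClassedTensor(torch.Tensor):
    pass


def to_plain_tensor(batch):
    """Return a plain torch.Tensor view of `batch` that shares its storage.

    Hypothetical helper for this sketch.
    """
    return batch.as_subclass(torch.Tensor)


x = torch.randn(4, 3, 8, 8).as_subclass(SubClassedTensor)
plain = to_plain_tensor(x)

# Ops on the subclass keep returning the subclass; ops on the view return torch.Tensor.
print(type(x + 1).__name__, type(plain + 1).__name__)
```

The model would then be called as `model(to_plain_tensor(batch))`, with the subclass kept around for any code that depends on its type.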

Our expectation is that training with subclassed tensors would have the same performance as `torch.Tensor`.

Our training script can be found [here](https://github.com/tcapelle/ch_last/blob/main/train_pets.py). We also created [charts](https://wandb.ai/fastai/channels_last/reports/Subclassed-Tensor-vs-torch-Tensor-GPU-Throughput--VmlldzoyMTQ4OTkw) from our logged test runs.

@tcapelle can correct me if I am wrong, but I believe our V100 runs used the latest PyTorch 1.11 Docker image.

I embedded one of the training logs below.

![image](https://user-images.githubusercontent.com/51142400/173154712-4e277f28-dc44-4e94-a35d-53950e98edfd.png)

### Versions

The Ampere 3080 Ti runs used the following versions.

PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.13.0-40-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Ti
Nvidia driver version: 510.60.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.11.0
[pip3] torchelastic==0.2.2
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.1            py38hd3c417c_0  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.21.2           py38h20f2e39_0  
[conda] numpy-base                1.21.2           py38h79a1101_0  
[conda] pytorch                   1.11.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchtext                 0.12.0                     py38    pytorch
[conda] torchvision               0.12.0               py38_cu113    pytorch

cc @VitalyFedyunin @ngimel @ezyang

Using a Subclassed Tensor Results in Significant Decrease in Training GPU Throughput #79321
