🐛 Describe the bug
Using a subclassed tensor results in a significant decrease in training GPU throughput: roughly 150-400 images/second with a torchvision ResNet50, depending on GPU and settings.
@tcapelle and I discovered that using a simple passthrough torch.Tensor subclass:
class SubClassedTensor(torch.Tensor):
    pass
results in a ~150-200 images/second decrease in GPU throughput across Volta and Ampere generations compared to a training step on a torch.Tensor batch. Using channels last format, the throughput difference increases to ~370-400 images/second on a 3080 Ti. Both examples use a torchvision ResNet50, 224px image size, batch size of 64, and mixed precision.
The performance decrease appears to happen for any subclassed tensor, including fastai's TensorBase-derived tensors (fastai has a workaround in progress for this issue).
Our expectation is that training with subclassed tensors would have the same performance as torch.Tensor.
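For anyone who wants a quick check without the full training setup, here is a minimal CPU sketch. It is not our training script. It only times a single elementwise op on a plain torch.Tensor versus the passthrough subclass. The helper name `time_op`, the 8x8 tensor size, and the iteration count are arbitrary choices for illustration. Our assumption is that any per-op difference it shows comes from subclass calls being routed through the default `__torch_function__` handling, which may or may not account for the full GPU throughput gap reported above.

```python
import time

import torch


class SubClassedTensor(torch.Tensor):
    # Passthrough subclass, same as in the report above.
    pass


def time_op(x, iters=1000):
    """Return the average seconds per `x + 1` call (hypothetical helper)."""
    for _ in range(10):  # warm-up iterations, not timed
        x + 1
    start = time.perf_counter()
    for _ in range(iters):
        x + 1
    return (time.perf_counter() - start) / iters


base = torch.randn(8, 8)
# View the same data as the subclass without copying it.
sub = base.as_subclass(SubClassedTensor)

# The default __torch_function__ keeps the subclass type on op results.
out = sub + 1
print(type(out).__name__)

print(f"torch.Tensor:     {time_op(base) * 1e6:.2f} us/op")
print(f"SubClassedTensor: {time_op(sub) * 1e6:.2f} us/op")
```

The tensor is kept small on purpose so that Python-side overhead dominates the measurement; absolute numbers will vary by machine and PyTorch version.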
Our training script can be found here. We also created charts from our logged test runs.
@tcapelle can correct me if I am wrong, but I believe our V100 runs used the latest PyTorch 1.11 docker image.
I embedded one of the training logs below.

Versions
The Ampere 3080 Ti runs used the following versions.
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27
Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.13.0-40-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Ti
Nvidia driver version: 510.60.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.11.0
[pip3] torchelastic==0.2.2
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 ha36c431_9 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.21.2 py38h20f2e39_0
[conda] numpy-base 1.21.2 py38h79a1101_0
[conda] pytorch 1.11.0 py3.8_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchtext 0.12.0 py38 pytorch
[conda] torchvision 0.12.0 py38_cu113 pytorch
cc @VitalyFedyunin @ngimel @ezyang