
3x perf slow down in nightly build Torch 2.0.0.dev2023xxxx+cu118 #92288

@aifartist


🐛 Describe the bug

This GitHub repo doesn't have a Discussions tab like AUTOMATIC1111's does, so I'll file this here. Forgive me if this is the wrong place.

Stable Diffusion (A1111) image generation with typical defaults: 20 steps, euler_a, simple prompts, SD 2.1 512 model.
Using the Linux nightly torch 2.0 build on my 4090, I only get about 11 to 13 it/s.
With the Windows nightly torch 2.0 build, a 4090 gives about 35 to 38 it/s.
I have multiple confirmations of this from other folks.

However, if you build PyTorch locally on Linux you get about a 3x perf increase, matching the perf seen on Windows.
Today an ex-CTO of a cloud company with GPU resources contacted me to try this on a cloud server he loaned me. It also sped up his 4090, and he will test an A4000 GPU tomorrow.

As a suggestion, you might check whether architecture sm_89 is among the selected architectures listed in the Linux build output.
If there were a simple standalone Python inference perf test I'd be willing to run it as a repro, but my repro is the entirety of SD AUTOMATIC1111; I have no simple standalone PyTorch perf test. Let me know how I can help. Good night.
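A minimal sketch of such a check (not the actual repro, and assuming a nightly or local wheel is installed): print the CUDA architecture list the wheel was compiled for via `torch.cuda.get_arch_list()` (sm_89 should appear on a build that targets the 4090/Ada), and time a crude fp16 matmul loop as a rough stand-in for an it/s benchmark:

```python
import time

def report_build_and_speed():
    """Return the CUDA arch list this torch wheel was built for, plus a rough
    fp16 matmul timing if a GPU is present (a crude stand-in for an it/s test)."""
    try:
        import torch  # assumes a nightly or locally built wheel is installed
    except ImportError:
        return "torch not installed"
    lines = [f"torch {torch.__version__}",
             # sm_89 should appear here for an Ada-generation GPU like the 4090
             f"CUDA arch list: {torch.cuda.get_arch_list()}"]
    if torch.cuda.is_available():
        x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(50):
            x = (x @ x) / 64.0  # rescale so fp16 values stay bounded
        torch.cuda.synchronize()
        lines.append(f"50 fp16 4096x4096 matmuls: {time.perf_counter() - t0:.3f}s")
    return "\n".join(lines)

if __name__ == "__main__":
    print(report_build_and_speed())
```

If the slow wheel is missing sm_89 from its arch list, kernels would run as PTX JIT-compiled for an older architecture, which could plausibly explain a gap like this; the matmul timing alone won't reproduce the full SD pipeline numbers.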

Versions

CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.17.0-1019-oem-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA Graphics Device
Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.3
[pip3] open-clip-torch==2.7.0
[pip3] pytorch-lightning==1.7.6
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230113+cu118
[pip3] torchdiffeq==0.2.3
[pip3] torchmetrics==0.11.0
[pip3] torchsde==0.2.5
[pip3] torchvision==0.15.0.dev20230116+cu118
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @ngimel @peterjc123 @mszhanyi @skyline75489 @nbcsm @csarofeen @ptrblck @xwang233 @seemethere @malfet

Metadata

Assignees: no one assigned

Labels:

    high priority
    module: cuda (Related to torch.cuda, and CUDA support in general)
    module: cudnn (Related to torch.backends.cudnn, and cuDNN support)
    module: performance (Issues related to performance, either of kernel code or framework glue)
    module: windows (Windows support for PyTorch)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Status: Done