
update torch base environment#9191

Merged
AUTOMATIC1111 merged 5 commits into AUTOMATIC1111:dev from vladmandic:torch
Apr 29, 2023

Conversation

@vladmandic (Collaborator) commented Mar 30, 2023

this pr is a single-step update of the pytorch base environment:

  • from torch 1.13.1 with cuda 11.7 and cudnn 8.5.0
  • to torch 2.0.0 with cuda 11.8 and cudnn 8.7.0

this allows usage of the sdp cross-attention optimization and better multi-gpu support with accelerate, and it avoids a large number of performance issues caused by broken cudnn in some environments

it updates all required packages, but avoids any prereleases:

  • torchvision (plus silence future deprecation warning)
  • xformers (update follows torch)
  • accelerate (required to support new torch)
  • numpy (update of numpy is required by new accelerate)

note:

  • since accelerate changed the format of its config file, run accelerate config once to avoid (non-critical) warnings
  • colab updated to torch 2.0, so having webui still use older torch causes issues for users running webui in hosted environments

yes, updating torch is a major step, but it will have to be done sooner or later, as there are more and more reports of issues installing the old torch version
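For context on the sdp optimization mentioned above: torch 2.0 ships scaled dot-product attention natively as torch.nn.functional.scaled_dot_product_attention. The following is a pure-Python sketch of the underlying math only (illustrative, not webui code):

```python
import math

def sdp_attention(q, k, v):
    """Sketch of the math behind scaled dot-product attention:
    softmax(q @ k.T / sqrt(d)) @ v, which torch 2.0 exposes natively
    as torch.nn.functional.scaled_dot_product_attention."""
    d = len(q[0])
    out = []
    for qrow in q:
        # scaled dot products of this query against every key
        scores = [sum(a * b for a, b in zip(qrow, krow)) / math.sqrt(d)
                  for krow in k]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted sum of the value rows
        out.append([sum(w * vrow[j] for w, vrow in zip(weights, v))
                    for j in range(len(v[0]))])
    return out
```

The optimized torch kernel computes the same result but fused and on-GPU, which is what makes it a practical replacement for xformers on most hardware.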

@@ -1,3 +1,4 @@
astunparse
Collaborator

Any particular reason this is included?

Collaborator Author

it's an indirect dependency that can cause runtime errors if the user has tensorflow installed: new transformers will try to check it and then complain that astunparse is missing, even if tensorflow is not used.

 File "/home/vlado/.local/lib/python3.10/site-packages/transformers/utils/generic.py", line 33, in <module>
    import tensorflow as tf
...
    from tensorflow.python.autograph.pyct import parser
  File "/home/vlado/.local/lib/python3.10/site-packages/tensorflow/python/autograph/pyct/parser.py", line 29, in <module>
    import astunparse
ModuleNotFoundError: No module named 'astunparse'

Contributor

The TensorFlow pip packages should automatically install astunparse as a required dependency as listed in their package metadata and same with setup.py when installing from source.

Metadata-Version: 2.1
Name: tensorflow
Version: 2.12.0
...
Requires-Dist: astunparse (>=1.6.0)
...
Metadata-Version: 2.1
Name: tensorflow-intel
Version: 2.12.0
...
Requires-Dist: astunparse (>=1.6.0)
...

So this seems redundant when the webui base itself doesn't even install TensorFlow; only some extensions do. Nonetheless, if you feel this is really needed, it would be wise to pin the same version range as TensorFlow itself.

Collaborator Author

The TensorFlow pip packages should automatically install astunparse as a required dependency as listed in their package metadata and same with setup.py when installing from source.

yup, but there was a buggy installer in tf 2.11, and 2.12 only came out very recently.

it would be wise to pin the same version range as TensorFlow itself.

i hate specifying a version range where it's absolutely not needed - the last version of astunparse is from 2019, so it's unlikely that a brand new breaking version is going to pop up out of the blue.

So this seems redundant when the webui base itself doesn't even install TensorFlow, only some extensions do.

very true. i'm just trying to make it as error-proof as possible for average users, as i've seen this happen on multiple systems.

if you feel this is really needed

i don't - i'm just trying to make it error-proof - but i'm open to suggestions
(and to removing astunparse from the dependencies if desired).

Contributor

there was a buggy installer in tf 2.11

If this is the case, then maybe astunparse could be moved to launch.py and be conditional on tensorflow being installed? Something like:

    if is_installed("tensorflow") or is_installed("tensorflow-intel") or is_installed("tensorflow-gpu"):
        if not is_installed("astunparse"):
            run_pip("install astunparse", "astunparse")

Though astunparse is a tiny package which hasn't been updated since 2019, so it probably isn't a big deal if it gets mistakenly installed when not needed. I'll leave judgement up to you and automatic, since I have no firm opinion about this.
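The is_installed helper referenced in the snippet above lives in webui's launch.py; a minimal stdlib approximation of the check (a sketch, not the actual implementation) could look like:

```python
import importlib.util

def is_installed(package: str) -> bool:
    # find_spec returns None when the top-level package cannot be found,
    # without actually importing it (so it is cheap and side-effect free)
    return importlib.util.find_spec(package) is not None
```

This is also why the conditional-install approach works at startup: the check succeeds or fails before anything tries to import tensorflow.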

@glass-ships

glass-ships commented Mar 30, 2023

Testing these changes out - things seem to work "out of the box", but I still get the "No module 'xformers'. Proceeding without it" message when starting up. Not sure if this can be ignored, as it's seemingly included in torch 2.

@vladmandic
Collaborator Author

vladmandic commented Mar 30, 2023

but I still get the "No module 'xformers'. Proceeding without it" message when starting up

it's not "included"; it's just no longer necessary given that the new sdp attention is available
(depending on the use-case, low-end gpus are still better with xformers).

the remaining message comes from an external repo - repositories/stable-diffusion-stability-ai - and removing the warning would cause that repo to get out-of-sync. and unfortunately, it's posted not with a logger, so it could be filtered out, but with a simple print statement.
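Because the warning is a bare print rather than a logging call, the only way a wrapper could hide it is by capturing stdout. A hypothetical filter sketch (not part of webui; function name and default message are illustrative):

```python
import contextlib
import io

def run_without_message(fn, *args, suppress="No module 'xformers'"):
    # capture everything fn prints, then re-emit all lines except the
    # unwanted one; needed because a bare print bypasses logging filters
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        fn(*args)
    for line in buf.getvalue().splitlines():
        if suppress not in line:
            print(line)
```

With a logger the same effect would be one logging.Filter; stdout capture is the blunt fallback.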

@FurkanGozukara

have you tested these changes on unix? runpod?

@vladmandic
Collaborator Author

vladmandic commented Mar 30, 2023

runpod

linux yes. runpod no. there are thousands of gpu cloud providers, cannot test each one like that.

@drax-xard

Yes, at some point will have to migrate to torch 2.0 since newer xformer wheels require it.

@FurkanGozukara

FurkanGozukara commented Mar 30, 2023

runpod

linux yes. runpod no. there are thousands of gpu cloud providers, cannot test each one like that.

ok list me 20 :)

anyway, i am just saying that covering as many widely used scenarios as possible is good

Yes, at some point will have to migrate to torch 2.0 since newer xformer wheels require it.

correct, and i solved this problem by downloading and re-uploading the torch 1 wheel 0.0.18dev489. they are also still compiling them, thankfully. i think automatic1111 can do it the same way; the wheel and such things can be hosted on hugging face, i think. currently they removed all the 0.0.14 and 0.0.17 wheels for torch 1 from pip installation.

@Cyberbeing
Contributor

Yes, at some point will have to migrate to torch 2.0 since newer xformer wheels require it.

This is true only for wheels posted to pypi. You can find a wide range of pre-built xformers wheels in their Github action artifacts, if you still need a wheel for older torch. Not as simple as keeping up to date via pypi, but useful in a pinch.

[Screenshot: list of available xformers wheel build artifacts]

Just keep in mind you need to be logged into Github to download artifacts.

@vladmandic
Collaborator Author

Yes, at some point will have to migrate to torch 2.0 since newer xformer wheels require it.

Just keep in mind you need to be logged into Github to download artifacts.

sorry, can we use discussions for this and keep pr comments as pr comments? i'd love to collect/implement anything that's required, but this is not pr related at all.

- cudatoolkit=11.8
- pytorch=2.0
- torchvision=0.15
- numpy=1.23
Contributor

@vladmandic Didn't you say Torch 2.0 requires Numpy 1.24+ instead of 1.23?

Collaborator Author

@vladmandic vladmandic Mar 31, 2023

Beta did, but they relaxed it for GA.
And some other dependencies require 1.23 and are not compatible with 1.24, so it's not that clean to use 1.24 just yet. Thus my recommendation of the latest from 1.23.
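The "latest from 1.23" recommendation corresponds to a requirements pin like numpy>=1.23,<1.24. A tiny hypothetical checker for that kind of major.minor constraint (illustrative only, not webui code):

```python
def matches_minor_pin(version: str, pin: str = "1.23") -> bool:
    # hypothetical helper: accept any patch release within the pinned
    # major.minor series, i.e. "latest from 1.23" but never 1.24+
    parts = version.split(".")
    return ".".join(parts[:2]) == pin
```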

@EfourC

EfourC commented Apr 3, 2023

I'm running into a really strange problem. Any advice on how I should narrow down the root cause?

Edit: Oops, forgot to say my startup arguments:
--xformers --opt-channelslast --no-half-vae

I was just trying this PR out (as-is, plus exporting xformers==0.0.18 in launch.py).

Everything upgraded and ran smoothly for the most part, but when I tried to generate a larger image (e.g. 1024x1024), I realized there is a problem for me -- it seemed to hang at 100% GPU for 4 minutes! The sampler steps had completed, but the image had not been saved to file yet. Then after 4 minutes, the image finally completed and saved.

After this, the problem goes away if I generate more at the same resolution (until the WebUI process is restarted).

However, if I change the image resolution to anything different, e.g. to 1024x1088, the very same delay happens, and again only for the first run at that resolution.

After investigating, I realized there were also delays for smaller images, but the delay grows exponentially as resolution scales up.

Here is a quick table showing times I measured.
Note: These measurements also include the typical 'warm-up' time before steps progress, which was already a little annoying. After warm-up, the gen time for small images is much faster.

First Runs w/ 5 steps (Euler a):

Gen Time = time for all steps completed

| Size | Gen Time | Total Time |
| --- | --- | --- |
| 512x512 | 0:03 | 0:04 |
| 640x640 | 0:01 | 0:06 |
| 704x704 | 0:02 | 0:08 |
| 768x768 | 0:02 | 0:09 |
| 832x832 | 0:03 | 0:11 |
| 896x832 | 0:03 | 0:12 |
| 896x896 | 0:03 | 2:23 |
| 1024x1024 | 0:04 | 4:14 |

[Screenshots of the console output]

System: Win10, RTX 2070S 8GB, Intel 3770k

[Screenshot of version info]

@Sakura-Luna
Collaborator

@EfourC Very strange problem. Since the recent update is not stable, you could check whether this problem reproduces on an older version, for example git checkout a9fed7c.

@EfourC

EfourC commented Apr 3, 2023

With that commit, and the current master, the problem doesn't happen -- I only see it after everything is upgraded for Torch2.

Behavior OK:
[Screenshots]

@EfourC

EfourC commented Apr 3, 2023

I did some more permutations of testing, especially to see if --opt-sdp-attention instead of xformers made a difference with the Torch 2 venv.

What I found out is that the problem is actually --opt-channelslast causing the massive delay for me with Torch 2.

Both of these startup args work ok:
--xformers --no-half-vae
--opt-sdp-attention --no-half-vae

Using --opt-channelslast with either of the above creates the problem delay for me.

I haven't looked at (or previously used) any of the other performance optimization switches, but it's probably worth people trying them out on different types of systems (since I blundered into an issue with this one).
https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Command-Line-Arguments-and-Settings
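For context on what --opt-channelslast does: it asks torch to keep 4-D activations in NHWC (channels-last) memory layout, via tensor.to(memory_format=torch.channels_last), instead of the default NCHW. A pure-Python sketch of the index permutation involved (illustrative only, not webui code):

```python
def nchw_to_nhwc(x):
    """Reorder a nested-list "tensor" from NCHW (torch's default
    contiguous layout) to NHWC (the layout --opt-channelslast requests
    via tensor.to(memory_format=torch.channels_last))."""
    n_, c_, h_, w_ = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[x[n][c][h][w] for c in range(c_)]  # channel becomes innermost
              for w in range(w_)]
             for h in range(h_)]
            for n in range(n_)]
```

In torch this is only a strides change, not a copy like here, but cudnn has to pick NHWC-capable kernels for it to pay off - which may be related to the first-run delays reported above.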

@mariaWitch

Honestly, if there is going to be a move to Torch 2.0.0, it should wait until after Torch 2.0.1 is released, as there is currently a major bug that made it into GA and breaks compatibility with WebUI when using torch.compile.
See: pytorch/pytorch#97862 and pytorch/pytorch#93405

@vladmandic
Collaborator Author

I'm aware of that issue, but WebUI does not use torch.compile on its own, and anyone experienced enough to use it would hand-pick the torch version manually anyhow.

Torch 2.1 has no benefits for the normal WebUI user. And the existing Torch 1.13 is showing its teeth with quite a few install issues lately.

Whole point of the PR is not to enable experimental use, but to make it simpler for normal users.

@vladmandic
Collaborator Author

fyi, i initially updated xformers to 0.0.18, but there are frequent reports of NaN values, especially during hires operations, so i've downgraded to 0.0.17. performance-wise, i don't see any major difference, so this is not a big loss. like i said before, the goal of this PR is to get the cleanest out-of-the-box environment where the fewest users have issues, not just to go with the latest & greatest.

@mariaWitch

mariaWitch commented Apr 4, 2023

Torch 2.1 has no benefits for normal WebUI user. And existing Torch 1.13 is showing its teeth with quite a few install issues lately.

Whole point of the PR is not to enable experimental use, but to make it simpler for normal users.

It isn't 2.1; we aren't waiting for a whole major release - Torch 2.0.1 came out of phase 0 yesterday. I still believe that Torch 2.0 should not be merged until the blocking issue upstream is resolved in the next minor update, as I believe PyTorch botched the initial GA release of 2.0, and we shouldn't be running that version of PyTorch until it is more mature.

@vladmandic
Collaborator Author

Torch 2.1 has no benefits for normal WebUI user. And existing Torch 1.13 is showing its teeth with quite a few install issues lately.
Whole point of the PR is not to enable experimental use, but to make it simpler for normal users.

It isn't 2.1, we aren't waiting a whole major release, Torch 2.0.1 came out of phase 0 yesterday. I still believe that Torch 2.0 should not be merged until the blocking issue upstream is resolved in the next minor update as I believe PyTorch botched the initial GA release of 2.0, and we shouldn't be running that version of Pytorch until it is more mature.

and then we'd have to wait for xformers to publish new wheels, etc...
again, torch.compile is not used by webui, so there is no benefit to a standard user in waiting for torch 2.0.1.
and colab upgraded to torch 2.0, and so did many other hosted environments, so right now running webui there requires additional manual steps - which is far more important to resolve than waiting for the "ideal version".

@mariaWitch

and then we'd have to wait for xformers to publish new wheels, etc...
again, torch.compile is not used by webui, so there is no benefit to a standard user in waiting for torch 2.0.1.
and colab upgraded to torch 2.0, and so did many other hosted environments, so right now running webui there requires additional manual steps - which is far more important to resolve than waiting for the "ideal version".

As it stands right now, the only people you are claiming are affected are people using cloud setups, who most likely have already done the manual work to support PyTorch 2.0.0. There is no reason for PyTorch to be upgraded to 2.0.0 when it is very clearly NOT stable. It is not worth risking adding even more bugs to the code base as it currently stands.

@vladmandic
Collaborator Author

very clearly NOT stable

that is a very strong statement. can you substantiate this? all errors i've seen so far have been related to torch.compile, and yes, that feature is pretty much broken.

on the other hand, there are hundreds of users using torch 2.0 with webui without issues.

@mariaWitch

mariaWitch commented Apr 6, 2023

that is a very strong statement. can you substantiate this?

To name a few:
pytorch/pytorch#97031
pytorch/pytorch#97041
pytorch/pytorch#97226
pytorch/pytorch#97576
pytorch/pytorch#97021

And not only that, I disagree with moving to 2.0.0 on principle, as .0.0 software is generally never stable. Waiting for 2.0.1 has no downsides, whereas 2.0.0 is an unstable mess that they are still trying to get stable. The last thing this repo needs is more instability causing more issues to flood in.

@vladmandic
Collaborator Author

that is a very strong statement. can you substantiate this?

To name a few: pytorch/pytorch#97031 pytorch/pytorch#97041 pytorch/pytorch#97226 pytorch/pytorch#97576 pytorch/pytorch#97021

Bugs relevant to WebUI are what matter - why list random things? this is going in the wrong direction.

  • For example, I don't think anyone is trying to run it on a RaspberryPi, so let's stay on topic?
  • And WebUI uses venv, not conda, so conda on osx-64 also doesn't really apply.
  • Or whether torchvision has debug symbols or not? I'd consider that a cosmetic issue at best.
  • Yes, the libnvrtc packaging bug is relevant, but there is no PR associated with it, so should we wait indefinitely?

And not only that, I disagree with moving to 2.0.0 on principle as .0.0 software is generally never stable. Waiting for 2.0.1 or even 2.0.2 has no downsides whereas 2.0.0 is an unstable mess that they are still trying to get stable. The last thing this repo needs is more instability which causes more issues to flood in.

That is a question of personal preference and risk vs reward. The issue is that Torch 1.13 wheels are getting obsoleted in many packages and/or environments, causing failures to install. So what's the solution? Ignore current issues until some unknown time in the future?

PR stands as-is and I've been using Torch 2.0 on my branch for a while now (and users on my branch are not reporting issues relevant to WebUI). We can agree to disagree here.

@mariaWitch

Convolutions being broken for Cuda 11.8 builds specifically affects users who use Pytorch 2.0 and --opt-channelslast. It basically negates any possible performance benefits from that option.


What is the change?
If there is none, why is this here?
It's incredibly confusing.
(BTW: I am a relatively new user, please go easy on me)

Contributor

Under certain conditions, there could be a cache.json.lock file in addition to cache.json.

So this change (appending an asterisk) will cover both files.
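Assuming the pattern is consumed with glob-style matching, the trailing asterisk in cache.json* covers both files; a quick demonstration using Python's glob module:

```python
import glob
import os
import tempfile

# create both files in a scratch directory, then show that the
# "cache.json*" pattern matches the lock file as well as the cache
with tempfile.TemporaryDirectory() as d:
    for name in ("cache.json", "cache.json.lock"):
        open(os.path.join(d, name), "w").close()
    matched = sorted(os.path.basename(p)
                     for p in glob.glob(os.path.join(d, "cache.json*")))

print(matched)  # ['cache.json', 'cache.json.lock']
```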

Collaborator Author

Exactly! And you're fast :)

Contributor

@DGdev91 DGdev91 left a comment

The code for macOS uses the --extra-index-url which was meant for nvidia cards

Contributor

Are you sure "--extra-index-url https://download.pytorch.org/whl/cu118" is actually ok for macOS?

I don't know why that code was using version 1.12.1 instead of a newer one (probably it was just never updated), but probably the right way to install it on macOS is by just using "pip install torch torchvision" without the --extra-index-url, as mentioned on the official website https://pytorch.org/get-started/locally/

@DGdev91
Contributor

DGdev91 commented Apr 11, 2023

Recently PyTorch changed its install command; it now uses --index-url instead of --extra-index-url, as mentioned in #9483

Also, i noticed your code doesn't cover AMD cards (in that case TORCH_COMMAND is set in webui.sh).
But it's fine; i had some problems on my 5700XT on pytorch2 (see #8139) and already covered that part in #9404. i feel it's better to stay on 1.13.1 a while longer, at least on AMD.
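The vendor split described here can be sketched as a small dispatch; this is a hypothetical illustration only (the real logic lives in webui.sh / launch.py via the TORCH_COMMAND variable, and the function name is made up), using the install commands quoted elsewhere in this thread:

```python
def torch_command(gpu: str) -> str:
    """Hypothetical sketch of per-vendor TORCH_COMMAND selection."""
    if gpu == "amd":
        # AMD stays on torch 1.13.1 + ROCm 5.2 until a ROCm release
        # supports torch 2.0 (see #9404 / #8139)
        return ("pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 "
                "--index-url https://download.pytorch.org/whl/rocm5.2")
    # NVIDIA (and default) moves to torch 2.0 with CUDA 11.8
    return ("pip install torch==2.0.0 torchvision "
            "--extra-index-url https://download.pytorch.org/whl/cu118")
```

Users can still override the whole thing by exporting TORCH_COMMAND themselves, which is what the comments below do.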

@sjdevries

Been having a lot of issues trying to get things working with my 5700XT. The new torch 2.0 version failed to generate any images.

Using the latest rocm as below failed to generate images or produce any console output:
export TORCH_COMMAND="pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2"

Using the torch command from the pr works:
export TORCH_COMMAND="pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 --index-url https://download.pytorch.org/whl/rocm5.2"

At least it is working on Commit hash: a9fed7c

@vladmandic
Collaborator Author

rocm 5.6.0 alpha is out and it brings torch 2.0 compatibility, i'd be curious if that works.

@PennyFranklin

5.6.0? maybe that will support my 7900xtx

@DGdev91
Contributor

DGdev91 commented Apr 17, 2023

rocm 5.6.0 alpha is out and it brings torch 2.0 compatibility, i'd be curious if that works.

Really? Where? I see 5.4.3 as the last release on https://github.com/RadeonOpenCompute/ROCm/releases

@vladmandic
Collaborator Author

rocm 5.6.0 alpha is out and it brings torch 2.0 compatibility, i'd be curious if that works.

Really? Where? I see 5.4.3 as the last release on https://github.com/RadeonOpenCompute/ROCm/releases

https://rocmdocs.amd.com/projects/alpha/en/develop/deploy/install.html

@DGdev91
Contributor

DGdev91 commented Apr 18, 2023

rocm 5.6.0 alpha is out and it brings torch 2.0 compatibility, i'd be curious if that works.

Really? Where? I see 5.4.3 as the last release on https://github.com/RadeonOpenCompute/ROCm/releases

https://rocmdocs.amd.com/projects/alpha/en/develop/deploy/install.html

uhm... it doesn't seem that's publicly available

5.6.0? maybe that will support my 7900xtx

There was indeed a docker image for rocm 5.6.0 with 7900xtx support around, but it's now offline, so i guess that code was intended for internal testing and not supposed to be released yet. Anyway, there was a discussion here: #9591

I'm not sure if that works on other gpus like the 5700xt too, but i wouldn't be surprised if pytorch 2.0 starts to work when the next rocm version is released.

I guess for 5700xt users the better choice is sticking to the old 1.13.1 version and waiting for an official rocm release

@AUTOMATIC1111 AUTOMATIC1111 changed the base branch from master to dev April 29, 2023 08:56
@AUTOMATIC1111 AUTOMATIC1111 merged commit 9eb49b0 into AUTOMATIC1111:dev Apr 29, 2023
@AUTOMATIC1111
Owner

I'm pretty sure --extra-index-url https://download.pytorch.org/whl/cu118 for OSX is wrong but I don't have a mac to try it on.

@DGdev91
Contributor

DGdev91 commented Apr 29, 2023

I'm pretty sure --extra-index-url https://download.pytorch.org/whl/cu118 for OSX is wrong but I don't have a mac to try it on.

According to PyTorch's website it should be just pip3 install torch torchvision without extra arguments
