
Add separate cuda stream for live preview VAE #2844

Merged
catboxanon merged 2 commits into lllyasviel:main from drhead:main on May 1, 2025

drhead commented Apr 29, 2025

Contributor

This adds a separate CUDA stream for the live preview VAE (used only if CUDA streams are enabled in args) so that live preview processing can run in parallel with main processing. In particular, transferring the decoded image to the CPU no longer blocks the main processing stream. That transfer is by far the largest blocking call, so this should be beneficial on its own, but it would be much more beneficial if all blocking calls were removed from the main processing loop.

Internally, this should be handled very safely. If CUDA streams are disabled, nothing behaves differently than before. If CUDA streams are enabled, we wait for the main stream to catch up before starting VAE processing so that we don't grab an intermediate garbage tensor (this only blocks the live preview thread and never blocks the main thread). Since moving images to the CPU is already a blocking operation that forces a sync (albeit on the VAE stream and not the main one), we don't need to explicitly sync this stream at the end.
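The synchronization described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `decode_preview`, `decode_fn`, `main_stream`, and `preview_stream` are hypothetical names, and it assumes only the standard PyTorch calls `Stream.synchronize()` and `torch.cuda.stream()`.

```python
import torch


def decode_preview(latent, decode_fn, main_stream=None, preview_stream=None):
    """Decode a preview image and return it as a CPU tensor.

    All parameter names are hypothetical; decode_fn stands in for the
    live preview VAE.
    """
    if main_stream is None or preview_stream is None:
        # Streams disabled (or no CUDA): identical to the single-stream path.
        return decode_fn(latent).cpu()

    # Block only the calling (preview) host thread until the main stream has
    # finished the work queued so far, so the latent is fully written.
    main_stream.synchronize()

    with torch.cuda.stream(preview_stream):
        image = decode_fn(latent)
        # A plain .cpu() copy is blocking: the host waits for the side stream
        # here, so no extra synchronize call is needed afterwards.
        return image.cpu()
```

With CUDA available, a caller on the preview thread might pass the stream the sampler runs on as `main_stream` and a `torch.cuda.Stream()` created once at startup as `preview_stream`. Without them, the function falls back to the plain path.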

While doing this I also made a couple of changes to the function that converts decoded images to PIL images. Casting float to int truncates by default, but rounding to the nearest integer is the standard in image processing, so the float-to-uint8 conversion now rounds. I also changed it to use in-place operations where appropriate and to do the clamp and related operations on the CPU.
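To illustrate the truncation issue, here is a small framework-independent sketch in plain Python. The helper names are hypothetical and this is not the conversion function from the PR. A tensor cast such as `.to(torch.uint8)` truncates toward zero in the same way `int()` does here.

```python
def to_uint8_truncate(x: float) -> int:
    """Clamp to [0, 1], scale to [0, 255], then truncate toward zero."""
    return int(min(max(x, 0.0), 1.0) * 255.0)


def to_uint8_round(x: float) -> int:
    """Clamp to [0, 1], scale to [0, 255], then round to the nearest integer."""
    return round(min(max(x, 0.0), 1.0) * 255.0)


def mean_error(convert, samples: int = 4096) -> float:
    """Average of (converted value - exact value) over an even sweep of [0, 1]."""
    total = 0.0
    for i in range(samples + 1):
        x = i / samples
        total += convert(x) - x * 255.0
    return total / (samples + 1)


if __name__ == "__main__":
    # An almost-white pixel only reaches 255 when rounding is used.
    print(to_uint8_truncate(0.999), to_uint8_round(0.999))
    # Truncation darkens the image by about half a level on average.
    print(mean_error(to_uint8_truncate), mean_error(to_uint8_round))
```

Truncation always rounds down, so it introduces a systematic bias of roughly half an intensity level, while rounding keeps the average error near zero.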

This doesn't support XPU streams, but it could presumably be modified for them easily. I would want someone who can actually test that to implement it, though.

This is how the compute overlap looks in Nsight Systems. Default Stream 7 is the main stream, Stream 21 is the VAE stream. Live preview is set to every step:
[image: Nsight Systems timeline]

Zoomed in view on one of the decoding sections:

[image: zoomed-in Nsight Systems timeline of one decoding section]

The "CUDA stream synchronize" event here is where the VAE thread waits for the main stream to finish a step before starting to decode. As you can see, the main stream continues unimpeded.

If you look very closely at the kernel utilization row at the very top, you will notice that utilization is at 100% during VAE processing, whereas there are some small dips during other sections. This is part of the performance benefit of running the live preview VAE this way: it can use some SMs that aren't being used by the current kernel on the main processing stream.

Also notable: in this capture I have all of the blocking functions removed from the code, which is why the CUDA Overhead section shows a bunch of "Command Buffer Full" events. That won't happen with this PR alone yet, but it is a step towards making that possible.

@drhead drhead requested a review from lllyasviel as a code owner April 29, 2025 12:34
@catboxanon catboxanon merged commit d357396 into lllyasviel:main May 1, 2025
spawner1145 added a commit to spawner1145/stable-diffusion-webui-forge that referenced this pull request May 2, 2025
Add separate cuda stream for live preview VAE (lllyasviel#2844)
anon0730 commented May 3, 2025

After the commit there seems to be an issue with respecting setting_show_progress_every_n_steps.
For example if you set it to update preview every 10 steps it will first show you the preview on step 10 and then update the preview every step after that instead of updating it at step 20, 30 and so on.

I can't actually seem to notice any generation speed difference between updating preview every step and updating it rarely (at least with Approx NN) even on my dated system so having this setting at 1 makes it a non-issue for me. Still, worth a mention in case it affects someone with specific hardware/settings.

drhead commented May 3, 2025

Contributor Author

> After the commit there seems to be an issue with respecting setting_show_progress_every_n_steps. For example if you set it to update preview every 10 steps it will first show you the preview on step 10 and then update the preview every step after that instead of updating it at step 20, 30 and so on.
>
> I can't actually seem to notice any generation speed difference between updating preview every step and updating it rarely (at least with Approx NN) even on my dated system so having this setting at 1 makes it a non-issue for me. Still, worth a mention in case it affects someone with specific hardware/settings.

Yeah, seems like a line got obliterated when I patched this, that's probably what did it. I'll put it back.
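For readers following along, here is a hedged sketch of how a single missing line can produce exactly the reported symptom. It is an assumption for illustration only, not a statement about which line was actually lost. `preview_steps`, `last_preview_step`, and `update_marker` are hypothetical names.

```python
def preview_steps(total_steps: int, every_n: int, update_marker: bool = True) -> list[int]:
    """Return the sampling steps at which a preview would be produced.

    A preview is due once at least `every_n` steps have passed since the last
    one. If the "last preview" marker is never updated, the condition stays
    true on every step after the first preview.
    """
    last_preview_step = 0
    shown = []
    for step in range(1, total_steps + 1):
        if step - last_preview_step >= every_n:
            shown.append(step)
            if update_marker:
                # The bookkeeping line whose absence would cause the symptom.
                last_preview_step = step
    return shown


if __name__ == "__main__":
    print(preview_steps(30, 10))                       # intended cadence
    print(preview_steps(30, 10, update_marker=False))  # marker never updated
```

With the marker updated, previews land on steps 10, 20, and 30. Without it, the first preview appears on step 10 and then on every step after that, which matches the behavior described in the report.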

Haoming02 added a commit to Haoming02/sd-webui-forge-classic that referenced this pull request May 21, 2025
lshqqytiger pushed a commit to lshqqytiger/stable-diffusion-webui-amdgpu-forge that referenced this pull request Jun 24, 2025