Add separate cuda stream for live preview VAE #2844
Add separate cuda stream for live preview VAE (lllyasviel#2844)
After the commit, setting_show_progress_every_n_steps no longer seems to be respected. I can't notice any difference in generation speed between updating the preview every step and updating it rarely (at least with Approx NN), even on my dated system, so with the setting at 1 this is a non-issue for me. Still, it's worth a mention in case it affects someone with specific hardware or settings.
Yeah, it seems a line got obliterated when I patched this; that's probably what did it. I'll put it back.
This adds a separate CUDA stream for the live preview VAE (only if CUDA streams are enabled in args) so that live preview processing can run in parallel with main processing. In particular, transferring the decoded image to the CPU no longer blocks the main processing stream. That transfer is by far the largest blocking call, so this should be beneficial, though it would be much more beneficial if all blocking calls were removed from the main processing loop.
Internally, this should be handled very safely. If CUDA streams are disabled, nothing behaves differently than before. If CUDA streams are enabled, we wait for the main stream to catch up before starting VAE processing so we don't grab some intermediate garbage tensor (this only blocks the live preview thread and will never block the main thread). Since moving images to the CPU is already a blocking operation that forces a sync (albeit on the VAE stream and not the main one), we don't need to explicitly sync this stream at the end.
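The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `decode_preview`, `decode_fn`, `main_stream` and `preview_stream` are hypothetical names, and the sketch assumes the sampler runs on the device's default stream.

```python
import torch


def decode_preview(latent, decode_fn, main_stream=None, preview_stream=None):
    """Decode a latent for the live preview and return the image on the CPU.

    All names are hypothetical; `decode_fn` stands in for the preview VAE.
    """
    if preview_stream is None or not latent.is_cuda:
        # Streams disabled (or no GPU): behave exactly as before.
        return decode_fn(latent).cpu()

    if main_stream is None:
        # Assumption: the sampler runs on the device's default stream.
        main_stream = torch.cuda.default_stream(latent.device)

    # Host-side wait: blocks only the calling (preview) thread until the main
    # stream has finished the work queued so far, so `latent` is complete.
    main_stream.synchronize()

    with torch.cuda.stream(preview_stream):
        image = decode_fn(latent)
        # The copy to the CPU is blocking and syncs the preview stream, so no
        # extra synchronize() is needed before the image is used.
        return image.cpu()
```

In this sketch the preview thread would create `preview_stream = torch.cuda.Stream()` once and pass it on every call. Passing `None` gives the old single-stream behaviour.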
While doing this I also made a couple of changes to the function that converts decoded images to PIL images. Casting float to int truncates by default, but rounding to the nearest integer is the standard in image processing, so the float-to-uint8 conversion now rounds. I also changed it to use in-place operations where appropriate and to do the clamp and related operations on the CPU.
This doesn't support XPU streams, but it could presumably be modified for them easily. I would want someone who can actually test that to implement it, though.
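To make the rounding point concrete, here is the arithmetic for single pixel values in [0, 1], written in plain Python so it runs without a GPU. The helper names are made up for this illustration and are not the functions touched by the PR; `int()` truncates toward zero, the same way a float-to-integer cast does.

```python
def to_uint8_truncate(x: float) -> int:
    """Clamp to [0, 1], scale to [0, 255], then truncate toward zero."""
    return int(min(max(x, 0.0), 1.0) * 255.0)


def to_uint8_round(x: float) -> int:
    """Clamp to [0, 1], scale to [0, 255], then round to the nearest integer."""
    return int(round(min(max(x, 0.0), 1.0) * 255.0))


pixels = [0.002, 0.5, 0.999]
print([to_uint8_truncate(p) for p in pixels])  # [0, 127, 254]
print([to_uint8_round(p) for p in pixels])     # [1, 128, 255]
```

Truncation always lands at or below the scaled value, so it shifts the whole image slightly darker. A nearly white pixel (0.999) never reaches 255 unless the conversion rounds.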
This is how the compute overlap looks in Nsight Systems. Default Stream 7 is the main stream, Stream 21 is the VAE stream. Live preview is set to every step:

Zoomed-in view of one of the decoding sections:
The CUDA stream synchronize shown here is where the VAE thread waits for the main stream to finish a step before starting to decode. As you can see, the main stream continues unimpeded.
If you look very closely at the Kernel utilization at the very top, you will notice that utilization is at 100% during VAE processing, whereas there are some small dips during other sections. This is part of the performance benefit of running the live preview VAE this way: it can use some SMs that aren't being used by the current kernel on the main processing stream.
Also notable: for this profile I have all of the blocking functions removed from the code, so you can see the CUDA Overhead section showing a bunch of "Command Buffer Full" events. That won't happen with this change alone yet, but it is a step towards making it possible.