[Guide] Quantize your Diffusion Models with bnb #10012
Conversation
stevhliu
left a comment
Very nice start! 👏
I think you can combine this guide with the existing one here since there is quite a bit of overlap between the two. Here are some general tips for doing that:
- Keep the introduction in the existing guide but add a few sentences that adapt it to quantizing Flux.1-dev with bitsandbytes so you can run it on hardware with less than 16GB of memory. I think most users at this point have a general idea of what quantization is (and it is also covered in the getting started), so we don't need to spend more time on what it is/why it is important. The focus is more on bitsandbytes than quantization in general.
- I don't think it's necessary to have a section for showing how to use an unquantized model. Users are probably more eager to see how they can use a quantized model and getting them there as quickly as possible would be better.
- Combine the 8-bit quantization section with the existing one here. You can add a note about how you're quantizing both the `T5EncoderModel` and `FluxTransformer2DModel`, and what the `low_cpu_mem_usage` and `device_map` (if you have more than one GPU) parameters do (see the sketch after this list for one possible shape).
- You can do the same thing with the 4-bit section. Combine it with the existing one and add a few lines explaining the parameters.
- Combine the NF4 quantization section with the one here.
- Lead with the visualization in the method comparison section. Most users probably aren't too interested in comparing and running all this code themselves, so it's more impactful to lead with the results first.
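
For reference, a minimal sketch of what the combined 8-bit section could end up looking like, assuming the Flux.1-dev layout referenced above (parameter values here are illustrative, not the final text of the guide):

```python
import torch
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import T5EncoderModel
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

# Quantize the two large components to 8-bit: the T5 text encoder (a Transformers model)
# and the Flux transformer (a Diffusers model). low_cpu_mem_usage and device_map
# (for multi-GPU setups) are the additional `from_pretrained` options worth explaining here.
quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```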
pcuenca
left a comment
Suggested some nits. Greatly agree with @stevhliu's comments and recommendations.
```python
memory_allocated = torch.cuda.max_memory_allocated(0) / (1024 ** 3)
print(f"GPU Memory Allocated: {memory_allocated:.2f} GB")
```

As a reader, I'd like to know how much it was at this point.
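
One way the guide could surface that number at each checkpoint is a small helper along these lines (a sketch; `report_peak_memory` is a hypothetical name, not something in the PR):

```python
import torch

def report_peak_memory(label):
    # Print the peak GPU memory allocated so far, then reset the counter for the next stage.
    peak_gb = torch.cuda.max_memory_allocated(0) / (1024 ** 3)
    print(f"{label}: peak GPU memory allocated {peak_gb:.2f} GB")
    torch.cuda.reset_peak_memory_stats(0)

# For example, call once after loading the quantized models and again after generating an image.
report_peak_memory("after loading the quantized models")
```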
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
stevhliu
left a comment
Nice rework, should be ready to go soon! 🚀
stevhliu
left a comment
One last nit!
@stevhliu thanks for the thorough review. I have taken care of all the suggestions. 🤗

Awesome, thanks so much for iterating on this! I'll give @sayakpaul a chance to review it and then we can merge 🤗
sayakpaul
left a comment
It seems like a lot of the changes are just breaking long lines into multiple ones, so it's difficult for me to go through the true changes.
If possible, could you please undo those breaks? If not, I am okay with merging it since @stevhliu has already reviewed it.
sayakpaul
left a comment
Very nice improvements! I think we're close to merge -- just a few comments, mostly related to nits.
| "black-forest-labs/FLUX.1-dev", | ||
| subfolder="text_encoder_2", | ||
| quantization_config=quant_config, | ||
| torch_dtype=torch.float16, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a note about the torch_dtype here. I am thinking of something like this:

> Depending on the GPU, set your `torch_dtype`. Ada and newer GPUs support `torch.bfloat16`, and we suggest using it when applicable.
@stevhliu any suggestions?
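
If such a note is added, one way to make it concrete would be a short dtype check like this (a sketch, not part of the reviewed diff):

```python
import torch

# Prefer bfloat16 when the GPU supports it (e.g. Ampere/Ada and newer); otherwise fall back to float16.
torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
```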
```python
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float32,
)
```
Maybe we could show this in a diff rather than py block as the only difference here (compared to the above snippets) is torch_dtype? So, I would do:
```diff
 transformer_8bit = FluxTransformer2DModel.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
     quantization_config=quant_config,
+    torch_dtype=torch.float32,
 )
```

```python
    **pipe_kwargs,
).images[0]

image.resize((224, 224))
```
I don't think we have to show this line of code here.
```html
<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/8bit.png"/>
</div>
```
Would it also make sense to comment on the following things (see the sketch after this list)?

- When memory permits, users can directly move the pipeline to the GPU by calling `to("cuda")`.
- To go easy on the memory, they can also use `enable_model_cpu_offload()`.
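
In code, the two options might look like this (a sketch assuming `pipe` is a `FluxPipeline` assembled from the quantized components; use one or the other):

```python
# Option 1: enough VRAM available -- move the whole pipeline to the GPU.
pipe.to("cuda")

# Option 2: go easier on memory -- keep idle components on the CPU and move each one
# to the GPU only while it is needed.
# pipe.enable_model_cpu_offload()
```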
> bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
> [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
We can add a little [!NOTE] here saying that we usually don't quantize the `CLIPTextModel` as it's small enough, and the `AutoencoderKL` because it doesn't contain too many `torch.nn.Linear` layers.
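
For illustration, assembling the pipeline so that only the quantized components are swapped in could look roughly like this (a sketch reusing the `text_encoder_8bit` and `transformer_8bit` names from the snippets above):

```python
import torch
from diffusers import FluxPipeline

# Only text_encoder_2 (T5) and the transformer are passed in quantized; the pipeline loads
# CLIPTextModel and AutoencoderKL as usual, since CLIP is small and the VAE has very few
# torch.nn.Linear layers to quantize.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
)
```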
```python
image = pipe(
    generator=torch.Generator("cpu").manual_seed(0),
    **pipe_kwargs,
).images[0]

image.resize((224, 224))
```
Same as above.
> [!Note]
> Depending on the GPU, set your `torch_dtype`. Ada and newer GPUs support `torch.bfloat16`, and we suggest using it when applicable.

> [!Note]
> We do not quantize the `CLIPTextModel` and the `AutoencoderKL` due to their small size, and also because `AutoencoderKL` has very few `torch.nn.Linear` layers.
We already added it above, no? Or is this for separate 8-bit and 4-bit sections?
sayakpaul
left a comment
Thanks a lot!
I will let @stevhliu review the new changes and get this merged.
stevhliu
left a comment
Awesome, just a few more changes!
Thank you for the review @sayakpaul, @stevhliu, and @pcuenca. I have made the changes. I think we are ready to go 🚀
> bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

> [!Note]
Need to make the same changes to the 4-bit section here too :)
I apologise for missing it!
I have added the same updates to the 4-bit section as well.
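
For reference, a 4-bit NF4 counterpart of the earlier 8-bit snippet could look roughly like this (a sketch; parameter values are illustrative):

```python
import torch
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig

# NF4 4-bit quantization of the Flux transformer: "nf4" is the 4-bit data type and
# bnb_4bit_compute_dtype controls the dtype used for the actual matmuls.
quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```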
* chore: initial draft
* Apply suggestions from code review (Co-authored-by: Pedro Cuenca <pedro@huggingface.co>, Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* chore: link in place
* chore: review suggestions
* Apply suggestions from code review (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* chore: review suggestions
* Update docs/source/en/quantization/bitsandbytes.md (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* review suggestions
* chore: review suggestions
* Apply suggestions from code review (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* adding same changes to 4 bit section
* review suggestions

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
This PR adds a guide on quantization of diffusion models using `bnb` and `diffusers`. Here is a colab notebook for easy code access.

CC: @stevhliu @sayakpaul