
Blog post about how to optimize LLMs for memory and speed#1473

Merged
patrickvonplaten merged 15 commits into
mainfrom
add_getting_most_out_of_llms
Sep 15, 2023
@patrickvonplaten patrickvonplaten commented Sep 10, 2023

Contributor

This blog post does a deep dive into:

  • Optimizing memory consumption with 8bit/4bit
  • Optimizing speed with Flash Attention
  • Explaining MQA/GQA as well as ALiBi/RoPE

TODOs:

  • The title is currently "Getting the most out of LLMS" - is there a better one? Maybe something like "Optimizing LLMs for memory and speed"? (wdyt @gante ?)
  • Currently there is no nice thumbnail. Would be good to make one.

The corresponding transformers doc PR is here: huggingface/transformers#26058

@patrickvonplaten patrickvonplaten changed the title Add getting most out of llms Blog post about how to optimize LLMs for memory and speed Sep 10, 2023

@younesbelkada younesbelkada left a comment

Contributor

Very nice and great blog post @patrickvonplaten !
I left a few minor comments and open discussions; let me know what you think!

Note that you can combine optimization tricks, for example 8-bit / 4-bit quantization together with Flash Attention. I feel this is not clear to users and is definitely worth emphasizing. What do you think?

Comment thread getting_the_most_out_of_LLMs.md Outdated
While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full `bfloat16` inference. It is up to the user to try it out.

Also note that inference here was again a bit slower compared to 8-bit quantization. This is due to the more aggressive quantization method used for 4-bit quantization, which makes $\text{quantize}$ and $\text{dequantize}$ take longer during inference.
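
As a rough illustration of the $\text{quantize}$ / $\text{dequantize}$ round trip mentioned above, here is a minimal sketch using plain absmax scaling. This is an assumption made for illustration only: it is not the scheme used by the library in the blog post, the function names are hypothetical, and it says nothing about speed. It only shows that the two steps wrap every use of the weights and that a coarser 4-bit grid loses more precision than an 8-bit one.

```python
from typing import List, Tuple


def quantize(values: List[float], bits: int) -> Tuple[List[int], float]:
    """Map floats to signed integers of the given bit width (absmax scaling)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = absmax / qmax
    return [round(v / scale) for v in values], scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def max_error(values: List[float], bits: int) -> float:
    """Largest absolute round-trip error for the given bit width."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights, chosen only to make the effect visible.
weights = [0.12, -0.50, 0.33, 0.91, -0.07]
err8 = max_error(weights, bits=8)
err4 = max_error(weights, bits=4)
print(f"8-bit max round-trip error: {err8:.4f}")
print(f"4-bit max round-trip error: {err4:.4f}")
```

The 4-bit error comes out larger than the 8-bit error for these values, which matches the blog's warning that 4-bit results can differ from 8-bit or `bfloat16` inference.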

Contributor

Note that you can make 4-bit inference even faster by setting `bnb_4bit_compute_dtype` to `torch.float16`. According to Tim, this should lead to faster inference than fp16: https://twitter.com/Tim_Dettmers/status/1683118705956491264?s=20

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

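quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16 instead of the fp32 default
)

model_id = "..."  # placeholder: the checkpoint name to load
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)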

For some reason, `bnb_4bit_compute_dtype` defaults to `torch.float32`. I think we could also open a PR to change the default to fp16: https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L204

Contributor Author

Can I also run this in bfloat16? I'm running everything in bfloat16 here, not in float16.

Contributor

Yes, you can also run it in bfloat16, but it will be slower AFAIK.

Contributor Author

Hmm ok maybe a bit too much for this blog post then. I'll leave as is for now - the user can check the docs for more info as mentioned below.

@gante gante left a comment

Contributor

This is a great read! I'm sure the community will appreciate this summary of modern performance techniques 🔥

An additional idea: while I don't think we should add more content here, we could have a "further reading" section with links to even more advanced techniques like:

Comment thread getting_the_most_out_of_LLMs.md Outdated

The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.

In this blog post, we will go over the most effective techniques to tackle these challenges for efficient LLM deployment:

Contributor

Since the field is moving very fast and the doc may quickly become outdated, I'd add some time-based caveat to this sentence.

Something like "...the most effective techniques at the time of writing..."

Comment thread getting_the_most_out_of_LLMs.md Outdated
Close enough to our back-of-the-envelope computation! The number is not exactly the same because converting bytes to kilobytes uses a factor of 1024 rather than 1000. The back-of-the-envelope formula can therefore also be understood as an "at most X GB" computation.
Note that if we had tried to run the model in full float32 precision, a whopping 64 GB of VRAM would have been required.

> Almost all models are trained in bfloat16 nowadays, so there is no reason to run the model in full float32 precision if [your GPU supports bfloat16](https://discuss.pytorch.org/t/bfloat16-native-support/117155/5). Float32 won't give better inference results than the precision that was used to train the model.
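
The back-of-the-envelope computation discussed above can be sketched as follows. This is a minimal illustration, not the blog post's own code: the parameter count of 16 billion is a hypothetical value picked so that float32 reproduces the 64 GB figure quoted above, `estimate_weights_memory` is a made-up helper name, and the estimate covers the weights only (no activations, key-value cache, or quantization overhead).

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1, "int4": 0.5}


def estimate_weights_memory(num_params: float, dtype: str) -> tuple:
    """Return (decimal GB, binary GiB) needed to hold the weights alone."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / 1000**3, total_bytes / 1024**3


NUM_PARAMS = 16e9  # hypothetical: chosen so float32 gives the 64 GB quoted above

for dtype in ("float32", "bfloat16", "int8", "int4"):
    gb, gib = estimate_weights_memory(NUM_PARAMS, dtype)
    print(f"{dtype:>8}: {gb:5.1f} GB (1000-based) = {gib:5.1f} GiB (1024-based)")
```

The 1024-based figure is always the smaller of the two, which is why the 1000-based rule of thumb reads as an "at most X GB" bound.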

Contributor

Suggestion: I would add here (or close to this paragraph) that one can inspect the model's pretrained/fine-tuned precision in the `torch_dtype` config attribute, and that selecting the same precision type is often a good idea (unless it's float32).

Contributor Author

Good idea!
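
The suggestion in this thread could look roughly like the sketch below. It works on a plain dictionary shaped like a checkpoint's `config.json` so that it stays self-contained. The helper name `choose_load_dtype`, the example config contents, and the fallback to `bfloat16` (which assumes the GPU supports it) are all assumptions made for illustration.

```python
import json

FALLBACK_DTYPE = "bfloat16"  # assumption: the GPU supports bfloat16


def choose_load_dtype(config: dict) -> str:
    """Pick a loading precision from a model config dictionary."""
    stored = config.get("torch_dtype")
    # Reuse the checkpoint's precision unless it is float32 or not recorded.
    if stored in (None, "float32"):
        return FALLBACK_DTYPE
    return stored


# Hypothetical config.json contents, not taken from a real checkpoint.
example_config = json.loads('{"model_type": "example", "torch_dtype": "float16"}')
print(choose_load_dtype(example_config))
print(choose_load_dtype({"torch_dtype": "float32"}))
```

The float32 special case follows the comment above: a checkpoint saved in float32 does not need to be run in float32.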

patrickvonplaten and others added 3 commits September 14, 2023 20:09
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
@patrickvonplaten

Contributor Author

Merging in invisible mode for now

@patrickvonplaten patrickvonplaten merged commit 5806fb3 into main Sep 15, 2023
@patrickvonplaten patrickvonplaten deleted the add_getting_most_out_of_llms branch September 15, 2023 11:42
@sayakpaul

Member

Let's please move the assets out of the directory.

@patrickvonplaten

Contributor Author

> Let's please move the assets out of the directory.

I don't understand this

@sayakpaul

Member

We only keep the thumbnails in the repository these days. This PR introduced additional assets (images) that are non-thumbnail ones. Those should reside in https://huggingface.co/datasets/huggingface/documentation-images.

kashif pushed a commit to metric-space/blog that referenced this pull request Sep 29, 2023
…e#1473)

* correct ((

* [LLMs] Getting most out of LLMS

* finish

* finish

* Fix

* finish

* finish

* finish

* finish

* Apply suggestions from code review

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* improve

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

5 participants