Blog post about how to optimize LLMs for memory and speed #1473
…etting_most_out_of_llms
younesbelkada left a comment
Very nice blog post, @patrickvonplaten!
I left a few minor comments and open discussions; let me know what you think!
Note that the optimization tricks can be combined, for example 8-bit / 4-bit quantization together with flash attention. I feel this is not clear to users and is definitely worth emphasizing. What do you think?
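A minimal sketch of that combination, under several assumptions: a recent `transformers` release that accepts the `attn_implementation="flash_attention_2"` argument, the `bitsandbytes`, `accelerate` and `flash-attn` packages installed, and a CUDA GPU that supports both. The helper name `load_4bit_with_flash_attention` is hypothetical and only used for illustration.

```python
def load_4bit_with_flash_attention(model_id: str):
    """Load a causal LM with 4-bit weights and Flash Attention 2 enabled."""
    # Imported lazily so that defining the helper does not need the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit weight quantization via bitsandbytes.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,  # Flash Attention 2 needs fp16 or bf16
        attn_implementation="flash_attention_2",  # assumes a version with this argument
        device_map="auto",
    )
```

Quantization reduces the memory taken by the weights, while flash attention reduces the memory taken by the attention computation, so the two do not conflict.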
| While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full `bfloat16` inference. It is up to the user to try it out.
| Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to $\text{quantize}$ and $\text{dequantize}$ taking longer during inference.
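To make the $\text{quantize}$ / $\text{dequantize}$ round trip in the quoted paragraph concrete, here is a toy sketch. It uses plain absmax rounding with a single scale, which is an assumption chosen for brevity and is not the block-wise 4-bit scheme that bitsandbytes implements. The function names are hypothetical.

```python
def quantize_absmax(values, n_bits=8):
    """Map floats to signed integer codes using one absmax scale."""
    q_max = 2 ** (n_bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / q_max
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale reproduces it
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.8]  # made-up values for illustration
errors = {}
for n_bits in (8, 4):
    codes, scale = quantize_absmax(weights, n_bits)
    restored = dequantize(codes, scale)
    errors[n_bits] = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{n_bits}-bit codes: {codes}, max abs error: {errors[n_bits]:.4f}")
```

The sketch only shows the loss of precision at lower bit widths; it says nothing about speed, which depends on the actual kernels used.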
Note that you can make 4-bit inference even faster by setting bnb_4bit_compute_dtype to torch.float16. According to Tim, this should lead to faster inference than fp16: https://twitter.com/Tim_Dettmers/status/1683118705956491264?s=20

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    # model_id is the name or path of the checkpoint to load
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

bnb_4bit_compute_dtype is set to torch.float32 by default for some reason; I think we could also open a PR to make fp16 the default. https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L204
Can I also run this in bfloat16? I'm running everything in bfloat16 here, not in float16.
Yes, you can also run it in bfloat16, but it will be slower AFAIK.
Hmm ok, maybe a bit too much for this blog post then. I'll leave it as is for now; the user can check the docs for more info, as mentioned below.
gante left a comment
This is a great read! I'm sure the community will appreciate this summary of modern performance techniques 🔥
An additional idea: while I don't think we should add more content here, we could have a further reading section with links to even more advanced techniques, like:
- Speculative Decoding
- Paged attention
- SARATHI: Balance pre-filling (compute-bound) with new token generation (memory bandwidth-bound) (https://twitter.com/agrawalamey12/status/1698078186897134028)
| The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.
| In this blog post, we will go over the most effective techniques to tackle these challenges for efficient LLM deployment:
Since the field is moving very fast and the doc may quickly become outdated, I'd add some time-based caveat to this sentence.
Something like "...the most effective techniques at the time of writing..."
| Close enough to our back-of-the-envelope computation! We can see the number is not exactly correct as going from bytes to kilobytes requires a multiplication of 1024 instead of 1000. Therefore the back-of-the-envelope formula can also be understood as an "at most X GB" computation.
| Note that if we had tried to run the model in full float32 precision, a whopping 64 GB of VRAM would have been required.
| > Almost all models are trained in bfloat16 nowadays, there is no reason to run the model in full float32 precision if [your GPU supports bfloat16](https://discuss.pytorch.org/t/bfloat16-native-support/117155/5). Float32 won't give better inference results than the precision that was used to train the model.
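The back-of-the-envelope estimate quoted above can be sketched as follows. The parameter count of 16 billion is an assumption inferred from the quoted 64 GB float32 figure (64 GB / 4 bytes per parameter), and the helper name is hypothetical. Only the weights are counted; activations and the key-value cache are ignored.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_weight_memory(n_params, dtype):
    """Return the weight memory as (decimal GB, binary GiB)."""
    n_bytes = n_params * BYTES_PER_PARAM[dtype]
    return n_bytes / 1000**3, n_bytes / 1024**3


n_params = 16e9  # assumption: implied by 64 GB at 4 bytes per parameter
for dtype in ("float32", "bfloat16"):
    gb, gib = estimate_weight_memory(n_params, dtype)
    print(f"{dtype}: at most {gb:.0f} GB ({gib:.1f} GiB)")
```

Because each unit step is a factor of 1024 rather than 1000, the binary figure is always smaller than the decimal one, which is why the simple formula reads as an "at most X GB" bound.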
Suggestion: I would add here (or close to this paragraph) that one can inspect the model's pretrained/fine-tuned precision in the torch_dtype config attribute, and that selecting the same precision is often a good idea (unless it's float32).
Good idea!
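A minimal sketch of that suggestion, assuming the checkpoint's `config.json` stores the precision as a string under the `torch_dtype` key. The sample JSON and the helper `pick_inference_dtype` are hypothetical, and the bfloat16 fallback assumes the GPU supports it.

```python
import json

# Hypothetical excerpt of a checkpoint's config.json, for illustration only.
config_json = '{"torch_dtype": "bfloat16"}'


def pick_inference_dtype(config, fallback="bfloat16"):
    """Reuse the saved precision unless it is missing or float32."""
    saved = config.get("torch_dtype")
    if saved is None or saved == "float32":
        return fallback
    return saved


config = json.loads(config_json)
print(pick_inference_dtype(config))
```

With `transformers` itself, the same value is available as the `torch_dtype` attribute of the loaded config, as described in the comment above.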
Merging in invisible mode for now
Let's please move the assets out of the directory.
I don't understand this
We only keep the thumbnails in the repository these days. This PR introduced additional assets (images) that are non-thumbnail ones. Those should reside in https://huggingface.co/datasets/huggingface/documentation-images.
…e#1473)
* correct ((
* [LLMs] Getting most out of LLMS
* finish
* finish
* Fix
* finish
* finish
* finish
* finish
* Apply suggestions from code review
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* improve
---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
This blog post does a deep dive into:
TODOs:
The corresponding `transformers` doc PR is here: huggingface/transformers#26058