Blog post about how to optimize LLMs for memory and speed by patrickvonplaten · Pull Request #1473 · huggingface/blog

patrickvonplaten · 2023-09-10T12:07:53Z

This blog post does a deep dive into:

Optimizing memory consumption with 8bit/4bit
Optimizing speed with Flash Attention
Explaining MQA/GQA as well as ALiBi/RoPE

TODOs:

The title is currently "Getting the most out of LLMS". Is there a better one? Maybe something like "Optimizing LLMs for memory and speed"? (wdyt @gante ?)
Currently there is no nice thumbnail. Would be good to make one.

The corresponding transformers doc PR is here: huggingface/transformers#26058


younesbelkada

Very nice and great blogpost @patrickvonplaten!
I left a few minor comments and open discussions; let me know what you think!

Note that you can combine optimization tricks, for example 8-bit / 4-bit quantization with Flash Attention. I feel this is not clear to users and is definitely worth emphasizing. What do you think?

younesbelkada · 2023-09-11T08:55:38Z

+While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full `bfloat16` inference. It is up to the user to try it out.
+
+Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to $\text{quantize}$ and $\text{dequantize}$ taking longer during inference.
+


Note that you can make 4-bit inference even faster by setting `bnb_4bit_compute_dtype` to `torch.float16`. According to Tim, this should lead to faster inference than fp16: https://twitter.com/Tim_Dettmers/status/1683118705956491264?s=20

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Run the 4-bit matmuls in float16 instead of the float32 default.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    # model_id is the checkpoint name defined earlier in the blog post.
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

The value of `bnb_4bit_compute_dtype` defaults to `torch.float32` for some reason. I think we could also open a PR to make fp16 the default: https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L204

Can I also run this in bfloat16? I'm running everything in bfloat16 here, not in float16.

Yes, you can also run it in bfloat16, but it will be slower AFAIK.

Hmm ok maybe a bit too much for this blog post then. I'll leave as is for now - the user can check the docs for more info as mentioned below.
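For readers of this thread, the $\text{quantize}$ and $\text{dequantize}$ steps mentioned in the hunk above can be pictured with a toy example. This is only a minimal sketch, assuming a single absmax scale and signed 4-bit codes in pure Python; it is not the blockwise method bitsandbytes uses, and the names `quantize_4bit` and `dequantize_4bit` are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit codes in [-7, 7] using one absmax scale.

    Toy illustration only (assumption): real 4-bit schemes work blockwise
    and use different code books.
    """
    # Fall back to 1.0 so an all-zero input does not give a zero scale.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats; this extra step runs before every matmul."""
    return [c * scale for c in codes]


weights = [0.7, -0.35, 0.1, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:+d} -> {r:+.2f}")
```

The rounding error per weight is at most half the scale, which is where the accuracy differences come from, and the dequantize call is the extra work paid at inference time.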

gante

This is a great read! I'm sure the community will appreciate this summary of modern performance techniques 🔥

One additional idea: while I don't think we should add more content here, we could have a further-reading section with links to even more advanced techniques, like:

Speculative Decoding
Paged attention
SARATHI: Balance pre-filling (compute-bound) with new token generation (memory bandwidth-bound) (https://twitter.com/agrawalamey12/status/1698078186897134028)

gante · 2023-09-11T16:51:51Z

+
+The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.
+
+In this blog post, we will go over the most effective techniques to tackle these challenges for efficient LLM deployment:


Since the field is moving very fast and the doc may quickly become outdated, I'd add some time-based caveat to this sentence.

Something like "...the most effective techniques at the time of writing..."

gante · 2023-09-11T17:23:36Z

+Close enough to our back-of-the-envelope computation! We can see the number is not exactly correct as going from bytes to kilobytes requires a multiplication of 1024 instead of 1000. Therefore the back-of-the-envelope formula can also be understood as an "at most X GB" computation.
+Note that if we had tried to run the model in full float32 precision, a whopping 64 GB of VRAM would have been required.
+
+> Almost all models are trained in bfloat16 nowadays, there is no reason to run the model in full float32 precision if [your GPU supports bfloat16](https://discuss.pytorch.org/t/bfloat16-native-support/117155/5). Float32 won't give better inference results than the precision that was used to train the model.
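The back-of-the-envelope computation in the hunk above can be sketched as follows. This is a rough illustration, assuming a model with 16 billion parameters (inferred from the 64 GB float32 figure, not stated in this thread) and counting weights only; the helper name `weight_memory` is hypothetical.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1, "int4": 0.5}


def weight_memory(num_params, dtype):
    """Return (decimal GB, binary GiB) needed to hold the weights alone."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / 1000**3, total_bytes / 1024**3


# Assumption: 16B parameters, inferred from "64 GB in float32".
num_params = 16_000_000_000

for dtype in ("float32", "bfloat16", "int8", "int4"):
    gb, gib = weight_memory(num_params, dtype)
    print(f"{dtype:>8}: {gb:5.1f} GB (1000-based) = {gib:5.1f} GiB (1024-based)")
```

Because each 1024-based step divides by slightly more than 1000, the 1000-based estimate is always the larger number, which is why the formula reads as an "at most X GB" bound.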


suggestion: I would add here (or close to this paragraph) that one can inspect the model's pretrained/fine-tuned precision in the torch_dtype config attribute, and that selecting the same precision type is often a good idea (unless it's float32)
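A small sketch of that suggestion, under stated assumptions: the helper `pick_inference_dtype` is hypothetical, and keeping float32 when the GPU lacks bfloat16 support is a conservative choice made here, not something decided in this thread. In practice the input would come from the checkpoint's config, for example `AutoConfig.from_pretrained(model_id).torch_dtype`.

```python
def pick_inference_dtype(pretrained_dtype, gpu_supports_bf16=True):
    """Choose a load precision from the checkpoint's `torch_dtype` value.

    Accepts a torch.dtype, a string such as "torch.bfloat16" or "bfloat16",
    or None when the config does not record a dtype.
    """
    name = str(pretrained_dtype).replace("torch.", "")
    if name in ("bfloat16", "float16"):
        # Reuse the precision the model was trained or fine-tuned in.
        return name
    # float32 or unknown: avoid full precision when bfloat16 is available.
    # Assumption: without bfloat16 support we stay in float32.
    return "bfloat16" if gpu_supports_bf16 else "float32"


print(pick_inference_dtype("torch.bfloat16"))
print(pick_inference_dtype("float32"))
print(pick_inference_dtype("float32", gpu_supports_bf16=False))
```

The returned name can then be mapped to the matching `torch` dtype and passed as `torch_dtype` when loading the model.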


patrickvonplaten · 2023-09-15T11:42:08Z

Merging in invisible mode for now

sayakpaul · 2023-09-15T16:27:22Z

Let's please move the assets out of the directory.

patrickvonplaten · 2023-09-18T11:15:52Z

> Let's please move the assets out of the directory.

I don't understand this

sayakpaul · 2023-09-18T11:44:27Z

We only keep the thumbnails in the repository these days. This PR introduced additional assets (images) that are non-thumbnail ones. Those should reside in https://huggingface.co/datasets/huggingface/documentation-images.


patrickvonplaten added 12 commits June 20, 2023 14:43

correct ((

0442a2e

Merge branch 'main' of https://github.com/huggingface/blog

328bbf6

Merge branch 'main' of https://github.com/huggingface/blog

ed62bb7

[LLMs] Getting most out of LLMS

0550ec3

Merge branch 'main' of https://github.com/huggingface/blog into add_getting_most_out_of_llms

bec909c

finish

43cc073

finish

b386a2f

Fix

0aef065

finish

0c4e44a

finish

8be1617

finish

a19cbe0

finish

0e3cdcb

patrickvonplaten changed the title ~~Add getting most out of llms~~ Blog post about how to optimize LLMs for memory and speed Sep 10, 2023

patrickvonplaten requested review from gante and younesbelkada September 10, 2023 22:42

patrickvonplaten mentioned this pull request Sep 10, 2023

Add LLM doc huggingface/transformers#26058

Merged

younesbelkada reviewed Sep 11, 2023


gante reviewed Sep 11, 2023


julien-c reviewed Sep 12, 2023


Comment thread _blog.yml Outdated

patrickvonplaten commented Sep 14, 2023


Comment thread getting_the_most_out_of_LLMs.md Outdated

patrickvonplaten commented Sep 14, 2023


Comment thread getting_the_most_out_of_LLMs.md Outdated

patrickvonplaten and others added 3 commits September 14, 2023 20:09

Apply suggestions from code review

fe3b0e0

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

improve

db46e28

finish

6fd5a50

patrickvonplaten merged commit 5806fb3 into main Sep 15, 2023

patrickvonplaten deleted the add_getting_most_out_of_llms branch September 15, 2023 11:42

