Add LLM doc by patrickvonplaten · Pull Request #26058 · huggingface/transformers

patrickvonplaten · 2023-09-08T18:22:24Z

Optimize LLM tutorial

This PR adds the doc version of https://huggingface.co/blog/optimize-llm .
Given the many code snippets, I think the tutorial is a nice addition to the Transformers docs.

patrickvonplaten · 2023-09-10T22:43:10Z

Doc will also be released with blog post: huggingface/blog#1473

HuggingFaceDocBuilderDev · 2023-09-25T18:15:52Z

The documentation is not available anymore as the PR was closed or merged.

younesbelkada

Looking great, thanks! Happy to add a FA-2 section in a follow-up PR!

MKhalusova

Thanks for adding this doc! It's a valuable read with lots of detailed information on LLM inference optimizations. Here are some thoughts from me:

- Given the level of detail, and the narrative style, I think the doc would fit better under the Conceptual Guides section rather than in Tutorials.
- I would suggest changing the title of the doc as it is not very descriptive. Perhaps "LLM inference optimizations"/"Efficient LLM deployment"?
- It would also be great to shorten/simplify the guide where possible.
- There are some really cool gems in the doc that would be great to highlight somehow (bold font or using a <Tip>), e.g. "Therefore, inference time is often not reduced when using quantized weights, but rather increases".
- I have left some nits regarding style, issues with formula rendering, etc.

MKhalusova · 2023-09-26T14:08:06Z

+
+In this blog post, we will go over the most effective techniques at the time of writing this blog post to tackle these challenges for efficient LLM deployment:
+
+1.  **Lower Precision**: Research has shown that operating at reduced numerical precision, namely 8-bit and 4-bit, can achieve computational advantages without a considerable decline in model performance.


It would be great to have links to corresponding sections here for faster navigation.
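For context on the "Lower Precision" item quoted above, here is a rough back-of-the-envelope sketch of how weight memory scales with numerical precision. It only counts the bytes needed to store the weights; the 7B parameter count and the helper name `weight_memory_gb` are illustrative assumptions, not something taken from the doc under review.

```python
# Bytes needed to store a single parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Hypothetical helper: ignores activations, the key-value cache and
    any per-block quantization metadata.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    num_params = 7e9  # assumed model size, for illustration only
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

Under these assumptions, going from bfloat16 to 8-bit halves the weight memory, and 4-bit halves it again.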


patrickvonplaten · 2023-10-11T17:39:34Z

Thanks for the review @MKhalusova!

Answering the bullet points directly in line:

> Given the level of detail, and the narrative style, I think the doc would fit better under the Conceptual Guides section rather than in Tutorials.

Happy to move it to the conceptual guides if it fits better - @gante wdyt?

> I would suggest changing the title of the doc as it is not very descriptive. Perhaps "LLM inference optimizations"/"Efficient LLM deployment"?

Yes, agree. Should be better now.

> It would also be great to shorten/simplify the guide where possible.

I'd need more precise comments here. We could maybe also do this in a second pass. As a general guide on how to optimize LLMs for memory and speed, I'm not sure how I would simplify or shorten it.

> There are some really cool gems in the doc that would be great to highlight somehow (bold font or using a <Tip>), e.g. "Therefore, inference time is often not reduced when using quantized weights, but rather increases".

Guess this is also a style issue. Maybe this can be done in a second pass to align it more with other docs.

> I have left some nits regarding style, issues with formula rendering, etc.

Addressed them.

patrickvonplaten · 2023-10-11T17:39:48Z

@gante could you also take a look here?

MKhalusova

Thanks for addressing the feedback!

gante

Nice addition to our documentation 💪 I've added a few nits.

Regarding position in the TOC -- despite having a lot of code, this doc is also very technical (diving into equations, talking about GPU internals, ...), and thus I agree with @MKhalusova's suggestion of moving to Conceptual Guides.

We also already have an LLM intro in the tutorials, we should definitely add a reference to this doc there!

gante · 2023-10-12T11:51:28Z

+
+```py
+flush()
+```


It would be very cool to leave a link to Flash Attention 2 + transformers by the end of this section! (https://huggingface.co/docs/transformers/v4.34.0/en/perf_infer_gpu_one#flash-attention-2)

gante · 2023-10-12T13:42:58Z

+-   1.  Quantize all weights to the target precision
+-   2.  Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
+-   3.  Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
+-   4.  Quantize the weights again to the target precision after computation with their inputs.


I was revisiting this question with @younesbelkada, diving into bnb code, and we concluded that 4. only happens at train time :)

E.g. we can see here that the intermediary tensors are discarded if the gradients are not needed. We can also see the requantization in the backward function (here)

We should add this detail here and in the blog post as well :)

Yes that makes sense! Thanks for double-checking
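To make the point of this thread concrete, here is a minimal sketch of the inference-time flow described above: quantize once, dequantize on the fly for the computation, and simply discard the dequantized copy afterwards (no re-quantization). It uses plain Python floats as a stand-in for bfloat16 and a single absmax scale per weight vector. This is a simplified illustration under those assumptions, not the bitsandbytes implementation, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float]) -> Tuple[List[int], float]:
    """Step 1: map floats to integers in [-127, 127] with one absmax scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights: List[int], scale: float) -> List[float]:
    """Step 3 (first half): recover approximate float weights."""
    return [q * scale for q in q_weights]


def quantized_dot(q_weights: List[int], scale: float, x: List[float]) -> float:
    """Steps 2 and 3: take stored int8 weights plus a float input,
    dequantize temporarily and compute in floating point.

    The dequantized list is a local temporary. At inference time it is
    dropped when the function returns; nothing is quantized again.
    """
    w = dequantize(q_weights, scale)
    return sum(wi * xi for wi, xi in zip(w, x))


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25]
    x = [1.0, 2.0, 3.0]
    q, scale = quantize_absmax(weights)  # done once, ahead of time
    exact = sum(wi * xi for wi, xi in zip(weights, x))
    print("quantized weights:", q)
    print("exact:", exact, "approx:", quantized_dot(q, scale, x))
```

The extra dequantization work on every forward pass is also why quantized inference is often not faster than running in bfloat16, as the doc itself notes.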


* [WIP] Add LLM doc
* rename
* latex
* latex
* Fix more latex
* [LLMs] Getting most out of LLMS
* improve
* try again
* Apply suggestions from code review
  Co-authored-by: Maria Khalusova <kafooster@gmail.com>
  Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update docs/source/en/llm_tutorial_optimization.md
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review
  Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* Apply suggestions from code review
* move file

---------

Co-authored-by: Maria Khalusova <kafooster@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

[WIP] Add LLM doc

435b929

patrickvonplaten marked this pull request as draft September 8, 2023 18:22

patrickvonplaten added 4 commits September 8, 2023 20:24

rename

65875ff

latex

5e23341

latex

7335fcd

Fix more latex

61ee20f

patrickvonplaten commented Sep 10, 2023


patrickvonplaten mentioned this pull request Sep 10, 2023

Latex/Katex \text{...} is not correctly rendered huggingface/doc-builder#397

Closed

[LLMs] Getting most out of LLMS

c692960

patrickvonplaten mentioned this pull request Sep 10, 2023

Blog post about how to optimize LLMs for memory and speed huggingface/blog#1473

Merged


patrickvonplaten added 2 commits September 15, 2023 14:53

improve

6a8706c

try again

771e48f

patrickvonplaten changed the title ~~[WIP] Add LLM doc~~ Add LLM doc Sep 25, 2023

patrickvonplaten marked this pull request as ready for review September 25, 2023 18:18

patrickvonplaten requested review from MKhalusova and younesbelkada September 25, 2023 18:18

younesbelkada approved these changes Sep 26, 2023


MKhalusova suggested changes Sep 26, 2023


patrickvonplaten commented Oct 11, 2023
