Add LLM doc #26058
Conversation
Doc will also be released with blog post: huggingface/blog#1473
younesbelkada
left a comment
Looking great thanks ! Happy to add a FA-2 section in a follow up PR !
MKhalusova
left a comment
Thanks for adding this doc! It's a valuable read with lots of detailed information on LLM inference optimizations. Here are some thoughts from me:
- Given the level of detail, and the narrative style, I think the doc would fit better under the Conceptual Guides section rather than in Tutorials.
- I would suggest changing the title of the doc as it is not very descriptive. Perhaps "LLM inference optimizations"/"Efficient LLM deployment"?
- It would also be great to shorten/simplify the guide where possible.
- There are some really cool gems in the doc that would be great to highlight somehow (bold font or using a `<Tip>`), e.g. "Therefore, inference time is often not reduced when using quantized weights, but rather increases".
- I have left some nits regarding style, issues with formula rendering, etc.
In this blog post, we will go over the most effective techniques at the time of writing this blog post to tackle these challenges for efficient LLM deployment:
1. **Lower Precision**: Research has shown that operating at reduced numerical precision, namely 8-bit and 4-bit, can achieve computational advantages without a considerable decline in model performance.
It would be great to have links to corresponding sections here for faster navigation.
Co-authored-by: Maria Khalusova <kafooster@gmail.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Thanks for the review @MKhalusova! Answering the bullet points directly in line:
- Happy to move it to the conceptual guides if it fits better - @gante wdyt?
- Yes, agree. Should be better now.
- I'd need more precise comments here. We could maybe also do this in a second pass. As a general guide on how to optimize LLMs for memory and speed, I'm not sure how I would simplify or shorten it.
- Guess this is also a style issue. Maybe this can be done in a second pass to align it more with other docs.
- Addressed them

@gante could you also take a look here?
MKhalusova
left a comment
Thanks for addressing the feedback!
gante
left a comment
Nice addition to our documentation 💪 I've added a few nits.
Regarding position in the TOC -- despite having a lot of code, this doc is also very technical (diving into equations, talking about GPU internals, ...), and thus I agree with @MKhalusova's suggestion of moving to Conceptual Guides.
We also already have an LLM intro in the tutorials, we should definitely add a reference to this doc there!
```py
flush()
```
It would be very cool to leave a link to Flash Attention 2 + transformers by the end of this section! (https://huggingface.co/docs/transformers/v4.34.0/en/perf_infer_gpu_one#flash-attention-2)
- 1. Quantize all weights to the target precision
- 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
- 3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
- 4. Quantize the weights again to the target precision after computation with their inputs.
I was revisiting this question with @younesbelkada, diving into bnb code, and we concluded that 4. only happens at train time :)
E.g. we can see here that the intermediary tensors are discarded if the gradients are not needed. We can also see the requantization in the backward function (here)
We should add this detail here and in the blog post as well :)
Yes that makes sense! Thanks for double-checking
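For readers following this thread, here is a minimal sketch of why step 4 is not needed at inference time. It assumes a toy absmax int8 scheme in pure Python; the function names (`quantize`, `dequantize`, `linear_inference`) are hypothetical and this is not the bitsandbytes implementation, which works on blocks of GPU tensors. The point it illustrates is only that the stored quantized weights are never modified during a forward pass, so the temporary dequantized copy can simply be discarded.

```python
# Toy absmax int8 quantization in pure Python.
# Illustrative assumption only; not how bitsandbytes is implemented.

def quantize(weights):
    """Step 1: map float weights to int8 codes plus a single float scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Step 3: rebuild approximate float weights for the computation."""
    return [c * scale for c in codes]

def linear_inference(codes, scale, x):
    """Steps 2-3 at inference: dequantize on the fly, compute, discard."""
    w = dequantize(codes, scale)  # temporary higher-precision copy
    y = sum(wi * xi for wi, xi in zip(w, x))
    # `w` is dropped when the function returns. `codes` was only read,
    # never written, so there is nothing to quantize again (no step 4).
    return y

weights = [0.5, -1.0, 0.25, 0.75]
x = [1.0, 2.0, 3.0, 4.0]

codes, scale = quantize(weights)
codes_before = list(codes)

y = linear_inference(codes, scale, x)
exact = sum(w * xi for w, xi in zip(weights, x))

print("quantized result:", y)
print("full-precision result:", exact)
print("stored codes unchanged:", codes == codes_before)
```

In a training setting the dequantized tensors may need to be kept or recomputed for the backward pass, which is where the requantization mentioned above comes in.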
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* [WIP] Add LLM doc
* rename
* latex
* latex
* Fix more latex
* [LLMs] Getting most out of LLMS
* improve
* try again
* Apply suggestions from code review
Co-authored-by: Maria Khalusova <kafooster@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update docs/source/en/llm_tutorial_optimization.md
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* Apply suggestions from code review
* move file
---------
Co-authored-by: Maria Khalusova <kafooster@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Optimize LLM tutorial
This PR adds the doc version of https://huggingface.co/blog/optimize-llm .
Given the many code snippets, I think that the tutorial is a nice addition to the Transformers docs.