Add LLM doc #26058

Merged
patrickvonplaten merged 16 commits into main from add_llm_improve on Oct 16, 2023

Conversation

@patrickvonplaten patrickvonplaten commented Sep 8, 2023

Contributor

Optimize LLM tutorial

This PR adds the doc version of https://huggingface.co/blog/optimize-llm .
Given the many code snippets, I think that the tutorial is a nice addition to the Transformers docs.

@patrickvonplaten patrickvonplaten marked this pull request as draft September 8, 2023 18:22
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
@patrickvonplaten

Contributor Author

Doc will also be released with blog post: huggingface/blog#1473

@patrickvonplaten patrickvonplaten changed the title from "[WIP] Add LLM doc" to "Add LLM doc" Sep 25, 2023

HuggingFaceDocBuilderDev commented Sep 25, 2023

The documentation is not available anymore as the PR was closed or merged.

@patrickvonplaten patrickvonplaten marked this pull request as ready for review September 25, 2023 18:18

@younesbelkada younesbelkada left a comment

Contributor

Looking great, thanks! Happy to add a FA-2 section in a follow-up PR!

Comment thread docs/source/en/llm_tutorial_optimization.md Outdated

@MKhalusova MKhalusova left a comment

Contributor

Thanks for adding this doc! It's a valuable read with lots of detailed information on LLM inference optimizations. Here are some thoughts from me:

  • Given the level of detail, and the narrative style, I think the doc would fit better under the Conceptual Guides section rather than in Tutorials.
  • I would suggest changing the title of the doc as it is not very descriptive. Perhaps "LLM inference optimizations"/"Efficient LLM deployment"?
  • It would also be great to shorten/simplify the guide where possible.
  • There are some really cool gems in the doc that would be great to highlight somehow (bold font or using a <Tip>), e.g. "Therefore, inference time is often not reduced when using quantized weights, but rather increases".
  • I have left some nits regarding style, issues with formula rendering, etc.

Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated

In this blog post, we will go over the most effective techniques at the time of writing this blog post to tackle these challenges for efficient LLM deployment:

1. **Lower Precision**: Research has shown that operating at reduced numerical precision, namely 8-bit and 4-bit, can achieve computational advantages without a considerable decline in model performance.
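
To make the effect of lower precision concrete, here is a minimal back-of-the-envelope sketch of how much memory the weights alone occupy at different bit widths. The helper name `weight_memory_gb` and the 7-billion-parameter model size are hypothetical, chosen only for illustration. The estimate ignores activations, the key-value cache, and any quantization constants stored alongside the weights.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights only, in GB (10^9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Hypothetical model size, used purely as an example.
NUM_PARAMS = 7_000_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Halving the bit width halves the weight memory, which is why 8-bit and 4-bit loading can make a model fit on a smaller GPU.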

Contributor

It would be great to have links to corresponding sections here for faster navigation.

Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
patrickvonplaten and others added 2 commits October 11, 2023 19:24
Co-authored-by: Maria Khalusova <kafooster@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/_toctree.yml Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
@patrickvonplaten

Contributor Author

Thanks for the review @MKhalusova!

Answering the bullet points directly in line:

> Given the level of detail, and the narrative style, I think the doc would fit better under the Conceptual Guides section rather than in Tutorials.

Happy to move it to the conceptual guides if it fits better - @gante wdyt?

> I would suggest changing the title of the doc as it is not very descriptive. Perhaps "LLM inference optimizations"/"Efficient LLM deployment"?

Yes, agree. Should be better now.

> It would also be great to shorten/simplify the guide where possible.

I'd need more precise comments here. We could maybe also do this in a second pass. As a general guide on how to optimize LLMs for memory and speed, I'm not sure how I would simplify or shorten it.

> There are some really cool gems in the doc that would be great to highlight somehow (bold font or using a <Tip>), e.g. "Therefore, inference time is often not reduced when using quantized weights, but rather increases".

Guess this is also a style issue. Maybe this can be done in a second pass to align it more with other docs.

> I have left some nits regarding style, issues with formula rendering, etc.

Addressed them.

@patrickvonplaten

Contributor Author

@gante could you also take a look here?

@MKhalusova MKhalusova left a comment

Contributor

Thanks for addressing the feedback!

@gante gante left a comment

Contributor

Nice addition to our documentation 💪 I've added a few nits.

Regarding position in the TOC -- despite having a lot of code, this doc is also very technical (diving into equations, talking about GPU internals, ...), and thus I agree with @MKhalusova's suggestion of moving to Conceptual Guides.

We also already have an LLM intro in the tutorials, so we should definitely add a reference to this doc there!

Comment thread docs/source/en/llm_tutorial_optimization.md Outdated

```py
flush()
```

Contributor

It would be very cool to leave a link to Flash Attention 2 + transformers by the end of this section! (https://huggingface.co/docs/transformers/v4.34.0/en/perf_infer_gpu_one#flash-attention-2)

- 1. Quantize all weights to the target precision
- 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
- 3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
- 4. Quantize the weights again to the target precision after computation with their inputs.

Contributor

I was revisiting this question with @younesbelkada, diving into the bnb code, and we concluded that step 4 only happens at training time :)

E.g. we can see here that the intermediary tensors are discarded if the gradients are not needed. We can also see the requantization in the backward function (here)

We should add this detail here and in the blog post as well :)
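
For readers following along, the inference-time flow discussed here can be sketched in plain Python. This is only an illustration under stated assumptions: it uses simple absmax integer quantization with one scale per weight vector, ordinary Python floats stand in for bfloat16, and the function names are hypothetical. It is not how bitsandbytes implements it.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Step 1: map float weights to signed integers using a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    absmax = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = absmax / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(qweights: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in qweights]


def quantized_dot(qweights: List[int], scale: float, inputs: List[float]) -> float:
    """Steps 2 and 3: take the stored quantized weights, dequantize on the fly,
    and compute in higher precision. The dequantized copy is dropped on return,
    and the stored integers are never modified, so nothing is requantized."""
    weights = dequantize(qweights, scale)
    return sum(w * x for w, x in zip(weights, inputs))


weights = [0.5, -1.0, 0.25, 0.75]
inputs = [1.0, 2.0, 3.0, 4.0]

qweights, scale = quantize_absmax(weights)
approx = quantized_dot(qweights, scale, inputs)
exact = sum(w * x for w, x in zip(weights, inputs))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In this sketch the quantized weights are computed once and then only read, which matches the observation that no requantization step is needed when gradients are not required.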

Contributor Author

Yes that makes sense! Thanks for double-checking

Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
Comment thread docs/source/en/llm_tutorial_optimization.md Outdated
@patrickvonplaten patrickvonplaten merged commit 805d5d2 into main Oct 16, 2023
@patrickvonplaten patrickvonplaten deleted the add_llm_improve branch October 16, 2023 14:09
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* [WIP] Add LLM doc

* rename

* latex

* latex

* Fix more latex

* [LLMs] Getting most out of LLMS

* improve

* try again

* Apply suggestions from code review

Co-authored-by: Maria Khalusova <kafooster@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/en/llm_tutorial_optimization.md

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Apply suggestions from code review

* move file

---------

Co-authored-by: Maria Khalusova <kafooster@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
5 participants