[Docs] add quantization docs #3253
|
I personally think we should keep only the markdown file, since the quantization ipynb would relaunch the kernel multiple times and quantization is not a basic feature. @shuaills |
|
@yinfan98 Fix lint and remove ipynb. Thanks |
| To do offline quantization for your model, firstly you need to install [llm-compressor](https://github.com/vllm-project/llm-compressor/) library: | ||
|
|
||
| ```bash | ||
| pip install llmcompressor | ||
| ``` | ||
|
|
||
| Here, we take quantize `meta-llama/Meta-Llama-3-8B-Instruct` to `FP8` as an example to elaborate on how to do offline quantization. | ||
|
|
Add an example of loading officially (or community) quantized weights, such as AWQ checkpoints?
|
|
||
| `SGLang` support various quantization methods, including online dynamic quantization and offline quantization. | ||
|
|
||
| Online quantization computes weight scaling stats(max/min) dynamically at runtime, as examplified by the delayed scaling in NVIDIA FP8 training. For inference this quantizes the model once on loading. |
Add a hyperlink to Nvidia delayed scaling example in the document?
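To make the "weight scaling stats (max/min)" wording concrete, here is a minimal sketch of per-tensor scaling from the observed absolute maximum. It assumes a symmetric scheme and the FP8 E4M3 finite range of ±448. The helper names `compute_scale` and `scale_and_clamp` are hypothetical, rounding to the FP8 grid is omitted, and this is not SGLang's actual kernel code.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def compute_scale(weights, qmax=FP8_E4M3_MAX):
    """Hypothetical helper: per-tensor scale from the observed absolute maximum."""
    amax = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, which would otherwise yield a zero scale.
    return amax / qmax if amax > 0 else 1.0


def scale_and_clamp(weights, scale, qmax=FP8_E4M3_MAX):
    """Hypothetical helper: map weights into the representable range.

    Rounding to the actual FP8 grid is intentionally left out of this sketch.
    """
    return [max(-qmax, min(qmax, w / scale)) for w in weights]


# Toy "weight tensor"; real weights would come from the loaded checkpoint.
weights = [0.5, -2.0, 1.25, 0.0]
scale = compute_scale(weights)            # stats computed once, at load time
scaled = scale_and_clamp(weights, scale)  # values now lie within [-448, 448]
restored = [q * scale for q in scaled]    # dequantize by multiplying back
```

Because the scale is derived from the weights themselves, it can be computed once when the model is loaded, which matches the "quantizes the model once on loading" description in the doc line above.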
|
|
||
| ## Online Quantization | ||
|
|
||
| > Note: Although we support online quantization, we recommend users to use quantized models. |
Maybe this can be changed to "we recommend users to use offline quantized models" to avoid confusion.
| --port 30000 --host 0.0.0.0 | ||
| ``` | ||
|
|
||
| **Note: If the model has already quantized offline, please **do not** add `--quantization` argument when starting the engine.** |
Might be "If the model has already been quantized offline"
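The rule in that note can be sketched as a small command builder: pass `--quantization` only when requesting online quantization, and omit it for a checkpoint that was already quantized offline. The function `build_launch_command`, the `sglang.launch_server` entry point with `--model-path`, the `fp8` value, and the placeholder model path are assumptions for illustration and are not taken from the diff shown here.

```python
def build_launch_command(model_path, online_quantization=None,
                         host="0.0.0.0", port=30000):
    """Hypothetical helper: assemble the server launch command as an argv list."""
    # Assumed entry point and flag name for the model location.
    cmd = ["python3", "-m", "sglang.launch_server", "--model-path", model_path]
    # Add --quantization only for online quantization; an offline-quantized
    # checkpoint must be launched without it.
    if online_quantization is not None:
        cmd += ["--quantization", online_quantization]
    cmd += ["--port", str(port), "--host", host]
    return cmd


# Online quantization of an unquantized checkpoint ("fp8" is an assumed value).
online = build_launch_command("meta-llama/Meta-Llama-3-8B-Instruct",
                              online_quantization="fp8")

# Checkpoint already quantized offline (placeholder path): no --quantization.
offline = build_launch_command("path/to/offline-quantized-model")

print(" ".join(online))
print(" ".join(offline))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting.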
|
Thanks @Edenzzzz @Fridge003 |
|
Addressed by #3410 |
Motivation
Add docs for quantization. This PR is a revised version of the previous one: #2572. cc: @zhaochenyang20
Modifications
This PR adds documentation for enabling online quantization and offline quantization using SGLang.
The modifications can be summarized as follows:
Add the quantization document docs/backend/quantization.md.
Modify docs/index.rst to include the quantization docs in the SGLang documentation.
Checklist