[Docs] add quantization docs #2572
zhaochenyang20
left a comment
Tell users when they should prefer offline quantization over online. For example, Ke (@ispobock) told me that for DeepSeek V2.5 we should always use offline rather than online quantization. Could you please let users know when and why we should do this?
Also, tell users that many models are already available in quantized form on Hugging Face. They should first look for these officially quantized models rather than quantizing the model themselves.
Lastly, I suggest that this should be an ipynb rather than a markdown file, since several of the examples can be run in a Jupyter notebook.
Thanks for the contribution, and merry Christmas.
Also, if a model is pre quantized, we should not use |
Yes, if model is already quantized, we should not add |
Hi @zhaochenyang20, thank you for your valuable comments. I will update the documentation according to your suggestions.
I am not sure how to explain to users when they should use online or offline quantization. Are there any docs, blog posts, or articles that I can refer to? @zhaochenyang20 @ispobock
@JamesSand How about this? I don't see the docs for this right now. |
I will work on this part. |
zhaochenyang20
left a comment
I asked Ke for his suggestion. Thanks for the contribution, and merry Christmas.
Great. Thanks so much. |
I think adding the |
@JamesSand Hey James. Have you included the discussions here? |
Pretty cool!!!
Sorry for not saying this previously: please strip the notebook outputs from your commit. Install nbstripout with `pip install nbstripout`, then run `find . -name "*.ipynb" -exec nbstripout {} \;`.
Add this rule to /sglang/tree/docs/docs.
Also, only keep ipynb and remove markdown.
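For reference, the cleanup requested above amounts to clearing the `outputs` and `execution_count` fields of each code cell. The sketch below only illustrates that idea with the standard library; `strip_notebook_outputs` is a hypothetical helper, not nbstripout's actual implementation, and nbstripout itself handles more cases (metadata, for example).

```python
import json
from pathlib import Path


def strip_notebook_outputs(path: Path) -> int:
    """Clear outputs and execution counts from code cells.

    Returns the number of code cells that were changed.
    Hypothetical helper for illustration; prefer nbstripout in practice.
    """
    nb = json.loads(path.read_text(encoding="utf-8"))
    changed = 0
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells carry no outputs
        if cell.get("outputs") or cell.get("execution_count") is not None:
            cell["outputs"] = []
            cell["execution_count"] = None
            changed += 1
    # Write the cleaned notebook back in place.
    path.write_text(json.dumps(nb, indent=1) + "\n", encoding="utf-8")
    return changed
```

Running it a second time on the same file should report zero changed cells, which makes it easy to check that a commit is clean.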
| "source": [ | ||
| "Our team is working on supporting more quantization methods. We will soon support other quantization methods including but not limited to `[\"awq\", \"gptq\", \"marlin\", \"gptq_marlin\", \"awq_marlin\", \"bitsandbytes\", \"gguf\"]`\n", | ||
| "\n", | ||
| "We also support quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` in the command line to support this feature. For example, if you want to enable `int4wo-128` for model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:\n", |
Give a link to torchao. Also, are you sure that this kind of online quantization is recommended?
Explain when to use torchao and how to use it correctly.
| "source": [ | ||
| "# Quantization\n", | ||
| "\n", | ||
| "`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.\n", |
State here that we do not recommend online quantization (not for torchao); that's only for the activations.
| "\n", | ||
| "`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.\n", | ||
| "\n", | ||
| "Please visit [here](https://huggingface.co/collections/neuralmagic) for some popular quantized LLMs on huggingface.\n", |
This is pretty cool. But some models have official quantized versions. We should tell users to prefer the officially quantized model, for example, llama xxx; if there is none, they can refer to xxx.
Official, third-party, and do-it-yourself.
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "Our team is working on supporting more quantization methods. We will soon support other quantization methods including but not limited to `[\"awq\", \"gptq\", \"marlin\", \"gptq_marlin\", \"awq_marlin\", \"bitsandbytes\", \"gguf\"]`\n", |
Isn't AWQ already supported in --quantization? Don't see how activation stats are collected though
I guess it supports AWQ, but the model must be pre-quantized (offline) and include a quant_config.json
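To make the "already quantized, so do not pass `--quantization`" rule concrete, here is a rough sketch of how a script could check a local checkpoint before launching. It assumes the checkpoint records its quantization either under a `quantization_config` key in `config.json` (the usual Hugging Face layout) or in a separate `quant_config.json`; both helper names are hypothetical, and this is not how SGLang itself detects quantized models.

```python
import json
from pathlib import Path
from typing import List, Optional


def detect_prequantized(model_dir: Path) -> Optional[str]:
    """Return the quantization method recorded in a local checkpoint, or None.

    Assumption: quantized checkpoints carry either a `quantization_config`
    entry in config.json or a standalone quant_config.json file.
    """
    config_path = model_dir / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        quant_cfg = config.get("quantization_config")
        if isinstance(quant_cfg, dict):
            return str(quant_cfg.get("quant_method", "unknown"))
    if (model_dir / "quant_config.json").is_file():
        return "unknown"  # quantized, but the method is not inspected here
    return None


def quantization_args(model_dir: Path, requested: str) -> List[str]:
    """Only pass --quantization when the checkpoint is not already quantized."""
    if detect_prequantized(model_dir) is not None:
        return []
    return ["--quantization", requested]
```

The returned list can be appended to the launch command; an empty list means the weights are loaded as they are.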
| "```bash\n", | ||
| "python3 -m sglang.launch_server \\\n", | ||
| " --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", | ||
| " --torchao-config int4wo-128 \\\n", | ||
| " --port 30000 --host 0.0.0.0\n", | ||
| "```\n", | ||
| "\n", | ||
| "which is equivalent to the following code block:" |
| "Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `\"int8dq\"` method currently has some bugs when using together with cuda graph capture. So we suggest to disable cuda graph capture when using `\"int8dq\"` method. Namely, please use the following command:\n", | ||
| "\n", |
(add --disable-cuda-graph)
| "```bash\n", | ||
| "python3 -m sglang.launch_server \\\n", | ||
| " --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", | ||
| " --torchao-config int8dq \\\n", | ||
| " --disable-cuda-graph \\\n", | ||
| " --port 30000 --host 0.0.0.0\n", | ||
| "```" | ||
| ] |
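For anyone scripting these launches, the rule in the note above (add `--disable-cuda-graph` whenever the torchao config is `int8dq`) can be sketched as a small helper. `build_launch_command` is a hypothetical name, and the only flags used are the ones already shown in this diff; treat it as an illustration rather than part of SGLang.

```python
from typing import List, Optional


def build_launch_command(
    model_path: str,
    torchao_config: Optional[str] = None,
    host: str = "0.0.0.0",
    port: int = 30000,
) -> List[str]:
    """Assemble the sglang.launch_server argv used in the examples above."""
    cmd = ["python3", "-m", "sglang.launch_server", "--model-path", model_path]
    if torchao_config:
        cmd += ["--torchao-config", torchao_config]
        # Workaround from the linked issue: int8dq with CUDA graph capture
        # is reported as buggy, so capture is disabled for that config only.
        if torchao_config == "int8dq":
            cmd.append("--disable-cuda-graph")
    cmd += ["--port", str(port), "--host", host]
    return cmd
```

The list can be passed directly to `subprocess.run`, which avoids shell quoting problems with model paths.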
| --port 30000 --host 0.0.0.0 | ||
| ``` | ||
| Note: If the model has already quantized offline, please **do not** add `--quantization` argument when starting the engine. |
add ** ** to them all.
We do not recommend online quantization. Could you try to understand why and explain it? Do some exploration and maybe keep your notes here.
Explain how to do the "right" quantization and how to choose between methods: a one-line explanation for each, with links.
| "\n", | ||
| "## Online Quantization\n", | ||
| "\n", | ||
| "> Note: Although we support online quantization, we recommend users to use quantized models. \n", |
Say "Although we support online quantization, users are advised to load offline quantized weights (official or quantized by more advanced methods)"?
| @@ -0,0 +1,98 @@ | |||
| # Quantization | |||
| `SGLang` support various quantization methods, including online dynamic quantization and offline quantization. | |||
Explanation:
"Online quantization computes weight scaling stats (max/min) dynamically at runtime, as exemplified by the delayed scaling in NVIDIA FP8 training. For inference, this quantizes the model once at load time.
Offline quantization saves pre-quantized model weights and loads them for inference. This is useful for methods that require pre-computed stats, such as AWQ, which collects activation stats from the pre-training set."
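A toy sketch may help readers see the difference described above. It uses plain symmetric absmax int8 quantization on a Python list, chosen only for brevity; it is not what SGLang, torchao, or AWQ actually do, and all names here are made up for illustration.

```python
from typing import List, Tuple


def quantize_absmax_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization with the absolute max as the stat."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map int8 values back to approximate floating-point weights."""
    return [v * scale for v in q]


# "Online": full-precision weights are loaded and the scaling stat is
# computed on the spot, once, when the model is loaded.
fp_weights = [0.5, -1.0, 0.3, 0.0]
q_online, scale_online = quantize_absmax_int8(fp_weights)

# "Offline": the quantized values and the scale were computed earlier and
# saved with the checkpoint; loading reads them back without computing stats.
saved_checkpoint = {"q": q_online, "scale": scale_online}
restored = dequantize(saved_checkpoint["q"], saved_checkpoint["scale"])
```

Methods that need calibration data cannot be reduced to the "online" path in this sketch, because the stat would depend on activations that are not available at load time.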
@JamesSand James, do you have time to confirm this? I think we are close to the end!
@Edenzzzz Thanks a lot! |
Addressed by #3410 |
Motivation
According to issue "[Feature] Add Docs For Quantization" #2531, this PR adds documentation for quantization. cc @zhaochenyang20
Modifications
This PR adds documentation for enabling online quantization and offline quantization using SGLang.
The modifications can be summarized as follows:
docs/backend/quantization.md
docs/index.rst, to include the quantization docs in the SGLang documentation.
Checklist