[Docs] add quantization docs #2572

Closed
JamesSand wants to merge 15 commits into sgl-project:main from JamesSand:docs

Conversation

@JamesSand

@JamesSand JamesSand commented Dec 25, 2024

Contributor

Motivation

Following issue #2531, "[Feature] Add Docs For Quantization", this PR adds documentation for quantization. cc @zhaochenyang20

Modifications

This PR adds documentation for enabling online quantization and offline quantization using SGLang.
The modifications can be summarized as follows:

  1. Add the quantization document docs/backend/quantization.md.
  2. Modify docs/index.rst to include the quantization docs in the SGLang documentation.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhaochenyang20 zhaochenyang20 left a comment

Collaborator

Tell the users when we should prefer offline quantization over online. For example, Ke (@ispobock) told me that for DeepSeek V2.5 we should always quantize offline rather than online. Could you please let the users know when and why we should do this?

Also, tell the users that many models are already quantized on Hugging Face. They should first look for these officially quantized models rather than quantizing the model themselves.

Lastly, I suggest that this be an ipynb rather than a markdown file, since several of the examples can be run in a Jupyter notebook.

Thanks for the contribution, and merry Christmas.

Comment thread docs/index.rst Outdated
Comment thread docs/backend/quantization.md Outdated
@zhaochenyang20

Collaborator

Also, if a model is pre-quantized, we should not use --quantize, right?

@JamesSand

JamesSand commented Dec 26, 2024

Contributor Author

> Also, if a model is pre-quantized, we should not use --quantize, right?

Yes. If the model is already quantized, we should not add the `--quantization` argument.

@JamesSand

Contributor Author

Hi @zhaochenyang20, thank you for your valuable comments. I will update the documentation according to your suggestions.

> Tell the users when we should prefer offline quantization over online. For example, Ke (@ispobock) told me that for DeepSeek V2.5 we should always quantize offline rather than online. Could you please let the users know when and why we should do this?

I am not sure how to explain to users when they should use online or offline quantization. Are there any docs, blog posts, or articles that I can refer to? @zhaochenyang20 @ispobock

@zhaochenyang20

Collaborator

> Also, tell the users that many models are already quantized on Hugging Face. They should first look for these officially quantized models rather than quantizing the model themselves.

@JamesSand How about this? I don't see the docs for this right now.

@JamesSand

JamesSand commented Dec 26, 2024

Contributor Author

> > Also, tell the users that many models are already quantized on Hugging Face. They should first look for these officially quantized models rather than quantizing the model themselves.
>
> @JamesSand How about this? I don't see the docs for this right now.

I will work on this part.

@zhaochenyang20 zhaochenyang20 left a comment

Collaborator

I asked Ke for his suggestion. Thanks for the contribution, and merry Christmas.

Comment thread docs/backend/quantization.md Outdated
Comment thread docs/backend/quantization.md Outdated
Comment thread docs/backend/quantization.md
Comment thread docs/backend/quantization.md Outdated
Comment thread docs/backend/quantization.md
Comment thread docs/backend/quantization.md Outdated
@zhaochenyang20

Collaborator

> > > Also, tell the users that many models are already quantized on Hugging Face. They should first look for these officially quantized models rather than quantizing the model themselves.
> >
> > @JamesSand How about this? I don't see the docs for this right now.
>
> I will work on this part today.

Great. Thanks so much.

@ispobock

Collaborator

I think adding the `--quantization` arg cannot quantize the model weights online. You need to use a quantization tool to do the offline conversion. The dynamic quantization seems to apply only to activations. @HandH1998 Could you help verify this?

@HandH1998

Collaborator

`--quantization` only indicates the quantization method you want to use.
For certain quantization methods (but not all), an online weight quantization process runs when weights are loaded from a checkpoint that is not in the specified quantization format. For example, the code at this link demonstrates this for FP8 quantization. Online quantization is typically straightforward and quick; however, it is not recommended for now.
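
The loading behavior described in this comment can be sketched as a small decision function. This is only an illustration of the logic, not SGLang's code: `plan_weight_loading`, `ONLINE_CAPABLE`, and the returned strings are hypothetical names, and the set of methods with an online path is an assumption (the thread only confirms FP8).

```python
from typing import Optional

# Assumption: methods that can quantize weights while loading an unquantized
# checkpoint. The thread only confirms this for FP8.
ONLINE_CAPABLE = {"fp8"}


def plan_weight_loading(requested: Optional[str], checkpoint_format: Optional[str]) -> str:
    """Decide how weights would be loaded (hypothetical helper).

    requested: value passed via --quantization, or None if the flag is omitted.
    checkpoint_format: quantization format stored in the checkpoint, or None
        for a full-precision checkpoint.
    """
    if requested is None:
        # No flag given: rely on whatever the checkpoint itself declares.
        return "use-checkpoint-format"
    if checkpoint_format == requested:
        # Checkpoint already matches the requested method: nothing to convert.
        return "load-prequantized"
    if checkpoint_format is None and requested in ONLINE_CAPABLE:
        # Full-precision checkpoint and a method with an online path.
        return "quantize-on-load"
    raise ValueError(
        f"cannot apply {requested!r} to a checkpoint in format "
        f"{checkpoint_format!r}; convert it offline first"
    )
```

Under these assumptions, requesting a method without an online path on a full-precision checkpoint fails early, which matches the advice to convert such models offline.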

@zhyncs zhyncs added the dependencies Pull requests that update a dependency file label Dec 27, 2024
@zhaochenyang20

Collaborator

> `--quantization` only indicates the quantization method you want to use. For certain quantization methods (but not all), an online weight quantization process runs when weights are loaded from a checkpoint that is not in the specified quantization format. For example, the code at this link demonstrates this for FP8 quantization. Online quantization is typically straightforward and quick; however, it is not recommended for now.

@JamesSand Hey James. Have you included the discussions here?

@zhaochenyang20 zhaochenyang20 left a comment

Collaborator

Pretty cool!!!

Sorry for not saying this earlier: please strip the notebook outputs from the ipynb files in your commit. Install nbstripout with `pip install nbstripout`, then run `find . -name "*.ipynb" -exec nbstripout {} \;`.

Add this rule to /sglang/tree/docs/docs.

Also, keep only the ipynb and remove the markdown file.
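
For readers unfamiliar with what stripping a notebook means, here is a minimal sketch using only the standard library. It is not how nbstripout is implemented and covers less than that tool does; it assumes the nbformat 4 layout, where each code cell carries an `outputs` list and an `execution_count`. The helper name `strip_outputs` and the sample notebook are made up for illustration.

```python
import json
from typing import Any, Dict


def strip_outputs(notebook: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop logs, tracebacks, rendered data
            cell["execution_count"] = None  # serialized as null in the .ipynb
    return cleaned


# Illustrative notebook: one markdown cell and one executed code cell.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Quantization\n"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["server log\n"]}
            ],
            "source": ["print('server log')\n"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(example)
```

Markdown cells and cell sources are left untouched, so the committed notebook keeps its content but not the execution logs.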

"source": [
"Our team is working on supporting more quantization methods. We will soon support other quantization methods including but not limited to `[\"awq\", \"gptq\", \"marlin\", \"gptq_marlin\", \"awq_marlin\", \"bitsandbytes\", \"gguf\"]`\n",
"\n",
"We also support quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` in the command line to support this feature. For example, if you want to enable `int4wo-128` for model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:\n",

Collaborator

Give a link to torchao. Also, are you sure that this kind of online quantization is recommended?

Collaborator

Explain when to use torchao and how to use it correctly.

"source": [
"# Quantization\n",
"\n",
"`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.\n",

Collaborator

State here that we do not recommend online quantization (not for torchao); that's only for the activations.

"\n",
"`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.\n",
"\n",
"Please visit [here](https://huggingface.co/collections/neuralmagic) for some popular quantized LLMs on huggingface.\n",

Collaborator

This is pretty cool. But some models have official quantized versions. We'd better tell users to use the officially quantized model first, for example, llama xxx. If there is none, they can refer to xxx.

Collaborator

Official, third-party, and do-it-yourself.

"cell_type": "markdown",
"metadata": {},
"source": [
"Our team is working on supporting more quantization methods. We will soon support other quantization methods including but not limited to `[\"awq\", \"gptq\", \"marlin\", \"gptq_marlin\", \"awq_marlin\", \"bitsandbytes\", \"gguf\"]`\n",

Collaborator

Note that + xxxx

Collaborator

add ** **

Contributor

Isn't AWQ already supported in `--quantization`? I don't see how the activation stats are collected, though.

@Edenzzzz Edenzzzz Jan 6, 2025

Contributor

I guess it supports AWQ, but the model must be pre-quantized (offline) and include a quant_config.json

Comment on lines +106 to +113
"```bash\n",
"python3 -m sglang.launch_server \\\n",
" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --torchao-config int4wo-128 \\\n",
" --port 30000 --host 0.0.0.0\n",
"```\n",
"\n",
"which is equivalent to the following code block:"

Collaborator

Delete this.

Comment on lines +193 to +194
"Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `\"int8dq\"` method currently has some bugs when using together with cuda graph capture. So we suggest to disable cuda graph capture when using `\"int8dq\"` method. Namely, please use the following command:\n",
"\n",

Collaborator

(add --disable-cuda-graph)
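
The workaround discussed here can be sketched as a small helper that assembles the launch command and adds `--disable-cuda-graph` only for `int8dq`. The flags and module path are copied from the commands in the diff; the helper name `build_torchao_launch_args` and its defaults are hypothetical, for illustration only.

```python
from typing import List


def build_torchao_launch_args(
    model_path: str,
    torchao_config: str,
    host: str = "0.0.0.0",
    port: int = 30000,
) -> List[str]:
    """Assemble an sglang.launch_server command line (hypothetical helper)."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--torchao-config", torchao_config,
    ]
    # Per the linked issue, int8dq currently misbehaves with CUDA graph
    # capture, so capture is disabled for that config only.
    if torchao_config == "int8dq":
        args.append("--disable-cuda-graph")
    args += ["--port", str(port), "--host", host]
    return args


int8dq_args = build_torchao_launch_args(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", "int8dq"
)
int4_args = build_torchao_launch_args(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", "int4wo-128"
)
```

The resulting list can be joined with spaces for display or passed to `subprocess.run` as is.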

Comment on lines +195 to +202
"```bash\n",
"python3 -m sglang.launch_server \\\n",
" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --torchao-config int8dq \\\n",
" --disable-cuda-graph \\\n",
" --port 30000 --host 0.0.0.0\n",
"```"
]

Collaborator

Delete this.

--port 30000 --host 0.0.0.0
```

Note: If the model has already quantized offline, please **do not** add `--quantization` argument when starting the engine.

Collaborator

add ** ** to them all.

@zhaochenyang20

Collaborator

We do not recommend online quantization. Could you try to understand why and explain it? Do some exploration and maybe keep your notes here:

https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial

@zhaochenyang20

zhaochenyang20 commented Dec 28, 2024

Collaborator

Explain how to do the "right" quantization and how to choose between methods. Give a one-line explanation for each, with links.

"\n",
"## Online Quantization\n",
"\n",
"> Note: Although we support online quantization, we recommend users to use quantized models. \n",

@Edenzzzz Edenzzzz Jan 6, 2025

Contributor

Say "Although we support online quantization, users are advised to load offline quantized weights (official or quantized by more advanced methods)"?

@@ -0,0 +1,98 @@
# Quantization

`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.

@Edenzzzz Edenzzzz Jan 6, 2025

Contributor

Explanation:
"Online quantization computes weight scaling stats (max/min) dynamically at runtime, as exemplified by the delayed scaling in NVIDIA FP8 training. For inference, this quantizes the model once at load time.
Offline quantization saves pre-quantized model weights and loads them during inference. This is useful for methods that require pre-computed stats, such as AWQ, which collects activation stats from a calibration set."
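
To make the distinction concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization with an absmax scale. It is a toy illustration, not SGLang's or torchao's implementation: the function names, the sample weights, and the `saved_checkpoint` layout are all made up, and it does not model calibration-based methods such as AWQ, which need activation statistics that cannot be derived from the weights alone.

```python
from typing import List, Tuple


def quantize_absmax_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization with scale = max|w| / 127."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 values back to approximate floating-point weights."""
    return [q * scale for q in quantized]


# "Online": the scale is derived from the full-precision weights at load time.
fp_weights = [0.5, -1.27, 0.02, 1.0]
q_online, scale_online = quantize_absmax_int8(fp_weights)

# "Offline": the integers and scale were produced ahead of time and stored
# with the checkpoint, so loading only reads them back.
saved_checkpoint = {"qweight": q_online, "scale": scale_online}
restored = dequantize(saved_checkpoint["qweight"], saved_checkpoint["scale"])
```

With a weights-only statistic like absmax, both paths give the same result, and the offline path simply skips the computation at load time. The offline route becomes necessary once the statistics depend on data that is not available when the server starts.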

@zhaochenyang20

Collaborator

@JamesSand James, do you have time to confirm this? I think we are close to the end!

@zhaochenyang20

Collaborator

@Edenzzzz Thanks a lot!

@FlamingoPg FlamingoPg mentioned this pull request Feb 1, 2025
4 tasks
@Fridge003

Collaborator

Addressed by #3410


Labels

dependencies Pull requests that update a dependency file

8 participants