[Docs] add quantization docs by JamesSand · Pull Request #2572 · sgl-project/sglang

JamesSand · 2024-12-25T12:59:54Z

Motivation

Following issue "[Feature] Add Docs For Quantization" #2531, this PR adds documentation for quantization. cc @zhaochenyang20

Modifications

This PR adds documentation for enabling online quantization and offline quantization using SGLang.
The modifications can be summarized as follows:

Add the quantization document docs/backend/quantization.md.
Modify docs/index.rst to include the quantization docs in the SGLang documentation.

Checklist

Format your code according to the Contributor Guide.
Add unit tests as outlined in the Contributor Guide.
Update documentation as needed, including docstrings or example tutorials.

zhaochenyang20

Tell users when offline quantization is preferable to online quantization. For example, Ke @ispobock told me that for DeepSeek V2.5 we should always quantize offline rather than online. Could you please let users know when and why to do this?

Also, tell users that many models are already quantized on Hugging Face. They should first look for these officially quantized models rather than quantizing the models themselves.

Lastly, I suggest making this an ipynb rather than a markdown file, since several of the steps can be run in a Jupyter notebook.

Thanks for the contribution, and merry Christmas.

zhaochenyang20 · 2024-12-25T20:00:56Z

Also, if a model is pre-quantized, we should not use --quantization, right?

JamesSand · 2024-12-26T02:44:41Z

Also, if a model is pre-quantized, we should not use --quantization, right?

Yes, if the model is already quantized, we should not add the --quantization argument.
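
A rough sketch of how that rule could be checked before launching. It assumes the common Hugging Face convention that a pre-quantized checkpoint carries a `quantization_config` entry in its `config.json`. The helpers `is_prequantized` and `build_launch_args` are hypothetical names used only for illustration, not part of SGLang, and `fp8` is just an example method:

```python
import json
import tempfile
from pathlib import Path
from typing import List


def is_prequantized(model_dir: str) -> bool:
    """Return True if config.json declares a quantization_config (assumed HF convention)."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return False
    with config_path.open() as f:
        config = json.load(f)
    return "quantization_config" in config


def build_launch_args(model_dir: str, online_method: str = "fp8") -> List[str]:
    """Build a launch command, adding --quantization only for non-quantized checkpoints."""
    args = ["python3", "-m", "sglang.launch_server", "--model-path", model_dir]
    if not is_prequantized(model_dir):
        # Only request a quantization method when the checkpoint is not already quantized.
        args += ["--quantization", online_method]
    return args


# Demo with two throwaway local directories standing in for downloaded checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    quant_dir = Path(tmp) / "quantized"
    plain_dir = Path(tmp) / "plain"
    quant_dir.mkdir()
    plain_dir.mkdir()
    (quant_dir / "config.json").write_text(
        json.dumps({"quantization_config": {"quant_method": "awq"}})
    )
    (plain_dir / "config.json").write_text(json.dumps({"model_type": "llama"}))

    quant_args = build_launch_args(str(quant_dir))
    plain_args = build_launch_args(str(plain_dir))
```

Here `quant_args` leaves the flag out while `plain_args` ends with `--quantization fp8`. Checkpoints that store their quantization metadata elsewhere would need a different check.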

JamesSand · 2024-12-26T03:15:13Z

Hi @zhaochenyang20, thank you for your valuable comments. I will update the documentation according to your suggestions.

Tell users when offline quantization is preferable to online quantization. For example, Ke @ispobock told me that for DeepSeek V2.5 we should always quantize offline rather than online. Could you please let users know when and why to do this?

I am not sure how to explain to users when they should use online quantization and when they should use offline quantization. Are there any docs, blog posts, or articles that I can refer to? @zhaochenyang20 @ispobock

zhaochenyang20 · 2024-12-26T03:31:54Z

Also, tell users that many models are already quantized on Hugging Face. They should first look for these officially quantized models rather than quantizing the models themselves.

@JamesSand How about this? I don't see the docs for this right now.

JamesSand · 2024-12-26T03:37:52Z

Also, tell users that many models are already quantized on Hugging Face. They should first look for these officially quantized models rather than quantizing the models themselves.

@JamesSand How about this? I don't see the docs for this right now.

I will work on this part.

zhaochenyang20

I asked Ke for his suggestion. Thanks for the contribution, and merry Christmas.

zhaochenyang20 · 2024-12-26T03:42:18Z

Also, tell users that many models are already quantized on Hugging Face. They should first look for these officially quantized models rather than quantizing the models themselves.

@JamesSand How about this? I don't see the docs for this right now.

I will work on this part today.

Great. Thanks so much.

ispobock · 2024-12-26T16:12:44Z

I don't think adding the --quantization arg can quantize the model weights online. You need a quantization tool to do the offline conversion. Dynamic quantization seems to apply only to activations. @HandH1998 Could you help verify this?

HandH1998 · 2024-12-27T03:50:38Z

--quantization only indicates the quantization method you want to use.
For certain quantization methods, but not all, weights are quantized online while loading a checkpoint that is not already in the specified quantization format. For example, the code at this link demonstrates this for FP8 quantization. However, online quantization is typically a simple, quick procedure, and it is not recommended for now.

zhaochenyang20 · 2024-12-28T04:44:28Z

--quantization only indicates the quantization method you want to use. For certain quantization methods, but not all, weights are quantized online while loading a checkpoint that is not already in the specified quantization format. For example, the code at this link demonstrates this for FP8 quantization. However, online quantization is typically a simple, quick procedure, and it is not recommended for now.

@JamesSand Hey James. Have you included the discussions here?

zhaochenyang20

Pretty cool!!!

Sorry for not saying this earlier: please strip the notebook outputs from the ipynb files in your commit. Run pip install nbstripout, then find . -name "*.ipynb" -exec nbstripout {} \;

Add this rule to /sglang/tree/docs/docs.

Also, only keep ipynb and remove markdown.

zhaochenyang20 · 2024-12-28T04:46:36Z

+   "source": [
+    "Our team is working on supporting more quantization methods. We will soon support other quantization methods including but not limited to `[\"awq\", \"gptq\", \"marlin\", \"gptq_marlin\", \"awq_marlin\", \"bitsandbytes\", \"gguf\"]`\n",
+    "\n",
+    "We also support quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` in the command line to support this feature. For example, if you want to enable `int4wo-128` for model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:\n",


Give a link to torchao. Also, are you sure that this kind of online quantization is recommended?

Explain when to use torchao and how to use it correctly.

zhaochenyang20 · 2024-12-28T04:47:48Z

+   "source": [
+    "# Quantization\n",
+    "\n",
+    "`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.\n",


State here that we do not recommend online quantization (torchao excepted); it only applies to the activations.

zhaochenyang20 · 2024-12-28T04:49:14Z

+    "\n",
+    "`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.\n",
+    "\n",
+    "Please visit [here](https://huggingface.co/collections/neuralmagic) for some popular quantized LLMs on huggingface.\n",


This is pretty cool. But some models have official quantized versions. We'd better say to use the officially quantized model, for example, llama xxx. If there is none, you can refer to xxx.

Order the options as: official, third party, and do it yourself.

zhaochenyang20 · 2024-12-28T04:51:19Z

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Our team is working on supporting more quantization methods. We will soon support other quantization methods including but not limited to `[\"awq\", \"gptq\", \"marlin\", \"gptq_marlin\", \"awq_marlin\", \"bitsandbytes\", \"gguf\"]`\n",


Note that + xxxx

Isn't AWQ already supported in --quantization? Don't see how activation stats are collected though

I guess it supports AWQ, but the model must be pre-quantized (offline) and include a quant_config.json

zhaochenyang20 · 2024-12-28T04:51:39Z

+    "```bash\n",
+    "python3 -m sglang.launch_server \\\n",
+    "    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
+    "    --torchao-config int4wo-128 \\\n",
+    "    --port 30000 --host 0.0.0.0\n",
+    "```\n",
+    "\n",
+    "which is equivalent to the following code block:"


Delete this.

zhaochenyang20 · 2024-12-28T04:53:28Z

+    "Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `\"int8dq\"` method currently has some bugs when using together with cuda graph capture. So we suggest to disable cuda graph capture when using `\"int8dq\"` method. Namely, please use the following command:\n",
+    "\n",


(add --disable-cuda-graph)

zhaochenyang20 · 2024-12-28T04:53:38Z

+    "```bash\n",
+    "python3 -m sglang.launch_server \\\n",
+    "    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
+    "    --torchao-config int8dq \\\n",
+    "    --disable-cuda-graph \\\n",
+    "    --port 30000 --host 0.0.0.0\n",
+    "```"
+   ]


Delete this.

zhaochenyang20 · 2024-12-28T04:54:41Z

+    --port 30000 --host 0.0.0.0
+```
+
+Note: If the model has already quantized offline, please **do not** add `--quantization` argument when starting the engine.


add ** ** to them all.

zhaochenyang20 · 2024-12-28T05:13:59Z

We do not recommend online quantization. Could you try to understand why and explain it? Do some exploration and maybe keep your notes here:

https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial

zhaochenyang20 · 2024-12-28T05:17:30Z

How to do the "right" quantization. And how to choose between methods. One-line explanation and give links.

Edenzzzz · 2025-01-06T06:53:42Z

+    "\n",
+    "## Online Quantization\n",
+    "\n",
+    "> Note: Although we support online quantization, we recommend users to use quantized models. \n",


Say "Although we support online quantization, users are advised to load offline quantized weights (official or quantized by more advanced methods)"?

Edenzzzz · 2025-01-06T07:19:45Z

@@ -0,0 +1,98 @@
+# Quantization
+
+`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.


Explanation:
"Online quantization computes weight scaling stats (max/min) dynamically at runtime, as exemplified by the delayed scaling in NVIDIA FP8 training. For inference, this quantizes the model once at load time.
Offline quantization saves pre-quantized model weights and loads them at inference time. This is useful for methods that require pre-computed stats, such as AWQ, which collects activation stats from the pre-training set."
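
To make the distinction concrete, here is a minimal sketch, assuming symmetric per-tensor int8 quantization with an abs-max scale. This is one simple scheme chosen for illustration, not necessarily what SGLang or any particular method does, and the function names and the `checkpoint` dictionary are hypothetical:

```python
from typing import List, Tuple


def absmax_quantize(weights: List[float], num_bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization; the scale is derived from max |w|."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map integer levels back to approximate floating-point weights."""
    return [q * scale for q in quantized]


# "Online": the stats (here max |w|) are computed once, when the
# full-precision weights are loaded.
weights = [0.5, -1.27, 0.02, 1.0]
q_online, scale_online = absmax_quantize(weights)

# "Offline": the integer weights and scale were produced ahead of time by a
# separate tool and stored; loading only reads them back.
checkpoint = {"qweight": q_online, "scale": scale_online}
restored = dequantize(checkpoint["qweight"], checkpoint["scale"])
```

The offline path is where methods that need calibration data fit, since whatever produced `checkpoint` can take as long as it needs before serving starts.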

zhaochenyang20 · 2025-01-06T16:45:26Z

@JamesSand james, do you have time to confirm this? I think we are close to the end!

zhaochenyang20 · 2025-01-06T16:46:12Z

@Edenzzzz Thanks a lot!

Fridge003 · 2025-02-21T09:15:38Z

Addressed by #3410

JamesSand added 2 commits December 25, 2024 12:44

[docs] add quantization docs

8b809d8

[Docs]: add quantization docs. Passed pre-commit test

e066db5

zhaochenyang20 requested changes Dec 25, 2024

Comment thread docs/index.rst Outdated

Tushar-ml reviewed Dec 25, 2024

Comment thread docs/backend/quantization.md Outdated

Merge branch 'main' into docs

1e188de

[Docs] fix --quantization args in qunatize docs

ee9a459

Merge branch 'main' into docs

dd3b3c9

zhaochenyang20 requested changes Dec 26, 2024

merrymercy assigned zhaochenyang20 Dec 26, 2024

ispobock assigned zhyncs and HandH1998 Dec 26, 2024

JamesSand added 7 commits December 27, 2024 05:49

[Docs] fix comments from Chenyang. Before grammar check

5ac7a83

[Docs] grammar checked quant docs

d37a875

[Docs] add links to huggingface quantized models

d39f2d2

Merge branch 'main' into docs

2637867

[Docs] add notebook version docs for quant

fdd0be2

[Docs] add notebook version quant docs

ea3a969

[Docs] add quant notebook docs to index.rst

0fb5bec

zhyncs added the dependencies Pull requests that update a dependency file label Dec 27, 2024

Merge branch 'main' into docs

faed66e

zhaochenyang20 requested changes Dec 28, 2024

Merge branch 'main' into docs

2278c12

Edenzzzz reviewed Jan 6, 2025

Merge branch 'main' into docs

4cfab09

FlamingoPg mentioned this pull request Feb 1, 2025

[Docs] add quantization docs #3253

Closed

4 tasks

Fridge003 closed this Feb 21, 2025

functionstackx mentioned this pull request May 18, 2026

[Bug] Qwen-3.5 on B300 (sm_103) crashes in flash-attn-4 cute kernel — assertion at flash_fwd_sm100.py:162 (fix exists in Dao-AILab/flash-attention#2572; sglang needs to bump flash-attn-4) #25564

Closed
