[Docs] add quantization docs by FlamingoPg · Pull Request #3253 · sgl-project/sglang

FlamingoPg · 2025-02-01T15:06:16Z

Motivation

Add docs for quantization. This PR is a revision of the previous one: #2572. cc: @zhaochenyang20

Modifications

This PR adds documentation for enabling online quantization and offline quantization using SGLang.
The modifications can be summarized as follows:

Add the quantization document docs/backend/quantization.md.
Modify docs/index.rst to include the quantization docs in the SGLang documentation.

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling.

zhaochenyang20 · 2025-02-01T17:57:38Z

I personally think we should keep only the markdown file, since the quantization ipynb would relaunch the kernel multiple times, and this is not a basic feature. @shuaills

zhaochenyang20 · 2025-02-01T18:13:38Z

@yinfan98 Fix lint and remove the ipynb. Thanks.

Edenzzzz · 2025-02-04T04:44:20Z

+To do offline quantization for your model, firstly you need to install [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
+
+```bash
+pip install llmcompressor
+```
+
+Here, we take quantize `meta-llama/Meta-Llama-3-8B-Instruct` to `FP8` as an example to elaborate on how to do offline quantization.
+


Add an example of loading official (or community) quantized weights, such as AWQ checkpoints?

Fridge003 · 2025-02-04T19:12:21Z

+
+`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.
+
+Online quantization computes weight scaling stats(max/min) dynamically at runtime, as examplified by the delayed scaling in NVIDIA FP8 training. For inference this quantizes the model once on loading.


Add a hyperlink to the NVIDIA delayed scaling example in the document?

https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Mixed-precision-training-with-FP8
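
For readers unfamiliar with max-based scaling, here is a minimal Python sketch of the idea behind the quoted sentence: derive one scale from the largest absolute weight, then quantize once. It is not SGLang's implementation. It assumes a symmetric int8 grid in place of FP8 to keep the arithmetic easy to follow, and the names `quantize_absmax` and `dequantize` are hypothetical.

```python
from typing import List, Tuple

INT8_MAX = 127  # assumption: a symmetric int8 grid stands in for FP8 here


def quantize_absmax(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize with a single per-tensor scale derived from max |w|."""
    amax = max((abs(w) for w in weights), default=0.0)
    # Fall back to a scale of 1.0 for an all-zero (or empty) tensor
    # so that we never divide by zero.
    scale = amax / INT8_MAX if amax > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    q = [max(-INT8_MAX, min(INT8_MAX, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate real values."""
    return [v * scale for v in q]


if __name__ == "__main__":
    w = [0.5, -1.27, 0.02, 1.0]
    q, scale = quantize_absmax(w)
    print("quantized:", q)
    print("scale:", scale)
    print("restored:", dequantize(q, scale))
```

Because the scale is computed from the weights themselves, this step needs no calibration data, which is why it can run once while the model is being loaded.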

Fridge003 · 2025-02-04T19:17:16Z

+
+## Online Quantization
+
+> Note: Although we support online quantization, we recommend users to use quantized models.


Maybe this can be changed to "we recommend users to use offline quantized models" to avoid confusion.

Fridge003 · 2025-02-04T19:26:33Z

+    --port 30000 --host 0.0.0.0
+```
+
+**Note: If the model has already quantized offline, please **do not** add `--quantization` argument when starting the engine.**


This should read "If the model has already been quantized offline".
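
The rule in the quoted note (omit `--quantization` for a checkpoint that was already quantized offline) can be sketched as a small Python helper that assembles the launch command. This is a sketch under assumptions: `is_prequantized` and `build_launch_command` are hypothetical helpers, not part of SGLang; the check relies on the Hugging Face convention that quantized checkpoints carry a `quantization_config` entry in `config.json`; and `example-org/example-model-AWQ` is a placeholder model name.

```python
from typing import Any, Dict, List, Optional


def is_prequantized(hf_config: Dict[str, Any]) -> bool:
    """Return True if the parsed config.json declares a quantization_config."""
    return hf_config.get("quantization_config") is not None


def build_launch_command(
    model_path: str,
    hf_config: Dict[str, Any],
    quantization: Optional[str] = None,
    host: str = "0.0.0.0",
    port: int = 30000,
) -> List[str]:
    """Assemble an argv list for `python3 -m sglang.launch_server`."""
    cmd = ["python3", "-m", "sglang.launch_server", "--model-path", model_path]
    # Only request online quantization when the checkpoint is not
    # already quantized offline; otherwise the flag is dropped.
    if quantization is not None and not is_prequantized(hf_config):
        cmd += ["--quantization", quantization]
    cmd += ["--port", str(port), "--host", host]
    return cmd


if __name__ == "__main__":
    # Unquantized checkpoint: the flag is kept (online quantization).
    plain_cfg = {"model_type": "llama"}
    print(build_launch_command(
        "meta-llama/Meta-Llama-3-8B-Instruct", plain_cfg, quantization="fp8"))

    # Offline-quantized checkpoint (placeholder name): the flag is omitted.
    awq_cfg = {"model_type": "llama",
               "quantization_config": {"quant_method": "awq"}}
    print(build_launch_command(
        "example-org/example-model-AWQ", awq_cfg, quantization="fp8"))
```

Returning an argv list rather than a shell string keeps the helper safe to pass directly to `subprocess.run` without quoting concerns.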

zhaochenyang20 · 2025-02-04T20:53:33Z

Thanks @Edenzzzz @Fridge003

Fridge003 · 2025-02-21T09:11:08Z

Addressed by #3410

FlamingoPg added 15 commits February 1, 2025 22:17

Create quantization.md

29b580f

Create quantization.ipynb

0fc00f8

Update quantization.ipynb

24ee864

Update quantization.ipynb

89c390b

Update quantization.ipynb

1412ec9

Update quantization.ipynb

9777806

Update quantization.ipynb

020ef7a

Update quantization.ipynb

b2a713a

Update quantization.ipynb

2a8c33b

Update quantization.ipynb

1e771e7

Update quantization.ipynb

e55d75e

Update quantization.ipynb

3cc21e4

Update quantization.ipynb

1245321

Update quantization.ipynb

1fe89de

Update quantization.ipynb

e4f3253

Edenzzzz reviewed Feb 4, 2025


Fridge003 reviewed Feb 4, 2025


Merge branch 'main' into quantize-docs

7bd7d2a

Edenzzzz mentioned this pull request Feb 8, 2025

[Docs] Add quantization docs #3410

Merged

5 tasks

Fridge003 closed this Feb 21, 2025


FlamingoPg wants to merge 16 commits into sgl-project:main from FlamingoPg:quantize-docs


5 participants

