
[Docs] add quantization docs#3253

Closed
FlamingoPg wants to merge 16 commits into
sgl-project:mainfrom
FlamingoPg:quantize-docs


@FlamingoPg

Collaborator

Motivation

Add docs for quantization. This PR is a revision of the previous one: #2572. cc: @zhaochenyang20

Modifications

This PR adds documentation for enabling online quantization and offline quantization using SGLang.
The modifications can be summarized as follows:

Add the quantization document docs/backend/quantization.md.
Modify docs/index.rst to include the quantization docs in the SGLang documentation.

Checklist

@zhaochenyang20

Collaborator

I personally think we should keep only the markdown file, since the quantization ipynb would relaunch the kernel multiple times and this is not a basic feature. @shuaills

@zhaochenyang20

Collaborator

@yinfan98 Please fix the lint errors and remove the ipynb. Thanks.

Comment on lines +50 to +57
To quantize your model offline, first install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```bash
pip install llmcompressor
```

Here, we quantize `meta-llama/Meta-Llama-3-8B-Instruct` to `FP8` as an example of how to do offline quantization.

@Edenzzzz Edenzzzz Feb 4, 2025

Contributor


Add an example of loading official (or community) quantized weights, such as AWQ?


`SGLang` supports various quantization methods, including online dynamic quantization and offline quantization.

Online quantization computes weight scaling stats (max/min) dynamically at runtime, as exemplified by the delayed scaling in NVIDIA FP8 training. For inference, this quantizes the model once on loading.

@Fridge003 Fridge003 Feb 4, 2025

Collaborator


Add a hyperlink to the NVIDIA delayed scaling example in the document?
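
For readers of this thread, the "scaling stats (max/min)" idea in the quoted paragraph can be illustrated with a small sketch. This is not SGLang's implementation: it assumes a single per-tensor absmax scale and uses integer rounding with an int8-style range as a stand-in for the FP8 cast, since plain Python has no FP8 type. The function names are hypothetical.

```python
# Illustrative sketch only: per-tensor absmax scaling computed once "at load time".
# Integer rounding stands in for the FP8 cast; names here are hypothetical.

def absmax_scale(weights, qmax):
    """Return the scale that maps the largest |weight| onto qmax."""
    amax = max(abs(w) for w in weights)
    # Guard against an all-zero tensor so the later division is safe.
    return amax / qmax if amax > 0 else 1.0


def quantize(weights, scale, qmax):
    """Scale, round to the nearest integer level, and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights]


def dequantize(levels, scale):
    """Map integer levels back to approximate real values."""
    return [q * scale for q in levels]


weights = [0.5, -1.2, 0.25, 2.0]
QMAX = 127  # int8-style range, chosen for illustration (assumption)

scale = absmax_scale(weights, QMAX)       # stats gathered from the weights themselves
levels = quantize(weights, scale, QMAX)   # done once, when the model is loaded
restored = dequantize(levels, scale)

print(levels)  # [32, -76, 16, 127]
```

The rounding error of each restored value is bounded by half a quantization step (`scale / 2`), which is why a single outlier weight that inflates the absmax also coarsens every other value in the tensor.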

Contributor



## Online Quantization

> Note: Although we support online quantization, we recommend users to use quantized models.

Collaborator


Maybe this could be changed to "we recommend users to use offline quantized models" to avoid confusion.

```bash
--port 30000 --host 0.0.0.0
```

**Note: If the model has already quantized offline, please **do not** add `--quantization` argument when starting the engine.**

Collaborator


This should probably read "If the model has already been quantized offline".
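
The rule in the quoted note (no `--quantization` flag for a checkpoint that is already quantized offline) can be sketched as a small helper. This is a hedged illustration, not part of the PR: the helper names are hypothetical, the `python3 -m sglang.launch_server --model-path` prefix and the `fp8` value are assumptions not shown in the quoted snippet, and the check relies on the common Hugging Face convention that quantized checkpoints record a `quantization_config` entry in `config.json`.

```python
# Illustrative sketch only: decide whether to pass --quantization at launch.
# Helper names are hypothetical; the launch prefix is an assumption.

def is_prequantized(hf_config):
    """Heuristic: offline-quantized Hugging Face checkpoints usually carry a
    `quantization_config` entry in config.json."""
    return "quantization_config" in hf_config


def build_launch_args(model_path, hf_config, online_method=None):
    """Build a server launch argument list.

    `online_method` (for example "fp8") requests online quantization and is
    dropped when the checkpoint is already quantized offline.
    """
    args = ["python3", "-m", "sglang.launch_server", "--model-path", model_path]
    if online_method and not is_prequantized(hf_config):
        args += ["--quantization", online_method]
    # Port and host values match the quoted snippet.
    args += ["--port", "30000", "--host", "0.0.0.0"]
    return args


# Parsed config.json contents, written inline to keep the example self-contained.
offline_cfg = {"quantization_config": {"quant_method": "compressed-tensors"}}
plain_cfg = {}

offline_args = build_launch_args("./my-offline-fp8-model", offline_cfg, "fp8")
online_args = build_launch_args("meta-llama/Meta-Llama-3-8B-Instruct", plain_cfg, "fp8")
```

Here `./my-offline-fp8-model` is a placeholder path. `offline_args` omits `--quantization` even though a method was requested, while `online_args` keeps it.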

@zhaochenyang20

Collaborator

Thanks @Edenzzzz @Fridge003

@Edenzzzz Edenzzzz mentioned this pull request Feb 8, 2025
5 tasks
@Fridge003

Collaborator

Addressed by #3410

@Fridge003 Fridge003 closed this Feb 21, 2025

5 participants