Enable native ModelOpt quantization support (3/3)#10154

Merged
merrymercy merged 54 commits into sgl-project:main from Edwardf0t1:zhiyu/modelopt-sglang-api-3
Oct 22, 2025

Conversation

@Edwardf0t1 Edwardf0t1 commented Sep 8, 2025

This is the third PR in a three-part series to enable native ModelOpt quantization in SGLang. It includes changes from the first PR (#7149) and second PR (#9991) and will be rebased once the first two PRs are merged.

Motivation

We aim to enhance SGLang's quantization capabilities, making ModelOpt integration more robust and user-friendly while providing checkpoint persistence for better performance in production environments.

Modifications

  • Integrated ModelOpt quantized-model export functionality.
  • Added a modelopt_export_path parameter to _setup_modelopt_quantization() in ModelOptModelLoader.
  • Implemented an _export_modelopt_checkpoint() method using ModelOpt's unified HF export API.
  • Added a modelopt_export_path parameter to ModelConfig and a --modelopt-export-path command-line argument to ServerArgs.
  • Export happens automatically after quantization (or when restoring from a checkpoint).
  • Added unit tests for the export functionality.
  • Unified the quantization flags across the quantize + export and deployment phases.
  • Added an example script that runs ModelOpt quantize + export + deployment.
  • TODO: Enable a quantize-and-serve mode that performs quantize + export + deployment with a single command.
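The flag wiring described above can be sketched with stdlib argparse. This is a hypothetical illustration, not the PR's actual ServerArgs code; the flag names `--modelopt-export-path` and the `modelopt_fp8` choice come from the PR, while `build_parser` and the help strings are made up here:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of how an export flag like the PR's
    # --modelopt-export-path might be wired into the server arguments.
    parser = argparse.ArgumentParser(description="ModelOpt quantize/export sketch")
    parser.add_argument("--model-path", required=True,
                        help="HF model id or local checkpoint directory")
    parser.add_argument("--quantization", choices=["modelopt", "modelopt_fp8"],
                        default="modelopt",
                        help="quantization backend (choices taken from the PR)")
    parser.add_argument("--modelopt-export-path", default=None,
                        help="if set, export the quantized checkpoint here "
                             "after quantization completes")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model-path", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
         "--quantization", "modelopt_fp8",
         "--modelopt-export-path", "./quantized_tinyllama_fp8"]
    )
    print(args.modelopt_export_path)  # ./quantized_tinyllama_fp8
```

The key design point from the bullets is that the export path is optional: when it is unset, quantization runs in memory only; when set, the loader exports the checkpoint after quantization finishes.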

Accuracy Tests

Production Workflow:

# Step 1: Quantize + Export
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp8 \
    --quantization-method modelopt_fp8

# Step 2: Deploy
python -m sglang.launch_server \
    --model-path ./quantized_tinyllama_fp8 \
    --quantization modelopt
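Once the server from Step 2 is running, it can be smoke-tested through its OpenAI-compatible chat endpoint. A minimal sketch of the request payload follows; the port 30000 default, the `/v1/chat/completions` path, and the model name are assumptions about the local setup, not part of this PR:

```python
import json

# Hypothetical smoke-test payload for the deployed quantized model.
# URL and model name are assumptions about the local setup.
url = "http://localhost:30000/v1/chat/completions"
payload = {
    "model": "quantized_tinyllama_fp8",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32,
    "temperature": 0.0,
}
body = json.dumps(payload)
print(body)
# To actually send it: requests.post(url, json=payload, timeout=30)
```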

Benchmarking and Profiling

Checklist

Summary by CodeRabbit

  • New Features

    • Added NVIDIA ModelOpt quantization support (FP8/FP4 auto-detection), export to Hugging Face format, and serving of exported models.
    • Introduced CLI options to export after quantization and to quantize-and-serve.
    • Added quantization choice: modelopt_fp8.
    • Included an example script demonstrating quantize, export, and deploy.
  • Documentation

    • New guide “Using NVIDIA ModelOpt” covering installation, workflow, Python usage, deployment, and advanced features; reference updated.
  • Tests

    • Expanded coverage for ModelOpt workflows and additional model/attention components.
  • Chores

    • Added optional dependency group for ModelOpt.
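The FP8/FP4 auto-detection mentioned above could, in principle, be driven by the quantization config that ModelOpt's HF export writes alongside the checkpoint. A minimal sketch, assuming an `hf_quant_config.json` file with a `quant_algo` field (the exact file name and schema are assumptions here, and the function name is hypothetical):

```python
import json
import os
from typing import Optional


def detect_modelopt_quant_format(checkpoint_dir: str) -> Optional[str]:
    """Guess the ModelOpt quantization format of an exported checkpoint.

    Assumes the export writes hf_quant_config.json with a nested
    quantization.quant_algo field; returns None when no config is found.
    """
    cfg_path = os.path.join(checkpoint_dir, "hf_quant_config.json")
    if not os.path.exists(cfg_path):
        return None
    with open(cfg_path) as f:
        cfg = json.load(f)
    algo = cfg.get("quantization", {}).get("quant_algo", "")
    if "FP8" in algo:
        return "modelopt_fp8"
    if "FP4" in algo:
        return "modelopt_fp4"
    return None
```

With detection like this, a user can pass the generic `--quantization modelopt` flag at deploy time and let the loader pick FP8 vs FP4 from the exported checkpoint itself.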

@Edwardf0t1
Collaborator Author

@zhyncs @Qiaolin-Yu Please help or find someone review this PR as well when you get a chance. Thank you!

@Qiaolin-Yu self-assigned this on Sep 13, 2025
@Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 19fcedb to 95fc54b on September 13, 2025
@Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 95fc54b to d25e5d1 on September 23, 2025
Review threads:
  • test/srt/test_modelopt_loader.py
  • python/sglang/srt/configs/model_config.py
  • examples/usage/modelopt_quantize_and_export.py
  • python/sglang/srt/model_loader/loader.py (outdated)
  • python/sglang/srt/model_loader/loader.py (outdated)
  • python/pyproject.toml
@Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch 2 times, most recently from c5181b3 to 15dd13e on September 30, 2025
@b8zhong added the run-ci label on Oct 6, 2025
@Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 15dd13e to 9c2eaac on October 8, 2025
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 9b8dc42 to 3cafa90 on October 18, 2025
Review thread: python/sglang/srt/configs/model_config.py (outdated)
…tionality, add ModelOpt fields for checkpoint and export paths

@Edwardf0t1 enabled auto-merge (squash) on October 21, 2025
@FlamingoPg
Collaborator

Looks good

@merrymercy merrymercy merged commit 80b2b32 into sgl-project:main Oct 22, 2025
69 of 72 checks passed
xjpang pushed a commit to xjpang/sglang that referenced this pull request Oct 22, 2025
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Apr 20, 2026
`modelopt_quant` and `modelopt_export_path` were removed from
ModelConfig.__init__ in sgl-project#10154 (replaced by unified `quantization`
flag and LoadConfig.modelopt_export_path), but the test was never
updated. It stayed latent because the class is skipped when
nvidia-modelopt isn't installed; sgl-project#23119 added the dep to the CI
image yesterday, which exposed the failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 participants