
Consolidate Nvidia ModelOpt quant config handling for all quantization methods#28076

Merged
mgoin merged 14 commits into vllm-project:main from shengliangxu:shengliangx/modelopt-exclude-modules
Nov 20, 2025

Conversation

@shengliangxu
Contributor

@shengliangxu shengliangxu commented Nov 4, 2025

Purpose

Two major changes in this patch:

  1. Consolidate Nvidia ModelOpt quant config handling across different quantization methods (FP8 and NVFP4 as of now), especially the handling of exclude_modules. Different quantization methods share the same exclude_modules handling, but currently the code handles exclude modules as patterns only for the NVFP4 quant method; the FP8 quant format does not handle them at all. This change moves the handling of exclude_modules and the determination of per-layer quant methods into common code.

  2. The Nvidia ModelOpt library's exclude-modules semantics differ from vLLM's current logic: the exclude modules are wildcard patterns, not simple module prefix strings/substrings.
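The semantic difference can be illustrated with a small sketch (the layer name and patterns below are made up for illustration; vLLM's actual matching lives in the quant config code):

```python
import fnmatch

layer = "model.layers.0.self_attn.q_proj"
exclude_modules = ["*lm_head*", "model.layers.0.*"]  # wildcard patterns

# Simple prefix check (the old vLLM-style logic, simplified): a literal
# "*" never matches a layer name, so nothing is excluded here.
prefix_hit = any(layer.startswith(p) for p in exclude_modules)

# Wildcard check (ModelOpt semantics): the second pattern matches.
wildcard_hit = any(fnmatch.fnmatch(layer, p) for p in exclude_modules)

print(prefix_hit, wildcard_hit)  # False True
```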

Test Plan

Successfully deployed several LLM models, including Qwen2.5-VL, Qwen3-8B, Qwen3-VL, Llama3.1-70B, and Llama4-Scout, with both NVFP4 and FP8 quantization.

Test Result

Some deployments hit an issue that is not related to this change, reported here: #28072

All deployments succeed with a workaround for the above unrelated issue.



…n methods


Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@github-actions

github-actions bot commented Nov 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively consolidates the Nvidia ModelOpt quantization configuration handling for different methods like FP8 and NVFP4 into a new base class, ModelOptQuantConfigBase. This is a good refactoring that improves code reuse and maintainability. A key improvement is the unified handling of exclude_modules using fnmatch for wildcard matching, which aligns with the expected behavior of the ModelOpt library. The logic for determining quantization methods is also now cleanly centralized in the base class. I have identified a couple of areas where the implementation can be further improved for consistency and correctness, which are detailed in the specific comments.
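The consolidation the review describes can be sketched roughly as follows. Only the class name ModelOptQuantConfigBase comes from the review; the attributes and method names below are illustrative, not vLLM's exact API:

```python
import fnmatch


class ModelOptQuantConfigBase:
    """Minimal sketch of a shared base for ModelOpt quant configs."""

    def __init__(self, exclude_modules: list[str]):
        self.exclude_modules = exclude_modules

    def is_layer_excluded(self, prefix: str) -> bool:
        # ModelOpt exclude entries are wildcard patterns, so match with
        # fnmatch rather than a plain prefix/substring check.
        return any(fnmatch.fnmatch(prefix, p) for p in self.exclude_modules)

    def get_quant_method(self, layer, prefix: str):
        # Centralized decision: excluded layers stay unquantized (None);
        # subclasses (FP8, NVFP4) supply the scheme-specific method.
        if self.is_layer_excluded(prefix):
            return None
        return self._get_scheme_method(layer, prefix)

    def _get_scheme_method(self, layer, prefix: str):
        raise NotImplementedError
```

With this shape, FP8 and NVFP4 configs only override `_get_scheme_method`, and the exclude handling cannot diverge between them again.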


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@heheda12345
Collaborator

CC @mgoin @yewentao256

@mergify
Contributor

mergify bot commented Nov 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shengliangxu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 7, 2025
Member

@yewentao256 yewentao256 left a comment


The idea looks good to me, could you also show lm_eval metrics to make sure we don't hurt accuracy?

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@mergify mergify bot removed the needs-rebase label Nov 18, 2025
shengliangxu and others added 6 commits November 18, 2025 17:56
I must be sleepy that I let these issues slide

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
It's a staticmethod instead of a classmethod

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
these 2 methods should be instance methods instead of class/static

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
shengliangxu and others added 3 commits November 18, 2025 22:09
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@shengliangxu
Contributor Author

The idea looks good to me, could you also show lm_eval metrics to make sure we don't hurt accuracy?

working on it, will pick some common model and post the results here.

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2025
@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 20, 2025
@mgoin mgoin merged commit a8c5368 into vllm-project:main Nov 20, 2025
51 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 20, 2025
@shengliangxu
Contributor Author

shengliangxu commented Nov 21, 2025

Thanks for merging it @mgoin. I was working on getting lm_eval results as @yewentao256 suggested. It took me quite some time because I hit several issues, but it turned out some of my configs were incorrect. Here's one result following this doc (really nice documentation, by the way).

In the doc the accuracy numbers (i.e. without this change) are:

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|---------|------------------|--------|-------------|--------|---|--------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.8946 | ± | 0.0085 |
|       |         | strict-match     | 5      | exact_match | 0.8779 | ± | 0.0090 |

With this change, it's:

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|---------|------------------|--------|-------------|--------|---|--------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.8992 | ± | 0.0083 |
|       |         | strict-match     | 5      | exact_match | 0.8795 | ± | 0.0090 |

Both runs were on B200.

So accuracy is not impacted.
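As a quick sanity check on that conclusion, the delta between the two runs is well within the reported standard errors (values copied from the gsm8k flexible-extract rows above):

```python
# gsm8k flexible-extract: before vs. after this change
before, before_se = 0.8946, 0.0085
after, after_se = 0.8992, 0.0083

diff = after - before
# Standard error of the difference of two independent estimates
combined_se = (before_se**2 + after_se**2) ** 0.5

# ~0.0046 difference vs. a ~0.0238 two-standard-error noise band
assert abs(diff) < 2 * combined_se
```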

RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
…n methods (vllm-project#28076)

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…n methods (vllm-project#28076)

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…n methods (vllm-project#28076)

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done
