Consolidate Nvidia ModelOpt quant config handling for all quantization methods#28076
Conversation
…n methods Two major changes in this patch: 1. Consolidate Nvidia ModelOpt quant config handling across different quantization methods (FP8 and NVFP4 as of now), especially the handling of exclude_modules. Different quantization methods share the same exclude_modules handling. Currently the code handles the exclude modules as patterns for the NVFP4 quant method, but the FP8 quant format does not handle them. This change moves the handling of exclude_modules and the determination of per-layer quant methods into common code. 2. The Nvidia ModelOpt library's exclude-modules semantics differ from vLLM's current logic: the exclude modules are wildcards, not simple module prefix strings/substrings. Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request effectively consolidates the Nvidia ModelOpt quantization configuration handling for different methods like FP8 and NVFP4 into a new base class, ModelOptQuantConfigBase. This is a good refactoring that improves code reuse and maintainability. A key improvement is the unified handling of exclude_modules using fnmatch for wildcard matching, which aligns with the expected behavior of the ModelOpt library. The logic for determining quantization methods is also now cleanly centralized in the base class. I have identified a couple of areas where the implementation can be further improved for consistency and correctness, which are detailed in the specific comments.
💡 Codex Review
Here are some automated review suggestions for this pull request.
This pull request has merge conflicts that must be resolved before it can be merged.
yewentao256
left a comment
The idea looks good to me, could you also show lm_eval metrics to make sure we don't hurt accuracy?
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
I must have been sleepy to let these issues slide Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
…modelopt-exclude-modules
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
It's a staticmethod instead of a classmethod Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
These two methods should be instance methods instead of class/static methods Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Working on it; will pick some common models and post the results here.
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Thanks for merging it @mgoin. I was working on getting lm_eval results as @yewentao256 suggested. It took me quite some time because I hit several issues, but it turned out some of my configs were incorrect. Here's one result following this doc (really nice documentation, by the way). In the doc, the accuracy numbers (i.e. without this change) are:
With this change, it's:
Both runs are on B200, so accuracy is not impacted.
…n methods (vllm-project#28076) Signed-off-by: Shengliang Xu <shengliangx@nvidia.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
…n methods (vllm-project#28076) Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
…n methods (vllm-project#28076) Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Purpose
Two major changes in this patch:
Consolidate Nvidia ModelOpt quant config handling across different quantization methods (FP8 and NVFP4 as of now), especially the handling of exclude_modules. Different quantization methods share the same exclude_modules handling. Currently the code handles the exclude modules as patterns for the NVFP4 quant method, but the FP8 quant format does not handle them. This change moves the handling of exclude_modules and the determination of per-layer quant methods into common code.
The Nvidia ModelOpt library's exclude-modules semantics differ from vLLM's current logic: the exclude modules are wildcards, not simple module prefix strings/substrings.
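The difference between the two matching semantics can be shown with a small, self-contained sketch (the module names, patterns, and helper functions here are illustrative examples, not taken from the actual checkpoints or vLLM code):

```python
from fnmatch import fnmatch


def excluded_by_prefix(name: str, excludes: list[str]) -> bool:
    # vLLM's previous semantics: exclude if any entry is a literal prefix.
    return any(name.startswith(e) for e in excludes)


def excluded_by_wildcard(name: str, excludes: list[str]) -> bool:
    # ModelOpt semantics: each entry is a shell-style wildcard pattern.
    return any(fnmatch(name, e) for e in excludes)


excludes = ["*.lm_head", "model.layers.*.mlp.gate"]

# Wildcard matching catches the layer; prefix matching misses it,
# because no exclude entry is a literal prefix of the module name.
print(excluded_by_wildcard("model.layers.0.mlp.gate", excludes))  # True
print(excluded_by_prefix("model.layers.0.mlp.gate", excludes))    # False
```

With prefix semantics, a ModelOpt-produced pattern like `*.lm_head` would silently match nothing, so the layer would be quantized even though the checkpoint intended to exclude it.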
Test Plan
Successfully deployed several LLM models, including Qwen2.5-VL, Qwen3-8B, Qwen3-VL, Llama3.1-70B, and Llama4-Scout, with both NVFP4 and FP8 quantization.
Test Result
Some deployments hit issues that are not related to this change, reported here: #28072
All deployments succeed with a workaround for the unrelated issue above.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.