Consolidate Nvidia ModelOpt quant config handling for all quantization methods#28076
Conversation
…n methods Two major changes in this patch: 1. Consolidate Nvidia ModelOpt quant config handling across different quantization methods (FP8 and NVFP4 as of now), especially the handling of exclude_modules. Different quantization methods share the same exclude_modules handling. Currently the code handles the exclude modules as patterns for the NVFP4 quant method, but the FP8 quant format does not handle them. This change moves the handling of exclude_modules and the determination of per-layer quant methods into common code. 2. The Nvidia ModelOpt library's exclude-modules semantics differ from vLLM's current logic: the exclude modules are wildcards, not simple module prefix strings/substrings. Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request effectively consolidates the Nvidia ModelOpt quantization configuration handling for different methods like FP8 and NVFP4 into a new base class, ModelOptQuantConfigBase. This is a good refactoring that improves code reuse and maintainability. A key improvement is the unified handling of exclude_modules using fnmatch for wildcard matching, which aligns with the expected behavior of the ModelOpt library. The logic for determining quantization methods is also now cleanly centralized in the base class. I have identified a couple of areas where the implementation can be further improved for consistency and correctness, which are detailed in the specific comments.
💡 Codex Review
Here are some automated review suggestions for this pull request.
This pull request has merge conflicts that must be resolved before it can be merged.
yewentao256
left a comment
The idea looks good to me, could you also show lm_eval metrics to make sure we don't hurt accuracy?
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
I must have been sleepy to let these issues slide Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
…modelopt-exclude-modules
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
It's a staticmethod instead of a classmethod Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
These two methods should be instance methods instead of class/static methods Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Working on it; will pick some common models and post the results here.
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Thanks for merging it @mgoin. I was working on getting lm_eval results as @yewentao256 suggested. It took me quite some time because I hit several issues, but it turned out some of my configs were incorrect. Here's one result following this doc (really nice documentation, by the way). In the doc, the accuracy numbers (i.e. without this change) are:
With this change, it's:
Both runs are on B200, so accuracy is not impacted.
…n methods (vllm-project#28076) Signed-off-by: Shengliang Xu <shengliangx@nvidia.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
…n methods (vllm-project#28076) Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
…n methods (vllm-project#28076) Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Purpose
Two major changes in this patch:
Consolidate Nvidia ModelOpt quant config handling across different quantization methods (FP8 and NVFP4 as of now), especially the handling of exclude_modules. Different quantization methods share the same exclude_modules handling. Currently the code handles the exclude modules as patterns for the NVFP4 quant method, but the FP8 quant format does not handle them. This change moves the handling of exclude_modules and the determination of per-layer quant methods into common code.
The Nvidia ModelOpt library's exclude-modules semantics differ from vLLM's current logic: the exclude modules are wildcards, not simple module prefix strings/substrings.
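The difference between the two matching semantics can be shown with a small, self-contained sketch (the module names, patterns, and helper functions here are illustrative examples, not taken from the actual checkpoints or vLLM code):

```python
from fnmatch import fnmatch


def excluded_by_prefix(name: str, excludes: list[str]) -> bool:
    # vLLM's previous semantics: exclude if any entry is a literal prefix.
    return any(name.startswith(e) for e in excludes)


def excluded_by_wildcard(name: str, excludes: list[str]) -> bool:
    # ModelOpt semantics: each entry is a shell-style wildcard pattern.
    return any(fnmatch(name, e) for e in excludes)


excludes = ["*.lm_head", "model.layers.*.mlp.gate"]

# Wildcard matching catches the layer; prefix matching misses it,
# because no exclude entry is a literal prefix of the module name.
print(excluded_by_wildcard("model.layers.0.mlp.gate", excludes))  # True
print(excluded_by_prefix("model.layers.0.mlp.gate", excludes))    # False
```

With prefix semantics, a ModelOpt-produced pattern like `*.lm_head` would silently match nothing, so the layer would be quantized even though the checkpoint intended to exclude it.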
Test Plan
Successfully deployed several LLM models, including Qwen2.5-VL, Qwen3-8B, Qwen3-VL, Llama3.1-70B, and Llama4-Scout, with both NVFP4 and FP8 quantization.
Test Result
Some deployments hit issues that are not related to this change, reported here: #28072
All deployments succeed with a workaround for the unrelated issue above.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.