[Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py by chengyinie · Pull Request #38288 · vllm-project/vllm

chengyinie · 2026-03-26T22:17:13Z

Purpose

Consolidate GPTQ quantization by renaming gptq_marlin.py to auto_gptq.py as requested in #37765.
Key changes:

Renames gptq_marlin.py → auto_gptq.py
GPTQMarlinConfig.get_name() now returns "auto_gptq"
Adds override_quantization_method() to auto-convert models with quant_method: gptq to use auto_gptq
Updates all imports across the codebase to use auto_gptq module
Adds auto_gptq to overrides list in model.py
Adds test_auto_gptq.py for the new quantization method
Updates test_gptq_marlin.py skipif condition to check auto_gptq support
Maintains backward compatibility: quantization="gptq" still works (maps to GPTQMarlinConfig)
Closes [Feature]: Consolidate GPTQ Quantization #37765

Test Plan

pytest tests/quantization/test_auto_gptq.py -v -s
pytest tests/quantization/test_gptq_dynamic.py -v -s
pytest tests/quantization/test_lm_head.py -v -s
pytest tests/models/quantization/test_gptq_marlin.py -v -s

## Test Result
 All passed.
---
<details>
<summary> Essential Elements of an Effective PR Description Checklist </summary>

- [*] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [*] The test plan, such as providing test command.
- [*] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the [Google Doc](https://docs.google.com/document/d/1YyVqrgX4gHTtrstbq8oWUImOyPCKSGnJ7xtTpmXzlRs/edit?tab=t.0).
</details>

gemini-code-assist

Code Review

This pull request renames the gptq_marlin quantization method to auto_gptq across the codebase, updating tests, imports, and configuration logic. It also introduces a mechanism for backward compatibility to handle existing gptq and gptq_marlin configurations. Feedback suggests that the QuantizationMethods literal should still include the legacy names to prevent validation errors when they are passed as arguments, ensuring the intended backward compatibility is fully functional.

gemini-code-assist · 2026-03-26T22:18:53Z

@@ -21,7 +21,7 @@
    "modelopt_mixed",
    "gguf",
    "awq_marlin",
-    "gptq",
+    "auto_gptq",


The QuantizationMethods literal is updated to "auto_gptq", but the method_to_config dictionary (lines 148-149) explicitly retains mappings for "gptq" and "gptq_marlin" for backward compatibility. If "gptq" or "gptq_marlin" are passed as quantization arguments, get_quantization_config will raise a ValueError because these strings will not be found in QUANTIZATION_METHODS (which is derived from get_args(QuantizationMethods)). This breaks the stated backward compatibility. Please include "gptq" and "gptq_marlin" in the QuantizationMethods literal to maintain compatibility.

Suggested change

-    "auto_gptq",
+    "auto_gptq",
+    "gptq",
+    "gptq_marlin",

Kept gptq and gptq_marlin in QuantizationMethods.
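The failure mode described in this thread can be illustrated with a minimal sketch. It is not the vLLM source: the literal is trimmed to a few entries, the config classes are replaced by placeholder strings, and `METHOD_TO_CONFIG` is a hypothetical stand-in for the `method_to_config` dictionary mentioned above.

```python
from typing import Literal, get_args

# Trimmed, hypothetical stand-in for the real literal. The legacy names
# "gptq" and "gptq_marlin" are kept so they pass validation.
QuantizationMethods = Literal["awq_marlin", "auto_gptq", "gptq", "gptq_marlin"]
QUANTIZATION_METHODS: list[str] = list(get_args(QuantizationMethods))

# Placeholder strings instead of config classes; the legacy names resolve
# to the same config as "auto_gptq".
METHOD_TO_CONFIG: dict[str, str] = {
    "awq_marlin": "AWQMarlinConfig",
    "auto_gptq": "AutoGPTQConfig",
    "gptq": "AutoGPTQConfig",
    "gptq_marlin": "AutoGPTQConfig",
}


def get_quantization_config(quantization: str) -> str:
    """Validate the name against the literal before looking up its config."""
    if quantization not in QUANTIZATION_METHODS:
        raise ValueError(f"Invalid quantization method: {quantization}")
    return METHOD_TO_CONFIG[quantization]
```

If "gptq" were dropped from the literal while staying in the dictionary, the membership check would reject it before the dictionary lookup is ever reached, which is the inconsistency the review points out.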

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mgoin · 2026-03-28T00:54:46Z

-        self.desc_act = desc_act
-        self.lm_head_quantized = lm_head_quantized
-        self.pack_factor = Fraction(32, self.weight_bits)
-        if self.weight_bits not in [2, 3, 4, 8]:


Is it intentional that we are dropping support for 2-bit and 3-bit?

Yes. The Marlin kernels only support 4-bit and 8-bit symmetric GPTQ. The Exllama fallback kernel also only lists uint4b8 and uint8b128 as supported (2/3-bit is commented out as untested). Since neither backend actively supports 2/3-bit, we're dropping it from the config rather than silently failing at runtime.
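A minimal sketch of rejecting unsupported bit widths at config time, assuming a lookup table keyed by `(num_bits, sym)` like the `TYPE_MAP` that appears in the diff context of this PR. The table contents and the function name `validate_gptq_config` are illustrative, not the actual implementation.

```python
from fractions import Fraction

# Hypothetical subset of a (num_bits, is_sym) -> scalar type name table.
# Only the two symmetric types named in the reply above are listed.
TYPE_MAP: dict[tuple[int, bool], str] = {
    (4, True): "uint4b8",
    (8, True): "uint8b128",
}


def validate_gptq_config(weight_bits: int, is_sym: bool) -> Fraction:
    """Return the pack factor, or raise for configurations with no kernel."""
    if (weight_bits, is_sym) not in TYPE_MAP:
        raise ValueError(
            f"Unsupported GPTQ config: weight_bits={weight_bits}, sym={is_sym}. "
            f"Supported (bits, sym) pairs: {sorted(TYPE_MAP)}"
        )
    # Number of quantized values packed into one 32-bit word.
    return Fraction(32, weight_bits)
```

With this check, a 2-bit or 3-bit checkpoint fails when the config is built, with a message that lists the supported pairs, rather than failing later inside a kernel.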

mergify · 2026-03-30T16:22:11Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chengyinie.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

robertgshaw2-redhat · 2026-03-31T15:53:35Z

                "awq_marlin",
+                "auto_gptq",


Can we remove this? Why not?

gptq and gptq_marlin are in the overrides list because they also resolve to AutoGPTQConfig which has override_quantization_method(), and the validation requires any method with that override to be listed.
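The validation rule mentioned in this reply can be sketched roughly as follows. This is an assumption-laden illustration: the real check in model.py may be structured differently, and `resolve_override`, `gptq_override`, and `no_override` are hypothetical names.

```python
from typing import Callable, Optional

# An override callable takes (hf_quant_method, user_quant) and returns the
# name to switch to, or None if it does not apply.
OverrideFn = Callable[[str, Optional[str]], Optional[str]]


def gptq_override(hf_method: str, user_quant: Optional[str]) -> Optional[str]:
    return "auto_gptq" if hf_method == "gptq" else None


def no_override(hf_method: str, user_quant: Optional[str]) -> Optional[str]:
    return None


def resolve_override(
    hf_method: str,
    user_quant: Optional[str],
    methods: dict[str, OverrideFn],
    overrides: list[str],
) -> Optional[str]:
    """Try listed overrides first; reject an override that is not listed."""
    ordered = [m for m in overrides if m in methods]
    ordered += [m for m in methods if m not in overrides]
    for name in ordered:
        result = methods[name](hf_method, user_quant)
        if result is not None:
            if name not in overrides:
                raise ValueError(
                    f"{name} overrides the quantization method but is "
                    "missing from the overrides list"
                )
            return result
    return None
```

Under this reading, every name whose config can return an override has to appear in the list, which is why the legacy names are added alongside "auto_gptq".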

robertgshaw2-redhat · 2026-03-31T15:54:14Z

    with vllm_runner(model_id, dtype=torch.float16, max_model_len=512) as llm:

        def check_model(model_id):
            for name, submodule in model_id.named_modules():
                # Could check more modules if necessary
                if name == "model_id.layers.0.self_attn.qkv_proj":
-                    assert isinstance(submodule.quant_method, linear_method_cls)
-
-                    config = submodule.quant_method.quant_config


why does this get deleted?

Is it intentional that we are dropping support for 2-bit and 3-bit?
Marlin only supports 4-bit and 8-bit symmetric, and even the Exllama fallback kernel only lists uint4b8 and uint8b128 as supported — 2/3-bit is explicitly noted as untested. Since no kernel backend actively supports or tests 2/3-bit GPTQ, it's better to reject it early with a clear error than silently produce incorrect results.

For test_gptq_v2.py — the test model (Qwen3-1.7B-w2g64-gptq_v2) is 2-bit, so the tests are now skipped. The config property assertions were removed since they would never execute. I kept the file as a placeholder so it can be unskipped if a 4/8-bit GPTQv2 test model is added later. Happy to remove the file entirely if you prefer.

robertgshaw2-redhat · 2026-03-31T15:56:04Z

@@ -293,6 +266,16 @@ def is_gptq_marlin_compatible(cls, quant_config: dict[str, Any]):
            quant_type=cls.TYPE_MAP[(num_bits, sym)], group_size=group_size
        )

+    @classmethod
+    def override_quantization_method(


Can't we get rid of this?

The override is needed to handle the case where the user passes --quantization auto_gptq (or gptq_marlin) but the HF config has quant_method: gptq. Without override_quantization_method(), the verification would raise a mismatch error between auto_gptq (user) and gptq (HF config). The override normalizes all variants to auto_gptq before that check runs.
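The normalization described here can be sketched as below. The function bodies are assumptions written for illustration; `GPTQ_ALIASES` and `verify_quantization` are hypothetical names, and the real method is a classmethod on the config class rather than a free function.

```python
from typing import Any, Optional

# Names that should all resolve to the consolidated method (assumed set).
GPTQ_ALIASES = {"gptq", "gptq_marlin", "auto_gptq"}


def override_quantization_method(
    hf_quant_cfg: dict[str, Any], user_quant: Optional[str]
) -> Optional[str]:
    """Return "auto_gptq" when both sides name a GPTQ variant, else None."""
    hf_method = str(hf_quant_cfg.get("quant_method", "")).lower()
    if hf_method in GPTQ_ALIASES and (
        user_quant is None or user_quant in GPTQ_ALIASES
    ):
        return "auto_gptq"
    return None


def verify_quantization(
    hf_quant_cfg: dict[str, Any], user_quant: Optional[str]
) -> str:
    """Apply the override first, then run the mismatch check."""
    hf_method = str(hf_quant_cfg["quant_method"]).lower()
    override = override_quantization_method(hf_quant_cfg, user_quant)
    if override is not None:
        hf_method = user_quant = override
    if user_quant is not None and user_quant != hf_method:
        raise ValueError(
            f"Quantization method in the model config ({hf_method}) does not "
            f"match the requested method ({user_quant})"
        )
    return hf_method
```

Because the override runs before the comparison, a user-supplied `auto_gptq` and a checkpoint that says `quant_method: gptq` end up as the same string and the mismatch error is never raised.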

mergify · 2026-05-13T18:40:32Z

Hi @chengyinie, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

cnie-rblx · 2026-05-13T19:27:57Z

Rebased. CI mypy/ruff failures are pre-existing on main from #35859 (quark_moe.py references undefined FusedMoE). Our tests pass locally -- verified test_auto_gptq.py in Docker. Will auto-pass once the quark fix lands.

- Renames gptq_marlin.py → auto_gptq.py
- GPTQMarlinConfig renamed to AutoGPTQConfig; get_name() returns "auto_gptq"
- Adds override_quantization_method() to auto-convert models with quant_method: gptq to use auto_gptq
- Updates all imports across the codebase to use auto_gptq module
- Adds gptq, gptq_marlin to overrides list in model.py for validation
- Adds test_auto_gptq.py for the new quantization method
- Removes legacy gptq.py (exllama kernels); Marlin covers all use cases
- Maintains backward compatibility: "gptq" and "gptq_marlin" still work

Closes vllm-project#37765

Signed-off-by: Chengyi Nie <cnie@roblox.com>

…tq rename Update test expectations in test_configs.py to reflect that AutoGPTQConfig.get_name() now returns "auto_gptq" instead of "gptq_marlin". Also add "auto_gptq" to ROCm's supported_quantization list so models resolve correctly on ROCm. Signed-off-by: Chengyi Nie <cnie@roblox.com> Made-with: Cursor

…rename Update int_wna16.py imports and type annotations from GPTQMarlinConfig to AutoGPTQConfig, and update cpu.yaml CI file watcher path from gptq_marlin.py to auto_gptq.py. These files were added/modified after the original PR was authored and still referenced the old module. Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>

…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>

@Isotr0py

Root cause: MergedColumnParallelLinear with output_sizes=[num_v_heads]*2 produces per-rank outputs below Marlin's MIN_THREAD_N=64 at TP>=2 (e.g. Qwen3.5 397B: 64/TP=4 = 16 < 64).

Fix per @Isotr0py's suggestion: pass disable_tp=True to the existing MergedColumnParallelLinear when a Marlin-backed quant config is in use on a non-interleaved (Qwen3.5) layout, so each rank computes the full projection. forward_cuda() then slices b/a to the local TP partition. Qwen3-Next's interleaved layout, other quant schemes (FP8, unquantized), and non-CUDA platforms keep normal TP sharding.

Rebased onto current main:
- Imports renamed: GPTQMarlinConfig moved to AutoGPTQConfig in vllm-project#38288.
- Gated maybe_disable_tp on (not gqa_interleaved_layout) so the new fix_query_key_value_ordering path on Qwen3-Next stays correct (it expects a TP-sharded mixed_ba shape for the [ng, 2*np/ng] view).
- forward_cuda Qwen3.5 branch chunks via the new split_ba helper.
- ROCm/XPU/CPU paths skipped via current_platform.is_cuda() gate.

Closes vllm-project#35924

Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
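The arithmetic and the gating in this commit message can be sketched as follows. Only `MIN_THREAD_N = 64` and the example sizes come from the text above; the helper names `per_rank_output_size` and `gate_disable_tp` and their parameters are hypothetical, and the real gating logic may take more inputs.

```python
MIN_THREAD_N = 64  # value quoted in the commit message above


def per_rank_output_size(output_size: int, tp_size: int) -> int:
    """Columns each rank owns when the output dimension is sharded evenly."""
    if output_size % tp_size != 0:
        raise ValueError("output_size must be divisible by tp_size")
    return output_size // tp_size


def gate_disable_tp(
    uses_marlin: bool, gqa_interleaved_layout: bool, is_cuda: bool
) -> bool:
    """Disable TP sharding only for Marlin, non-interleaved layout, on CUDA."""
    return uses_marlin and not gqa_interleaved_layout and is_cuda


# Example from the message: 64 outputs at TP=4 leaves 16 per rank,
# which is below the Marlin minimum.
shard = per_rank_output_size(64, 4)
too_small = shard < MIN_THREAD_N
```

When the gate returns True, each rank computes the full projection and slices out its own partition afterwards, as the message describes; every other combination keeps normal sharding.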

…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>

wenhuach21 · 2026-06-03T02:00:45Z

It looks like this PR means 2-bit and 3-bit are no longer supported in GPTQ. Is this an intentional change, or have I overlooked something?

wenhuach21 · 2026-06-03T02:05:15Z

Since GPTQ is one of the few backends that supports 2-bit and 3-bit quantization, keeping this support could benefit the community. Although these configurations may be slower and sometimes suffer from accuracy issues, they are still valuable for exploring the limits of extremely low-bit quantization.

cnie-rblx · 2026-06-03T21:48:57Z

Yes, this was intentional as we discussed during the review process. The consolidation removed the old gptq.py (Exllama kernels) and unified everything under auto_gptq.py (Marlin kernels). Since Marlin only supports 4-bit and 8-bit symmetric (uint4b8, uint8b128), and the Exllama fallback had 2/3-bit commented out as untested, we chose to reject unsupported configs early rather than silently fail at runtime.

I agree 2/3-bit GPTQ has value for the community, especially for low-bit quantization research. I can think of two paths to restore it:

Add 2/3-bit entries to TYPE_MAP and validate they work with an existing kernel backend (Exllama or a new one). This would be a follow-up PR.
Bring back the Exllama kernel path as a fallback specifically for 2/3-bit configs that Marlin can't handle.

@robertgshaw2-redhat @jikunshang WDYT about it?

torotoki · 2026-06-04T04:33:07Z

I would really like to see 2/3-bit GPTQ support revived. I agree with @cnie-rblx's plan, but I can also work on it if you prefer.

…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

torotoki · 2026-06-09T18:47:59Z

I created a separate issue as a feature request: #45051

gemini-code-assist Bot reviewed Mar 26, 2026

jikunshang reviewed Mar 27, 2026

Comment thread vllm/model_executor/layers/quantization/__init__.py Outdated

chengyinie changed the base branch from gptq-consolidation to main March 27, 2026 16:28

mergify Bot added rocm Related to AMD ROCm v1 labels Mar 27, 2026

github-project-automation Bot added this to AMD Mar 27, 2026

github-project-automation Bot moved this to Todo in AMD Mar 27, 2026

chengyinie marked this pull request as ready for review March 27, 2026 17:29

chengyinie requested review from DarkLight1337, ProExpertProg, WoosukKwon, hmellor, houseroad, markmc, mgoin, pavanimajety, robertgshaw2-redhat, tjtanaa, tlrmchlsmth, yewentao256, youkaichao and ywang96 as code owners March 27, 2026 17:29

claude Bot reviewed Mar 27, 2026

mgoin reviewed Mar 28, 2026

mergify Bot added the needs-rebase label Mar 30, 2026

robertgshaw2-redhat reviewed Mar 31, 2026

Comment thread vllm/model_executor/layers/quantization/auto_gptq.py Outdated

robertgshaw2-redhat reviewed Mar 31, 2026

cnie-rblx and others added 3 commits May 13, 2026 16:22

chengyinie force-pushed the gptq-consolidation branch from aaa9a46 to 2bda97d Compare May 13, 2026 23:22

jikunshang approved these changes May 14, 2026

chengyinie added 2 commits May 13, 2026 18:26

Merge branch 'main' into gptq-consolidation

fd7066a

Merge branch 'main' into gptq-consolidation

25cb6b2

jikunshang merged commit fa2a33b into vllm-project:main May 15, 2026
90 checks passed

github-project-automation Bot moved this from Todo to Done in AMD May 15, 2026

WingedGuardian mentioned this pull request May 18, 2026

[Bugfix] Route INT8 GPTQ MoE to WNA16 fallback #42022

Closed

5 tasks

sonusflow mentioned this pull request May 19, 2026

[Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 #36329

Merged

8 tasks

Alex-ai-future mentioned this pull request May 26, 2026

fix(quantization): Fix AWQ dequantize on Intel XPU and refactor AutoAWQ config #42727

Open

torotoki mentioned this pull request Jun 9, 2026

[Feature]: Restore support for 2-bit and 3-bit GPTQ #45051

Open

1 task

7 participants
