Skip to content

[Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py#38288

Merged
jikunshang merged 5 commits into
vllm-project:mainfrom
chengyinie:gptq-consolidation
May 15, 2026
Merged

[Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py#38288
jikunshang merged 5 commits into
vllm-project:mainfrom
chengyinie:gptq-consolidation

Conversation

@chengyinie

@chengyinie chengyinie commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

Purpose

Consolidate GPTQ quantization by renaming gptq_marlin.py to auto_gptq.py as requested in #37765.
Key changes:

  • Renames gptq_marlin.pyauto_gptq.py
  • GPTQMarlinConfig.get_name() now returns "auto_gptq"
  • Adds override_quantization_method() to auto-convert models with quant_method: gptq to use auto_gptq
  • Updates all imports across the codebase to use auto_gptq module
  • Adds auto_gptq to overrides list in model.py
  • Adds test_auto_gptq.py for the new quantization method
  • Updates test_gptq_marlin.py skipif condition to check auto_gptq support
  • Maintains backward compatibility: quantization="gptq" still works (maps to GPTQMarlinConfig)
    Closes [Feature]: Consolidate GPTQ Quantization #37765

Test Plan

pytest tests/quantization/test_auto_gptq.py -v -s
pytest tests/quantization/test_gptq_dynamic.py -v -s
pytest tests/quantization/test_lm_head.py -v -s
pytest tests/models/quantization/test_gptq_marlin.py -v -s

## Test Result
 All passed.
---
<details>
<summary> Essential Elements of an Effective PR Description Checklist </summary>

- [*] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [*] The test plan, such as providing test command.
- [*] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the [Google Doc](https://docs.google.com/document/d/1YyVqrgX4gHTtrstbq8oWUImOyPCKSGnJ7xtTpmXzlRs/edit?tab=t.0).
</details>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request renames the gptq_marlin quantization method to auto_gptq across the codebase, updating tests, imports, and configuration logic. It also introduces a mechanism for backward compatibility to handle existing gptq and gptq_marlin configurations. Feedback suggests that the QuantizationMethods literal should still include the legacy names to prevent validation errors when they are passed as arguments, ensuring the intended backward compatibility is fully functional.

@@ -21,7 +21,7 @@
"modelopt_mixed",
"gguf",
"awq_marlin",
"gptq",
"auto_gptq",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The QuantizationMethods literal is updated to "auto_gptq", but the method_to_config dictionary (lines 148-149) explicitly retains mappings for "gptq" and "gptq_marlin" for backward compatibility. If "gptq" or "gptq_marlin" are passed as quantization arguments, get_quantization_config will raise a ValueError because these strings will not be found in QUANTIZATION_METHODS (which is derived from get_args(QuantizationMethods)). This breaks the stated backward compatibility. Please include "gptq" and "gptq_marlin" in the QuantizationMethods literal to maintain compatibility.

Suggested change
"auto_gptq",
"auto_gptq",
"gptq",
"gptq_marlin",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeped gptq and gptq_marlin in QuantizationMethods

Comment thread vllm/model_executor/layers/quantization/__init__.py Outdated
@chengyinie chengyinie changed the base branch from gptq-consolidation to main March 27, 2026 16:28
@mergify mergify Bot added rocm Related to AMD ROCm v1 labels Mar 27, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 27, 2026
@chengyinie chengyinie marked this pull request as ready for review March 27, 2026 17:29

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Comment thread vllm/model_executor/layers/quantization/auto_gptq.py
self.desc_act = desc_act
self.lm_head_quantized = lm_head_quantized
self.pack_factor = Fraction(32, self.weight_bits)
if self.weight_bits not in [2, 3, 4, 8]:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intentional that we are dropping support for 2 and 3 bit right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The Marlin kernels only support 4-bit and 8-bit symmetric GPTQ. The Exllama fallback kernel also only lists uint4b8 and uint8b128 as supported (2/3-bit is commented out as untested). Since neither backend actively supports 2/3-bit, we're dropping it from the config rather than silently failing at runtime.

@mergify

mergify Bot commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chengyinie.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 30, 2026
Comment thread vllm/model_executor/layers/quantization/auto_gptq.py Outdated
Comment thread vllm/config/model.py
"awq_marlin",
"auto_gptq",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this? Why not?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gptq and gptq_marlin are in the overrides list because they also resolve to AutoGPTQConfig which has override_quantization_method(), and the validation requires any method with that override to be listed.

with vllm_runner(model_id, dtype=torch.float16, max_model_len=512) as llm:

def check_model(model_id):
for name, submodule in model_id.named_modules():
# Could check more modules if necessary
if name == "model_id.layers.0.self_attn.qkv_proj":
assert isinstance(submodule.quant_method, linear_method_cls)

config = submodule.quant_method.quant_config

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this get deleted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intentional that we are dropping support for 2 and 3 bit right?
Marlin only supports 4-bit and 8-bit symmetric, and even the Exllama fallback kernel only lists uint4b8 and uint8b128 as supported — 2/3-bit is explicitly noted as untested. Since no kernel backend actively supports or tests 2/3-bit GPTQ, it's better to reject it early with a clear error than silently produce incorrect results.

For test_gptq_v2.py — the test model (Qwen3-1.7B-w2g64-gptq_v2) is 2-bit, so the tests are now skipped. The config property assertions were removed since they would never execute. I kept the file as a placeholder so it can be unskipped if a 4/8-bit GPTQv2 test model is added later. Happy to remove the file entirely if you prefer.

@@ -293,6 +266,16 @@ def is_gptq_marlin_compatible(cls, quant_config: dict[str, Any]):
quant_type=cls.TYPE_MAP[(num_bits, sym)], group_size=group_size
)

@classmethod
def override_quantization_method(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cant we get rid of this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The override is needed to handle the case where the user passes --quantization auto_gptq (or gptq_marlin) but the HF config has quant_method: gptq. Without override_quantization_method(), the verification would raise a mismatch error between auto_gptq (user) and gptq (HF config). The override normalizes all variants to auto_gptq before that check runs.

@mergify

mergify Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Hi @chengyinie, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@cnie-rblx

Copy link
Copy Markdown
Contributor

Rebased. CI mypy/ruff failures are pre-existing on main from #35859 (quark_moe.py references undefined FusedMoE). Our tests pass locally -- verified test_auto_gptq.py in Docker. Will auto-pass once the quark fix lands.

cnie-rblx and others added 3 commits May 13, 2026 16:22
- Renames gptq_marlin.py → auto_gptq.py
- GPTQMarlinConfig renamed to AutoGPTQConfig; get_name() returns "auto_gptq"
- Adds override_quantization_method() to auto-convert models with
  quant_method: gptq to use auto_gptq
- Updates all imports across the codebase to use auto_gptq module
- Adds gptq, gptq_marlin to overrides list in model.py for validation
- Adds test_auto_gptq.py for the new quantization method
- Removes legacy gptq.py (exllama kernels); Marlin covers all use cases
- Maintains backward compatibility: "gptq" and "gptq_marlin" still work

Closes vllm-project#37765

Signed-off-by: Chengyi Nie <cnie@roblox.com>
…tq rename

Update test expectations in test_configs.py to reflect that
AutoGPTQConfig.get_name() now returns "auto_gptq" instead of
"gptq_marlin". Also add "auto_gptq" to ROCm's supported_quantization
list so models resolve correctly on ROCm.

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Made-with: Cursor
…rename

Update int_wna16.py imports and type annotations from GPTQMarlinConfig
to AutoGPTQConfig, and update cpu.yaml CI file watcher path from
gptq_marlin.py to auto_gptq.py. These files were added/modified after
the original PR was authored and still referenced the old module.

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@chengyinie chengyinie force-pushed the gptq-consolidation branch from aaa9a46 to 2bda97d Compare May 13, 2026 23:22
@jikunshang jikunshang merged commit fa2a33b into vllm-project:main May 15, 2026
90 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 15, 2026
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
sonusflow added a commit to sonusflow/vllm that referenced this pull request May 19, 2026
Root cause: MergedColumnParallelLinear with output_sizes=[num_v_heads]*2
produces per-rank outputs below Marlin's MIN_THREAD_N=64 at TP>=2
(e.g. Qwen3.5 397B: 64/TP=4 = 16 < 64).

Fix per @Isotr0py's suggestion: pass disable_tp=True to the existing
MergedColumnParallelLinear when a Marlin-backed quant config is in use
on a non-interleaved (Qwen3.5) layout, so each rank computes the full
projection. forward_cuda() then slices b/a to the local TP partition.
Qwen3-Next's interleaved layout, other quant schemes (FP8, unquantized),
and non-CUDA platforms keep normal TP sharding.

Rebased onto current main:
- Imports renamed: GPTQMarlinConfig moved to AutoGPTQConfig in vllm-project#38288.
- Gated maybe_disable_tp on (not gqa_interleaved_layout) so the new
  fix_query_key_value_ordering path on Qwen3-Next stays correct (it
  expects a TP-sharded mixed_ba shape for the [ng, 2*np/ng] view).
- forward_cuda Qwen3.5 branch chunks via the new split_ba helper.
- ROCm/XPU/CPU paths skipped via current_platform.is_cuda() gate.

Closes vllm-project#35924

Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Liuweixiong0118 pushed a commit to Liuweixiong0118/vllm that referenced this pull request Jun 1, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
@wenhuach21

wenhuach21 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

It looks like this PR causes 2-bit and 3-bit no longer be supported in GPTQ. Is this an intentional change, or have I overlooked something?

@wenhuach21

Copy link
Copy Markdown
Contributor

Since GPTQ is one of the few backends that supports 2-bit and 3-bit quantization, keeping this support could benefit the community. Although these configurations may be slower and sometimes suffer from accuracy issues, they are still valuable for exploring the limits of extremely low-bit quantization.

@cnie-rblx

Copy link
Copy Markdown
Contributor

Yes, this was intentional as we discussed during the review process. The consolidation removed the old gptq.py (Exllama kernels) and unified everything under auto_gptq.py (Marlin kernels). Since Marlin only supports 4-bit and 8-bit symmetric (uint4b8, uint8b128), and the Exllama fallback had 2/3-bit commented out as untested, we chose to reject unsupported configs early rather than silently fail at runtime.

I agree 2/3-bit GPTQ has value for the community, especially for low-bit quantization research. There are some paths to restore it in my mind:

  • Add 2/3-bit entries to TYPE_MAP and validate they work with an existing kernel backend (Exllama or a new one). This would be a follow-up PR.

  • Bring back the Exllama kernel path as a fallback specifically for 2/3-bit configs that Marlin can't handle.

@robertgshaw2-redhat @jikunshang WDYT about it?

@torotoki

torotoki commented Jun 4, 2026

Copy link
Copy Markdown

Would really like to see 2/3-bit GPTQ support revived. I agree with the @cnie-rblx plan, but I can also work on it if you prefer.

mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
andakai pushed a commit to andakai/vllm that referenced this pull request Jun 4, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
…-project#38288)

Signed-off-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Chengyi Nie <cnie@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@torotoki

torotoki commented Jun 9, 2026

Copy link
Copy Markdown

I created a separate issue as a feature request: #45051

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build cpu Related to CPU backends ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Feature]: Consolidate GPTQ Quantization

7 participants