[Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py#38288
Conversation
There was a problem hiding this comment.
Code Review
This pull request renames the gptq_marlin quantization method to auto_gptq across the codebase, updating tests, imports, and configuration logic. It also introduces a mechanism for backward compatibility to handle existing gptq and gptq_marlin configurations. Feedback suggests that the QuantizationMethods literal should still include the legacy names to prevent validation errors when they are passed as arguments, ensuring the intended backward compatibility is fully functional.
| @@ -21,7 +21,7 @@ | |||
| "modelopt_mixed", | |||
| "gguf", | |||
| "awq_marlin", | |||
| "gptq", | |||
| "auto_gptq", | |||
There was a problem hiding this comment.
The QuantizationMethods literal is updated to "auto_gptq", but the method_to_config dictionary (lines 148-149) explicitly retains mappings for "gptq" and "gptq_marlin" for backward compatibility. If "gptq" or "gptq_marlin" are passed as quantization arguments, get_quantization_config will raise a ValueError because these strings will not be found in QUANTIZATION_METHODS (which is derived from get_args(QuantizationMethods)). This breaks the stated backward compatibility. Please include "gptq" and "gptq_marlin" in the QuantizationMethods literal to maintain compatibility.
| "auto_gptq", | |
| "auto_gptq", | |
| "gptq", | |
| "gptq_marlin", |
There was a problem hiding this comment.
Keeped gptq and gptq_marlin in QuantizationMethods
| self.desc_act = desc_act | ||
| self.lm_head_quantized = lm_head_quantized | ||
| self.pack_factor = Fraction(32, self.weight_bits) | ||
| if self.weight_bits not in [2, 3, 4, 8]: |
There was a problem hiding this comment.
It is intentional that we are dropping support for 2 and 3 bit right?
There was a problem hiding this comment.
Yes. The Marlin kernels only support 4-bit and 8-bit symmetric GPTQ. The Exllama fallback kernel also only lists uint4b8 and uint8b128 as supported (2/3-bit is commented out as untested). Since neither backend actively supports 2/3-bit, we're dropping it from the config rather than silently failing at runtime.
|
This pull request has merge conflicts that must be resolved before it can be |
| "awq_marlin", | ||
| "auto_gptq", |
There was a problem hiding this comment.
Can we remove this? Why not?
There was a problem hiding this comment.
gptq and gptq_marlin are in the overrides list because they also resolve to AutoGPTQConfig which has override_quantization_method(), and the validation requires any method with that override to be listed.
| with vllm_runner(model_id, dtype=torch.float16, max_model_len=512) as llm: | ||
|
|
||
| def check_model(model_id): | ||
| for name, submodule in model_id.named_modules(): | ||
| # Could check more modules if necessary | ||
| if name == "model_id.layers.0.self_attn.qkv_proj": | ||
| assert isinstance(submodule.quant_method, linear_method_cls) | ||
|
|
||
| config = submodule.quant_method.quant_config |
There was a problem hiding this comment.
why does this get deleted?
There was a problem hiding this comment.
It is intentional that we are dropping support for 2 and 3 bit right?
Marlin only supports 4-bit and 8-bit symmetric, and even the Exllama fallback kernel only lists uint4b8 and uint8b128 as supported — 2/3-bit is explicitly noted as untested. Since no kernel backend actively supports or tests 2/3-bit GPTQ, it's better to reject it early with a clear error than silently produce incorrect results.
For test_gptq_v2.py — the test model (Qwen3-1.7B-w2g64-gptq_v2) is 2-bit, so the tests are now skipped. The config property assertions were removed since they would never execute. I kept the file as a placeholder so it can be unskipped if a 4/8-bit GPTQv2 test model is added later. Happy to remove the file entirely if you prefer.
| @@ -293,6 +266,16 @@ def is_gptq_marlin_compatible(cls, quant_config: dict[str, Any]): | |||
| quant_type=cls.TYPE_MAP[(num_bits, sym)], group_size=group_size | |||
| ) | |||
|
|
|||
| @classmethod | |||
| def override_quantization_method( | |||
There was a problem hiding this comment.
cant we get rid of this?
There was a problem hiding this comment.
The override is needed to handle the case where the user passes --quantization auto_gptq (or gptq_marlin) but the HF config has quant_method: gptq. Without override_quantization_method(), the verification would raise a mismatch error between auto_gptq (user) and gptq (HF config). The override normalizes all variants to auto_gptq before that check runs.
|
Hi @chengyinie, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
Rebased. CI mypy/ruff failures are pre-existing on main from #35859 (quark_moe.py references undefined FusedMoE). Our tests pass locally -- verified test_auto_gptq.py in Docker. Will auto-pass once the quark fix lands. |
- Renames gptq_marlin.py → auto_gptq.py - GPTQMarlinConfig renamed to AutoGPTQConfig; get_name() returns "auto_gptq" - Adds override_quantization_method() to auto-convert models with quant_method: gptq to use auto_gptq - Updates all imports across the codebase to use auto_gptq module - Adds gptq, gptq_marlin to overrides list in model.py for validation - Adds test_auto_gptq.py for the new quantization method - Removes legacy gptq.py (exllama kernels); Marlin covers all use cases - Maintains backward compatibility: "gptq" and "gptq_marlin" still work Closes vllm-project#37765 Signed-off-by: Chengyi Nie <cnie@roblox.com>
…tq rename Update test expectations in test_configs.py to reflect that AutoGPTQConfig.get_name() now returns "auto_gptq" instead of "gptq_marlin". Also add "auto_gptq" to ROCm's supported_quantization list so models resolve correctly on ROCm. Signed-off-by: Chengyi Nie <cnie@roblox.com> Made-with: Cursor
…rename Update int_wna16.py imports and type annotations from GPTQMarlinConfig to AutoGPTQConfig, and update cpu.yaml CI file watcher path from gptq_marlin.py to auto_gptq.py. These files were added/modified after the original PR was authored and still referenced the old module. Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>
aaa9a46 to
2bda97d
Compare
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Root cause: MergedColumnParallelLinear with output_sizes=[num_v_heads]*2 produces per-rank outputs below Marlin's MIN_THREAD_N=64 at TP>=2 (e.g. Qwen3.5 397B: 64/TP=4 = 16 < 64). Fix per @Isotr0py's suggestion: pass disable_tp=True to the existing MergedColumnParallelLinear when a Marlin-backed quant config is in use on a non-interleaved (Qwen3.5) layout, so each rank computes the full projection. forward_cuda() then slices b/a to the local TP partition. Qwen3-Next's interleaved layout, other quant schemes (FP8, unquantized), and non-CUDA platforms keep normal TP sharding. Rebased onto current main: - Imports renamed: GPTQMarlinConfig moved to AutoGPTQConfig in vllm-project#38288. - Gated maybe_disable_tp on (not gqa_interleaved_layout) so the new fix_query_key_value_ordering path on Qwen3-Next stays correct (it expects a TP-sharded mixed_ba shape for the [ng, 2*np/ng] view). - forward_cuda Qwen3.5 branch chunks via the new split_ba helper. - ROCm/XPU/CPU paths skipped via current_platform.is_cuda() gate. Closes vllm-project#35924 Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
|
It looks like this PR causes 2-bit and 3-bit no longer be supported in GPTQ. Is this an intentional change, or have I overlooked something? |
|
Since GPTQ is one of the few backends that supports 2-bit and 3-bit quantization, keeping this support could benefit the community. Although these configurations may be slower and sometimes suffer from accuracy issues, they are still valuable for exploring the limits of extremely low-bit quantization. |
|
Yes, this was intentional as we discussed during the review process. The consolidation removed the old gptq.py (Exllama kernels) and unified everything under auto_gptq.py (Marlin kernels). Since Marlin only supports 4-bit and 8-bit symmetric (uint4b8, uint8b128), and the Exllama fallback had 2/3-bit commented out as untested, we chose to reject unsupported configs early rather than silently fail at runtime. I agree 2/3-bit GPTQ has value for the community, especially for low-bit quantization research. There are some paths to restore it in my mind:
@robertgshaw2-redhat @jikunshang WDYT about it? |
|
Would really like to see 2/3-bit GPTQ support revived. I agree with the @cnie-rblx plan, but I can also work on it if you prefer. |
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…-project#38288) Signed-off-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Chengyi Nie <cnie@roblox.com> Co-authored-by: Cursor <cursoragent@cursor.com>
|
I created a separate issue as a feature request: #45051 |
Purpose
Consolidate GPTQ quantization by renaming
gptq_marlin.pytoauto_gptq.pyas requested in #37765.Key changes:
gptq_marlin.py→auto_gptq.pyGPTQMarlinConfig.get_name()now returns"auto_gptq"override_quantization_method()to auto-convert models withquant_method: gptqto useauto_gptqauto_gptqmoduleauto_gptqto overrides list inmodel.pytest_auto_gptq.pyfor the new quantization methodtest_gptq_marlin.pyskipif condition to checkauto_gptqsupportquantization="gptq"still works (maps to GPTQMarlinConfig)Closes [Feature]: Consolidate GPTQ Quantization #37765
Test Plan