Add NPU (Ascend) backend support for INT4 weight-only quantization workflow #3172
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3172
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 89ad729 with merge base 1e5bc3b:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
| try: | ||
| import torch_npu | ||
| except ImportError: | ||
| torch_npu = None | ||
|
|
PyTorch provides an autoload mechanism, so we do not need to import it explicitly.
| @unittest.skipIf(torch_npu is None, "torch_npu is not available") | ||
| @unittest.skipIf(not torch_npu.npu.is_available(), "NPU not available") |
| @unittest.skipIf(torch_npu is None, "torch_npu is not available") | |
| @unittest.skipIf(not torch_npu.npu.is_available(), "NPU not available") | |
| @unittest.skipIf(torch.accelerator.current_accelerator(True).type == "npu" and torch.accelerator.is_available(), "NPU not available") |
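A minimal sketch of a skip guard in the spirit of this suggestion, assuming the intent is to run the tests only when the current accelerator is an available NPU. Note that `skipIf` skips when its condition is true, so the condition passed to it has to be the negation of "NPU is available". The helper name `npu_is_available` and the test class are hypothetical; the helper returns False when torch or `torch.accelerator` is missing, since `current_accelerator()` can return `None`.

```python
import unittest


def npu_is_available() -> bool:
    """Return True only if the current accelerator is an available NPU."""
    try:
        import torch
    except ImportError:
        return False
    # torch.accelerator only exists in recent PyTorch releases.
    accelerator = getattr(torch, "accelerator", None)
    if accelerator is None or not accelerator.is_available():
        return False
    current = accelerator.current_accelerator()
    return current is not None and current.type == "npu"


# Skip the whole class unless an NPU is present.
@unittest.skipIf(not npu_is_available(), "NPU not available")
class ExampleNPUTest(unittest.TestCase):  # hypothetical test class
    def test_placeholder(self):
        self.assertTrue(npu_is_available())
```

Evaluating the guard in a helper avoids an `AttributeError` at import time on machines without any accelerator.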
| @unittest.skipIf( | ||
| version.parse(torch_npu.__version__) < version.parse("2.7.1rc1"), | ||
| "Need torch_npu 2.7.1rc1+", | ||
| ) |
We can remove this check because there is a strict version mapping between PyTorch and torch_npu.
| ) | ||
|
|
||
| assert int_data.dtype == torch.int32, ( | ||
| f"torch_npu.npu_convert_weight_to_int4pack expects `int32` dtype" |
| f"torch_npu.npu_convert_weight_to_int4pack expects `int32` dtype" | |
| f"torch.ops.npu.npu_convert_weight_to_int4pack expects `int32` dtype" |
| ) | ||
|
|
||
| assert int_data.shape[-1] % 8 == 0, ( | ||
| f"torch_npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}" |
| f"torch_npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}" | |
| f"torch.ops.npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}" |
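The alignment requirement in this assertion follows from storing eight 4-bit values in each 32-bit word. Below is a pure-Python sketch of that idea only; it is not the layout used by `npu_convert_weight_to_int4pack`, and the nibble order and function names are assumptions made for illustration.

```python
from typing import List

NIBBLES_PER_INT32 = 8  # 32 bits / 4 bits per value


def pack_int4_row(values: List[int]) -> List[int]:
    """Pack unsigned 4-bit values (0..15) into 32-bit words, 8 per word."""
    if len(values) % NIBBLES_PER_INT32 != 0:
        raise ValueError(
            f"last dim must be a multiple of {NIBBLES_PER_INT32}, got {len(values)}"
        )
    packed = []
    for start in range(0, len(values), NIBBLES_PER_INT32):
        word = 0
        for i, v in enumerate(values[start:start + NIBBLES_PER_INT32]):
            if not 0 <= v <= 15:
                raise ValueError(f"value {v} does not fit in 4 bits")
            word |= v << (4 * i)  # assumed order: lowest nibble first
        packed.append(word)
    return packed


def unpack_int4_row(words: List[int]) -> List[int]:
    """Inverse of pack_int4_row: recover the 4-bit values from each word."""
    return [(w >> (4 * i)) & 0xF for w in words for i in range(NIBBLES_PER_INT32)]
```

A row of length K therefore packs into K/8 words, which matches the `(N, K/8)` shape described for `qdata`. Python integers are unbounded, so the packed words here are unsigned values rather than wrapped `int32` values.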
|
Hi @jcaip @jerryzh168 , please help to review it, thanks! |
| and torch.accelerator.is_available(), | ||
| "NPU not available", | ||
| ) | ||
| class Int4PlainInt32TensorNPU(TestCase): |
Just curious, do we need NPUs to test this? I don't think we have any in CI.
jcaip
left a comment
Thanks for the PR @orangeH25 @fffrog!
The code looks good to me, but I'm curious on how to best test this? It looks like we skip tests in CI because we don't have NPU devices. I believe that NPU support was added to TorchTune as well, do you know how they test device specific functionality there?
Also, just a heads up most of the team is at PTC / Open source AI week in SF this week, so we might be a little slow in responding :)
|
please don't include device |
| int4 weight-only quantization on Ascend NPU backend (groupwise quantization only) | ||
|
|
||
| Tensor Attributes: | ||
| qdata: (N, K/8), packed int4 weight, the data type is int32 here with 8*int4, the original dtype can be float16 or bfloat16 |
does this exactly align with Int4PlainInt32Tensor? if so, please merge with that tensor subclass
|
Hi @jcaip @jerryzh168, thanks for the review!
Yes, this case is actually pretty common in open-source projects. A typical approach is to set up a
You mean that we should keep the entry logic in

    elif int4_packing_format == Int4PackingFormat.PLAIN_INT32:
        new_weight = Int4PlainInt32Tensor.from_hp(
            weight,
            block_size,
        )
        return new_weight

and then handle different backend implementations in the class itself:

    class Int4PlainInt32Tensor(TorchAOBaseTensor):
        ...
        @classmethod
        def from_hp(
            cls,
            w: torch.Tensor,
            block_size: List[int],
        ):
            # Dispatch on the device type of the high-precision weight.
            if w.device.type == "xpu":
                return from_hp_xpu(cls, w, block_size)
            elif w.device.type == "npu":
                return from_hp_npu(cls, w, block_size)

    implements = Int4PlainInt32Tensor.implements
    implements_torch_function = Int4PlainInt32Tensor.implements_torch_function

    @implements(aten.linear.default)
    @implements_torch_function(torch.nn.functional.linear)
    def _(func, types, args, kwargs):
        input_tensor, weight_tensor, bias = (
            args[0],
            args[1],
            args[2] if len(args) > 2 else None,
        )
        # Dispatch on the device type of the activation.
        if input_tensor.device.type == "xpu":
            return linear_xpu(input_tensor, weight_tensor, bias)
        elif input_tensor.device.type == "npu":
            return linear_npu(input_tensor, weight_tensor, bias)

Did I get that right? Happy to hear any thoughts or suggestions you might have! |
Yes that's correct |
Got it, I will follow this approach, thanks! |
7808297 to ea2aa7a
Hi @jerryzh168 @jcaip , I’ve made those changes, please take a look, really appreciate it! |
|
Hi @jerryzh168 @jcaip, could you please take another look when you have a moment? Thanks a lot! |
jcaip
left a comment
A couple nits but looks good to me @orangeH25 @fffrog
Can we set up the downstream testing you mentioned before we merge this?
|
|
||
| y = torch.ops.npu.npu_weight_quant_batchmatmul( | ||
| x=act_mat, | ||
| weight=packed_weight.contiguous().transpose(-1, -2), |
do we want to call contiguous() every time we do matmul? should we save the packed_weight in contiguous format instead to only do this once?
Thanks! Addressed: packed_weight is now made contiguous once, when constructing the Int4PlainInt32Tensor.
| elif input_tensor.device.type == "npu": | ||
| return _linear_npu(input_tensor, weight_tensor, bias) | ||
| else: | ||
| raise AssertionError(f"Int4PlainInt32Tensor does not support device '{input_tensor.device.type}' yet.") |
nit: NotImplementedError or ValueError is better here
| elif w.device.type == "npu": | ||
| return _from_hp_npu(cls, w, block_size) | ||
| else: | ||
| raise AssertionError(f"Int4PlainInt32Tensor does not support device '{w.device.type}' yet.") |
nit: ValueError or NotImplementedError here.
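A minimal sketch of device-based dispatch that raises `NotImplementedError` for unsupported backends, as suggested in these nits. It uses a lookup table instead of the `if`/`elif` chain in the PR, works on plain strings rather than tensors, and the per-backend functions are hypothetical stand-ins.

```python
from typing import Callable, Dict


# Hypothetical per-backend implementations; the real ones operate on tensors.
def _from_hp_xpu(w: str) -> str:
    return f"xpu-packed({w})"


def _from_hp_npu(w: str) -> str:
    return f"npu-packed({w})"


_FROM_HP_BY_DEVICE: Dict[str, Callable[[str], str]] = {
    "xpu": _from_hp_xpu,
    "npu": _from_hp_npu,
}


def from_hp(w: str, device_type: str) -> str:
    """Route to the backend implementation registered for device_type."""
    try:
        impl = _FROM_HP_BY_DEVICE[device_type]
    except KeyError:
        # An unsupported backend is a missing feature, not a broken invariant,
        # so NotImplementedError fits better than AssertionError.
        raise NotImplementedError(
            f"Int4PlainInt32Tensor does not support device '{device_type}' yet."
        ) from None
    return impl(w)
```

Unlike `assert`, an explicit `raise` is not stripped when Python runs with `-O`, so the error path stays active in optimized runs.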
| assert int_data.shape[-1] % 8 == 0, ( | ||
| f"torch.ops.npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}" | ||
| ) | ||
|
|
Will we ever run into a case where we have NPU support but this op is missing? Maybe in an earlier version of torch_npu? Should we throw a cleaner error message in that case?
It'd be good to add a comment here on where this op is defined and what version of torch npu is needed.
Thanks for the reminder. Since torch and torch_npu versions are tightly coupled, I added

    # Require PyTorch 2.7.1+ for NPU backend ops and backward compatibility.
    assert torch_version_at_least("2.7.1"), (
        "Need PyTorch 2.7.1+ for NPU backend op support."
    )

at the beginning. Does this make the version requirement clear enough?
Let's be a tad more explicit; I want to make it clear it's PyTorch NPU >= 2.7.1 and not regular torch:

    assert (
        torch.accelerator.is_available()
        and torch.accelerator.current_accelerator().type == "npu"
        and torch_version_at_least("2.7.1")
    ), (
        f"PyTorch NPU 2.7.1+ needed for int4 packing and matmul ops, {torch.__version__} found"
    )
Sure! We’ll complete the downstream testing setup. |
|
Awesome, thank you! If you have any benchmarking numbers you can share as well that would be great :) |
Thanks! We’ll add some benchmarking results soon. |
| @unittest.skipIf(not torch_version_at_least("2.8.0"), "Need pytorch 2.8+") | ||
| @unittest.skipIf(not torch.xpu.is_available(), "XPU not available") | ||
| class Int4PlainInt32Tensor(TestCase): | ||
| class Int4PlainInt32TensorXPU(TestCase): |
| or not torch.accelerator.is_available(), | ||
| "NPU not available", | ||
| ) | ||
| class Int4PlainInt32TensorNPU(TestCase): |
can this be merged with the xpu case?
Sure, I’ll combine them into a single test class.
jerryzh168
left a comment
changes look good to me, just wondering if we can merge the tests
|
also looks like a lot of broken CI, please make sure these pass |
|
also for benchmark, maybe @jainapurva can you share some guide for adding benchmark for these new things in torchao? |
Thanks! Yes, I’ll go ahead and merge them, and address the CI failures in the same update. |
|
|
||
| @unittest.skipIf(not torch_version_at_least("2.8.0"), "Need pytorch 2.8+") | ||
| @unittest.skipIf(not torch.xpu.is_available(), "XPU not available") | ||
| _MIN_VER = { |
@orangeH25 Thank you for the update.
PyTorch has a helper function called instantiate_device_type_tests that can help you easily automate the creation of device test cases, which should meet your needs.
38b1f49 to fa3220f
|
Hi @jerryzh168 @jcaip, the XPU and NPU test cases have already been merged. Please take a look — feedback welcome. Thanks! |
|
Hi @jerryzh168 @jcaip, just a gentle ping on this one when you get a chance. |
jcaip
left a comment
Changes look good to me, but we should put the docs in the quantization README and not the top level README.
|
|
||
| Check out our [docs](https://docs.pytorch.org/ao/main/) for more details! | ||
|
|
||
| ## Third-party Pipeline Status |
Can we add this and mention NPU support requirements (torch_npu >=2.7.1) in the quantization README instead? I would put here: https://github.com/pytorch/ao/tree/main/torchao/quantization#a16w4-weightonly-quantization
Sure, I’ll add a subheading under the a16w4-weightonly-quantization section for this
| if "xpu" in device and dtype == torch.float16: | ||
| pytest.skip(f"{device} test_activation_prescaling don't test {dtype}") | ||
|
|
||
| threshold = thresholds.get(device.split(":")[0]) |
does device_type have :? I thought it should only be things like xpu, npu, cuda, not cuda:0
Ah, good catch — you’re right! I actually meant to use the function argument device there, but forgot to remove device = self.device_type from the setup.
device can include the suffix like ":0", while device_type should not. I’ll fix that, thanks for pointing it out!
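The distinction between a device string and a device type can be sketched as follows. The threshold values and helper names are hypothetical; only the `split(":")[0]` idiom comes from the diff under review.

```python
from typing import Dict

# Hypothetical tolerance values, keyed by device type (no index suffix).
THRESHOLDS: Dict[str, float] = {"xpu": 1e-2, "npu": 1e-2}


def device_type_of(device: str) -> str:
    """Strip an optional ':<index>' suffix, e.g. 'npu:0' -> 'npu'."""
    return device.split(":")[0]


def threshold_for(device: str) -> float:
    """Look up the tolerance for a device string such as 'xpu' or 'xpu:1'."""
    return THRESHOLDS[device_type_of(device)]
```

Calling `split(":")[0]` on a value that is already a bare device type is harmless but redundant, which is why the lookup only needs it when it receives the full device string.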
jerryzh168
left a comment
looks good overall I think, thanks
|
It looks like some CI failures may be caused by my branch not being up to date with the latest main. I’ll rebase it |
|
Hi @jerryzh168 @jcaip, please take a look, thanks! |
|
Hi @jerryzh168 @jcaip, gentle ping on this PR — happy to make any changes needed. Thanks for your time! |
|
merging, ci failure looks unrelated. Thanks again for working on this @orangeH25 @fffrog! Let us know how the performance work progresses, we can give a shout-out in GPU-MODE or something once it's good. |
|
@jerryzh168 @jcaip , thanks a lot! We’ll keep pushing on the performance side and share updates once we have solid numbers. Thanks again for all the guidance! |
…rkflow (pytorch#3172)
* Add NPU (Ascend) backend support for INT4 weight-only quantization workflow
* use torch.ops.npu prefix and drop redundant torch_npu import
* Modify test file and update comments
* add: merge NPU(Ascend) backend logic in Int4PlainInt32Tensor subclass
* ruff format cleanup, replace error types, add torch version check
* add torch_npu version assertion and show downstream testing result
* add downstream testing result
* unify NPU and XPU test cases into a single class
* move CI display to quantization README and update test file
Related to #3044
Summary
This PR adds NPU (Ascend) backend support for the INT4 weight-only quantization workflow.
It introduces a new tensor subclass, Int4PlainInt32TensorNPU, aligned with the existing Int4PlainInt32Tensor for the plain_int32 packing format.
Environment
Files changed
Modified
torchao/quantization/__init__.py
torchao/quantization/quant_api.py
torchao/quantization/quantize_/workflows/__init__.py
Added
torchao/quantization/quantize_/workflows/int4/int4_plain_int32_tensor_npu.py
test/quantization/quantize_/workflows/int4/test_int4_plain_int32_tensor_npu.py
Implementation Overview
Int4PlainInt32TensorNPU to enable NPU backend support for INT4 weight-only quantization.
quant_api.py for dispatch.
__init__.py files to ensure proper import and exposure.
Test Case
test/quantization/quantize_/workflows/int4/test_int4_plain_int32_tensor_npu.py