Add NPU (Ascend) backend support for INT4 weight-only quantization workflow by orangeH25 · Pull Request #3172 · pytorch/ao

orangeH25 · 2025-10-14T11:44:22Z

Related to #3044

Summary

This PR adds NPU (Ascend) backend support for the INT4 weight-only quantization workflow.

It introduces a new tensor subclass, Int4PlainInt32TensorNPU, aligned with the existing Int4PlainInt32Tensor for the plain_int32 packing format.

Environment

torchao version: 0.13.0 (main branch, commit: f64daac)
torch version: 2.7.1
torch_npu version: 2.7.1rc1
Ascend Toolkit (CANN): 8.2.RC1
Device: Ascend 910B4
OS: EulerOS 2.10 (Kernel 4.19.90, aarch64)
Python: 3.11

Files changed

Modified

torchao/quantization/__init__.py
torchao/quantization/quant_api.py
torchao/quantization/quantize_/workflows/__init__.py

Added

torchao/quantization/quantize_/workflows/int4/int4_plain_int32_tensor_npu.py
test/quantization/quantize_/workflows/int4/test_int4_plain_int32_tensor_npu.py

Implementation Overview

Introduces Int4PlainInt32TensorNPU to enable NPU backend support for INT4 weight-only quantization.
Registers the new tensor subclass and integrates it into quant_api.py for dispatch.
Updates __init__.py files to ensure proper import and exposure.
Adds corresponding test cases for NPU workflow.

Test Case

test/quantization/quantize_/workflows/int4/test_int4_plain_int32_tensor_npu.py

pytorch-bot · 2025-10-14T11:44:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3172

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 89ad729 with merge base 1e5bc3b:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Run Regression Tests / test-nightly (CPU Nightly, linux.4xlarge, --pre torch --index-url https://download.pytorch.org/wh... / linux-job (gh) (trunk failure)
test/test_low_bit_optim.py::TestOptim::test_param_groups_optim_name_AdamFp8_device_cpu

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fffrog · 2025-10-14T13:47:30Z

+try:
+    import torch_npu
+except ImportError:
+    torch_npu = None
+


PyTorch provides an autoload mechanism, so we do not need to import it explicitly.

fffrog · 2025-10-14T13:50:58Z

+@unittest.skipIf(torch_npu is None, "torch_npu is not available")
+@unittest.skipIf(not torch_npu.npu.is_available(), "NPU not available")


Suggested change

@unittest.skipIf(torch_npu is None, "torch_npu is not available")

@unittest.skipIf(not torch_npu.npu.is_available(), "NPU not available")

@unittest.skipIf(torch.accelerator.current_accelerator(True).type == "npu" and torch.accelerator.is_available(), "NPU not available")
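Note that `unittest.skipIf` skips a test when its condition is true, so the condition as written in the suggestion would skip exactly when an NPU is present; the later revision quoted further down in this thread uses `or not torch.accelerator.is_available()`. A minimal sketch of a skip guard along those lines follows. The helper name `npu_available` and the test class are hypothetical and not part of the PR, and the `importlib` guard is only there so the snippet also runs on a machine without torch.

```python
import importlib
import unittest


def npu_available() -> bool:
    """Return True only if torch imports and its current accelerator is an NPU."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return False
    accel = getattr(torch, "accelerator", None)  # absent on older PyTorch builds
    if accel is None or not accel.is_available():
        return False
    current = accel.current_accelerator()  # may be None when nothing is detected
    return current is not None and current.type == "npu"


# skipIf skips when the condition is True, hence the "not".
@unittest.skipIf(not npu_available(), "NPU not available")
class ExampleNPUTest(unittest.TestCase):  # hypothetical test class
    def test_placeholder(self):
        self.assertTrue(npu_available())
```

Keeping the check in one helper avoids calling `.type` on a missing accelerator at import time, which would otherwise fail before the skip decorator is evaluated.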

fffrog · 2025-10-14T13:56:02Z

+@unittest.skipIf(
+    version.parse(torch_npu.__version__) < version.parse("2.7.1rc1"),
+    "Need torch_npu 2.7.1rc1+",
+)


We can remove this check because there is a strict version mapping between PyTorch and torch_npu.

fffrog · 2025-10-14T13:57:07Z

+        )
+
+        assert int_data.dtype == torch.int32, (
+            f"torch_npu.npu_convert_weight_to_int4pack expects `int32` dtype"


Suggested change

f"torch_npu.npu_convert_weight_to_int4pack expects `int32` dtype"

f"torch.ops.npu.npu_convert_weight_to_int4pack expects `int32` dtype"

fffrog · 2025-10-14T13:57:22Z

+        )
+
+        assert int_data.shape[-1] % 8 == 0, (
+            f"torch_npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}"


Suggested change

f"torch_npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}"

f"torch.ops.npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}"

orangeH25 · 2025-10-15T06:30:26Z

Hi @jcaip @jerryzh168 , please help to review it, thanks!

jcaip · 2025-10-20T14:32:30Z

+    and torch.accelerator.is_available(),
+    "NPU not available",
+)
+class Int4PlainInt32TensorNPU(TestCase):


Just curious, do we need NPUs to test this? I don't think we have any in CI.

jcaip

Thanks for the PR @orangeH25 @fffrog!

The code looks good to me, but I'm curious on how to best test this? It looks like we skip tests in CI because we don't have NPU devices. I believe that NPU support was added to TorchTune as well, do you know how they test device specific functionality there?

Also, just a heads up most of the team is at PTC / Open source AI week in SF this week, so we might be a little slow in responding :)

jerryzh168 · 2025-10-20T21:42:15Z

please don't include device NPU in the name of Tensor, since packing_format is supposed to be agnostic to device

jerryzh168 · 2025-10-20T21:44:23Z

+    int4 weight-only quantization on Ascend NPU backend (groupwise quantization only)
+
+    Tensor Attributes:
+        qdata: (N, K/8), packed int4 weight, the data type is int32 here with 8*int4, the original dtype can be float16 or bfloat16


does this exactly align with Int4PlainInt32Tensor? if so, please merge with that tensor subclass
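The `(N, K/8)` shape in the docstring follows from storing eight 4-bit values in each 32-bit word, which is also why the packing op requires the last dimension to be a multiple of 8. The pure-Python sketch below illustrates the arithmetic only. The nibble order (first value in the lowest four bits) is an assumption made for illustration; the layout actually produced by `npu_convert_weight_to_int4pack` is not described in this thread.

```python
def pack_int4(values):
    """Pack unsigned 4-bit codes (0..15) into 32-bit words, eight per word."""
    if len(values) % 8 != 0:
        raise ValueError(f"length must be a multiple of 8, got {len(values)}")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for slot, code in enumerate(values[start:start + 8]):
            if not 0 <= code <= 15:
                raise ValueError(f"{code} does not fit in 4 bits")
            word |= code << (4 * slot)  # ASSUMPTION: slot 0 is the lowest nibble
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover eight 4-bit codes from each word."""
    return [(word >> (4 * slot)) & 0xF for word in words for slot in range(8)]


codes = [1, 2, 3, 4, 5, 6, 7, 8]
packed = pack_int4(codes)
print([hex(w) for w in packed])  # ['0x87654321']
assert unpack_int4(packed) == codes
```

The words here are kept as unsigned Python integers; a real `int32` tensor would hold the same bit pattern reinterpreted as a signed value.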

orangeH25 · 2025-10-21T09:12:12Z

Hi @jcaip @jerryzh168, thanks for the review!

Just curious, do we need NPUs to test this? I don't think we have any in CI.

Yes, this case is actually pretty common in open-source projects.

A typical approach is to set up a nightly CI job in the downstream repo that automatically pulls the latest code from the upstream repo each day for build and testing. The results are then shown as a GitHub badge in the upstream repo’s README.md to clearly show the latest build status.

does this exactly align with Int4PlainInt32Tensor? if so, please merge with that tensor subclass

You mean that we should keep the entry logic in quant_api.py unchanged:

elif int4_packing_format == Int4PackingFormat.PLAIN_INT32:
    new_weight = Int4PlainInt32Tensor.from_hp(
        weight,
        block_size,
    )
    return new_weight

and then handle different backend implementations in the from_hp and linear methods of Int4PlainInt32Tensor using simple if/else branches.

class Int4PlainInt32Tensor(TorchAOBaseTensor):
    ...
    @classmethod
    def from_hp(
        cls,
        w: torch.Tensor,
        block_size: List[int],
    ):
        if w.device.type == "xpu":
            return from_hp_xpu(cls, w, block_size)
        elif w.device.type == "npu":
            return from_hp_npu(cls, w, block_size)


implements = Int4PlainInt32Tensor.implements
implements_torch_function = Int4PlainInt32Tensor.implements_torch_function

@implements(aten.linear.default)
@implements_torch_function(torch.nn.functional.linear)
def _(func, types, args, kwargs):
    input_tensor, weight_tensor, bias = (
        args[0],
        args[1],
        args[2] if len(args) > 2 else None,
    )
    
    if input_tensor.device.type == "xpu":
        return linear_xpu(input_tensor, weight_tensor, bias)
    elif input_tensor.device.type == "npu":
        return linear_npu(input_tensor, weight_tensor, bias)

Did I get that right? Happy to hear any thoughts or suggestions you might have!
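For readers who want to run the dispatch idea without any accelerator, here is a small stand-alone sketch. It uses a lookup table instead of the if/else chain proposed above, which is a swapped-in variation and not what the PR does, and every name in it (`register_linear`, `fake_tensor`, the string return values) is hypothetical. `SimpleNamespace` objects stand in for tensors so that only the `device.type` lookup is exercised.

```python
from types import SimpleNamespace

_LINEAR_IMPLS = {}  # maps a device type string to its backend function


def register_linear(device_type):
    """Decorator that records a backend implementation for one device type."""
    def decorator(fn):
        _LINEAR_IMPLS[device_type] = fn
        return fn
    return decorator


@register_linear("xpu")
def _linear_xpu(x, weight, bias):
    return "xpu path"  # stand-in for the real XPU kernel call


@register_linear("npu")
def _linear_npu(x, weight, bias):
    return "npu path"  # stand-in for the real NPU kernel call


def linear(x, weight, bias=None):
    """Route to the backend registered for the input's device type."""
    impl = _LINEAR_IMPLS.get(x.device.type)
    if impl is None:
        raise NotImplementedError(f"no int4 linear for device '{x.device.type}'")
    return impl(x, weight, bias)


def fake_tensor(device_type):
    """Minimal object exposing only .device.type, for illustration."""
    return SimpleNamespace(device=SimpleNamespace(type=device_type))


print(linear(fake_tensor("npu"), weight=None))  # npu path
```

With only two backends the if/else form agreed on in the thread is arguably simpler; a table mainly pays off once more device types are added.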

jerryzh168 · 2025-10-21T17:55:33Z

You mean that we should keep the entry logic in quant_api.py unchanged:
and then handle different backend implementations in the from_hp and linear methods of Int4PlainInt32Tensor using simple if/else branches.

Yes that's correct

orangeH25 · 2025-10-22T06:34:41Z

You mean that we should keep the entry logic in quant_api.py unchanged:
and then handle different backend implementations in the from_hp and linear methods of Int4PlainInt32Tensor using simple if/else branches.

Yes that's correct

Got it, I will follow this approach, thanks!

orangeH25 · 2025-10-22T09:59:05Z

You mean that we should keep the entry logic in quant_api.py unchanged:
and then handle different backend implementations in the from_hp and linear methods of Int4PlainInt32Tensor using simple if/else branches.

Yes that's correct

Hi @jerryzh168 @jcaip , I’ve made those changes, please take a look, really appreciate it!

orangeH25 · 2025-10-28T09:13:05Z

Hi @jerryzh168 @jcaip, could you please take another look when you have a moment? Thanks a lot!

jcaip

A couple nits but looks good to me @orangeH25 @fffrog

Can we set up the downstream testing you mentioned before we merge this?

jcaip · 2025-10-28T12:47:15Z

+
+    y = torch.ops.npu.npu_weight_quant_batchmatmul(
+        x=act_mat,
+        weight=packed_weight.contiguous().transpose(-1, -2),


do we want to call contiguous() every time we do matmul? should we save the packed_weight in contiguous format instead to only do this once?

Thanks! Addressed: packed_weight is now made contiguous once, when constructing the Int4PlainInt32Tensor.

jcaip · 2025-10-28T12:53:34Z

+    elif input_tensor.device.type == "npu":
+        return _linear_npu(input_tensor, weight_tensor, bias)
+    else:
+        raise AssertionError(f"Int4PlainInt32Tensor does not support device '{input_tensor.device.type}' yet.")


nit: NotImplementedError or ValueError is better here

jcaip · 2025-10-28T12:58:33Z

+        elif w.device.type == "npu":
+            return _from_hp_npu(cls, w, block_size)
+        else:
+            raise AssertionError(f"Int4PlainInt32Tensor does not support device '{w.device.type}' yet.")


nit: ValueError or NotImplementedError here.
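The thread does not spell out the reason for this nit. One commonly cited reason is that `assert` statements are removed when Python runs with optimizations enabled, whereas an explicit `raise` always executes. The sketch below demonstrates this with the built-in `compile(..., optimize=...)`; the helper names and the set of supported devices are illustrative assumptions.

```python
CHECK_SRC = "assert device_type in ('xpu', 'npu'), 'unsupported device'"


def assert_fires(optimize: int) -> bool:
    """Compile CHECK_SRC at the given level and report whether the assert fires for 'cuda'."""
    code = compile(CHECK_SRC, "<demo>", "exec", optimize=optimize)
    try:
        exec(code, {"device_type": "cuda"})
    except AssertionError:
        return True
    return False


def require_supported(device_type: str) -> None:
    """Explicit check that is not compiled away under optimization."""
    if device_type not in ("xpu", "npu"):
        raise NotImplementedError(f"device '{device_type}' is not supported yet")


print(assert_fires(0))  # True: the assert is kept at optimization level 0
print(assert_fires(1))  # False: level 1 (as with `python -O`) strips asserts
```

An exception type such as `NotImplementedError` also tells the caller what kind of failure occurred, which a bare `AssertionError` does not.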

jcaip · 2025-10-28T13:07:32Z

+    assert int_data.shape[-1] % 8 == 0, (
+        f"torch.ops.npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}"
+    )
+


Will we ever run into a case where we have NPU support but this op is missing? Maybe in an earlier version of torch_npu? Should we throw a cleaner error message in that case?

It'd be good to add a comment here on where this op is defined and what version of torch npu is needed.

Thanks for the reminder — since torch and torch_npu versions are tightly coupled, I added

# Require PyTorch 2.7.1+ for NPU backend ops and backward compatibility.
assert torch_version_at_least("2.7.1"), (
    "Need PyTorch 2.7.1+ for NPU backend op support."
)

at the beginning. Does this make the version requirement clear enough?

Let's be a tad more explicit, I want to make it clear it's PyTorch NPU >= 2.7.1 and not regular torch

assert (
    torch.accelerator.is_available()
    and torch.accelerator.current_accelerator().type == "npu"
    and torch_version_at_least("2.7.1")
), (
    f"PyTorch NPU 2.7.1+ needed for int4 packing and matmul ops, {torch.__version__} found"
)
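The question of an op being missing despite NPU support could also be answered by probing the operator namespace directly, as a complement to the version assertion. The following is a hedged sketch of that alternative, not what the PR implements. The helper names are hypothetical, and it relies on `torch.ops.<namespace>.<name>` raising `AttributeError` for an unregistered op (`RuntimeError` is caught as well, in case a given PyTorch build surfaces the failure that way).

```python
import importlib


def npu_op_available(op_name: str) -> bool:
    """Return True if torch imports and torch.ops.npu.<op_name> is registered."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return False
    try:
        getattr(torch.ops.npu, op_name)  # lazy lookup; fails for unknown ops
    except (AttributeError, RuntimeError):
        return False
    return True


def require_npu_op(op_name: str) -> None:
    """Raise a readable error instead of failing deep inside the kernel call."""
    if not npu_op_available(op_name):
        raise NotImplementedError(
            f"torch.ops.npu.{op_name} is not registered; "
            "this thread states that torch_npu 2.7.1+ is required"
        )
```

A call such as `require_npu_op("npu_convert_weight_to_int4pack")` before packing would then fail early with a message that names the missing operator.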

orangeH25 · 2025-10-29T06:55:30Z

Can we set up the downstream testing you mentioned before we merge this?

Sure! We’ll complete the downstream testing setup.

jcaip · 2025-10-29T14:44:53Z

Awesome, thank you! If you have any benchmarking numbers you can share as well that would be great :)

orangeH25 · 2025-10-30T02:38:53Z

If you have any benchmarking numbers you can share as well that would be great :)

Thanks! We’ll add some benchmarking results soon.

orangeH25 · 2025-10-31T10:09:01Z

Hi @jerryzh168 @jcaip

1 - The downstream testing has been set up, and the results are displayed in the README.md:

Third-party Pipeline Status

Backend	Inference
Ascend NPU	[status badge]

2 - We’ve completed some initial benchmarking and here are the results:

testing code:

import torch
from torch import nn
from torch.utils.benchmark import Timer
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# 100 stacked 256x256 fp16 linear layers on the NPU
num_layers = 100
m = nn.Sequential(*[
    nn.Linear(256, 256, bias=False, dtype=torch.float16)
    for _ in range(num_layers)
]).to("npu")
x = torch.randn(16, 256, dtype=torch.float16, device="npu")

# eager baseline (two warm-up calls before timing)
_ = m(x)
_ = m(x)
eager_t = Timer(stmt="m(x)", globals={"m": m, "x": x}).blocked_autorange()

# quantized version (quantize_ modifies m in place)
quantize_(m, Int4WeightOnlyConfig(group_size=64, int4_packing_format="plain_int32"))
_ = m(x)
_ = m(x)
quant_t = Timer(stmt="m(x)", globals={"m": m, "x": x}).blocked_autorange()

# a ratio above 1 means the quantized model is faster
print(f"Speedup: {eager_t.mean / quant_t.mean:.2f}x")

Current Status:

Functionality is complete.
Performance is about 60% of the non-quantized version.

Reason:

The performance degradation is primarily because the current implementation of npu_weight_quant_batchmatmul still dequantizes the weights before computation.
This implementation is mainly user-driven — the priority is to enable end-to-end functionality and unblock user workflows. We are aware of the performance gap and will continue optimizing the operator to support direct computation on quantized weights in future updates.
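To make "dequantization before computation" concrete, here is a pure-Python sketch of groupwise 4-bit quantization followed by a dequantize-then-multiply dot product. It is only an illustration of the general idea. The min/max scaling scheme, the function names, and the small group size are assumptions, and none of it reflects how torchao or the NPU operator computes scales internally.

```python
def quantize_group(values):
    """Quantize one group of floats to unsigned int4 codes with its own scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 15 is the largest 4-bit code; guard constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def quantize_row(row, group_size):
    """Split a weight row into groups and quantize each group independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]


def dequantize_row(groups):
    """Rebuild approximate float weights from the per-group codes, scales and offsets."""
    return [code * scale + lo for codes, scale, lo in groups for code in codes]


def dequant_then_dot(x, groups):
    """Dequantize the full row first, then do an ordinary floating-point dot product."""
    weights = dequantize_row(groups)
    return sum(a * b for a, b in zip(x, weights))


row = [i / 10 for i in range(-8, 8)]      # 16 example weights
groups = quantize_row(row, group_size=8)  # two groups of eight
restored = dequantize_row(groups)
```

In this style the multiply still runs on full-width floats, so the 4-bit storage saves memory but adds a conversion step to every call. That is consistent with the slowdown reported above, though the actual operator cost on the NPU is not measured here.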

jerryzh168 · 2025-10-31T20:22:12Z

 @unittest.skipIf(not torch_version_at_least("2.8.0"), "Need pytorch 2.8+")
 @unittest.skipIf(not torch.xpu.is_available(), "XPU not available")
-class Int4PlainInt32Tensor(TestCase):
+class Int4PlainInt32TensorXPU(TestCase):


nit: revert change

jerryzh168 · 2025-10-31T20:22:39Z

+    or not torch.accelerator.is_available(),
+    "NPU not available",
+)
+class Int4PlainInt32TensorNPU(TestCase):


can this be merged with the xpu case?

Sure, I’ll combine them into a single test class.

jerryzh168

changes look good to me, just wondering if we can merge the tests

jerryzh168 · 2025-10-31T20:24:39Z

Also, it looks like a lot of CI is broken; please make sure these jobs pass.

jerryzh168 · 2025-10-31T20:25:30Z

Also, for benchmarking: @jainapurva, could you share a guide for adding benchmarks for new features like this in torchao?

orangeH25 · 2025-11-03T06:56:43Z

changes look good to me, just wondering if we can merge the tests
also looks like a lot of broken CI

Thanks! Yes, I’ll go ahead and merge them, and address the CI failures in the same update.

fffrog · 2025-11-03T09:41:08Z


-@unittest.skipIf(not torch_version_at_least("2.8.0"), "Need pytorch 2.8+")
-@unittest.skipIf(not torch.xpu.is_available(), "XPU not available")
+_MIN_VER = {


@orangeH25 Thank you for the update.

PyTorch has a helper function called instantiate_device_type_tests that can help you easily automate the creation of device test cases, which should meet your needs.

orangeH25 · 2025-11-04T12:44:19Z

Hi @jerryzh168 @jcaip, the XPU and NPU test cases have already been merged. Please take a look — feedback welcome. Thanks!

fffrog

LGTM, thank you.

orangeH25 · 2025-11-11T11:06:15Z

Hi @jerryzh168 @jcaip, just a gentle ping on this one when you get a chance.
Let me know if there’s anything else I should update. Thanks!

jcaip

Changes look good to me, but we should put the docs in the quantization README and not the top level README.

jcaip · 2025-11-11T14:43:30Z


 Check out our [docs](https://docs.pytorch.org/ao/main/) for more details!

+## Third-party Pipeline Status


Can we add this and mention NPU support requirements (torch_npu >=2.7.1) in the quantization README instead? I would put here: https://github.com/pytorch/ao/tree/main/torchao/quantization#a16w4-weightonly-quantization

Sure, I’ll add a subheading under the a16w4-weightonly-quantization section for this

jerryzh168 · 2025-11-11T19:02:02Z

+        if "xpu" in device and dtype == torch.float16:
+            pytest.skip(f"{device} test_activation_prescaling don't test {dtype}")
+
+        threshold = thresholds.get(device.split(":")[0])


does device_type have :? I thought it should only be things like xpu, npu, cuda, not cuda:0

Ah, good catch — you’re right! I actually meant to use the function argument device there, but forgot to remove device = self.device_type from the setup.
device can include the suffix like ":0", while device_type should not. I’ll fix that, thanks for pointing it out!
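The distinction discussed here can be shown in a few lines: a device string may carry an index suffix, while a device type never does. In this sketch the helper name `device_type_of` and the threshold values are hypothetical, chosen only to mirror the `thresholds.get(...)` lookup in the quoted diff.

```python
def device_type_of(device: str) -> str:
    """Strip an optional ':<index>' suffix, e.g. 'npu:0' -> 'npu'."""
    return device.split(":", 1)[0]


# Hypothetical per-backend tolerances, for illustration only.
thresholds = {"xpu": 20.0, "npu": 10.0}

for device in ("npu", "npu:0", "xpu:1"):
    kind = device_type_of(device)
    print(device, "->", kind, thresholds[kind])
```

Applying the split to a value that is already a bare device type is harmless, since a string without a colon is returned unchanged.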

jerryzh168

looks good overall I think, thanks

orangeH25 · 2025-11-12T03:07:35Z

It looks like some CI failures may be caused by my branch not being up to date with the latest main. I’ll rebase it

orangeH25 · 2025-11-12T03:57:52Z

Hi @jerryzh168 @jcaip, please take a look, thanks!

orangeH25 · 2025-11-17T12:15:51Z

Hi @jerryzh168 @jcaip, gentle ping on this PR — happy to make any changes needed. Thanks for your time!

jcaip · 2025-11-18T23:03:38Z

merging, ci failure looks unrelated. Thanks again for working on this @orangeH25 @fffrog! Let us know how the performance work progresses, we can give a shout-out in GPU-MODE or something once it's good.

orangeH25 · 2025-11-19T01:41:08Z

@jerryzh168 @jcaip , thanks a lot! We’ll keep pushing on the performance side and share updates once we have solid numbers. Thanks again for all the guidance!

…rkflow (pytorch#3172)

* Add NPU (Ascend) backend support for INT4 weight-only quantization workflow
* use torch.ops.npu prefix and drop redundant torch_npu import
* Modify test file and update comments
* add: merge NPU(Ascend) backend logic in Int4PlainInt32Tensor subclass
* ruff format cleanup, replace error types, add torch version check
* add torch_npu version assertion and show downstream testing result
* add downstream testing result
* unify NPU and XPU test cases into a single class
* move CI display to quantization README and update test file

orangeH25 added 3 commits October 13, 2025 11:07

Add NPU (Ascend) backend support for INT4 weight-only quantization wo…

f3aefca

…rkflow

use torch.ops.npu prefix and drop redundant torch_npu import

68eea61

Merge branch 'pytorch:main' into quant/int4/wo/0

164435e

meta-cla Bot added the CLA Signed label Oct 14, 2025

orangeH25 marked this pull request as draft October 14, 2025 11:44

fffrog reviewed Oct 14, 2025

Modify test file and update comments

06c77d1

orangeH25 marked this pull request as ready for review October 15, 2025 06:28

jcaip reviewed Oct 20, 2025

jerryzh168 reviewed Oct 20, 2025

Merge branch 'pytorch:main' into quant/int4/wo/0

498f052

add: merge NPU(Ascend) backend logic in Int4PlainInt32Tensor subclass

ea2aa7a

orangeH25 force-pushed the quant/int4/wo/0 branch from 7808297 to ea2aa7a Compare October 22, 2025 09:57

jcaip reviewed Oct 28, 2025

ruff format cleanup, replace error types, add torch version check

ca8f056

orangeH25 added 2 commits October 31, 2025 09:52

add torch_npu version assertion and show downstream testing result

05af947

add downstream testing result

25360da

jerryzh168 reviewed Oct 31, 2025

jerryzh168 approved these changes Oct 31, 2025

fffrog reviewed Nov 3, 2025

unify NPU and XPU test cases into a single class

fa3220f

orangeH25 force-pushed the quant/int4/wo/0 branch from 38b1f49 to fa3220f Compare November 4, 2025 12:38

fffrog approved these changes Nov 5, 2025

jcaip approved these changes Nov 11, 2025

jerryzh168 reviewed Nov 11, 2025

Comment thread test/quantization/quantize_/workflows/int4/test_int4_plain_int32_tensor.py

jerryzh168 approved these changes Nov 11, 2025

orangeH25 added 2 commits November 12, 2025 03:48

move CI display to quantization README and update test file

623c589

Merge branch 'pytorch:main' into quant/int4/wo/0

89ad729

jcaip added the topic: new feature label Nov 18, 2025

jcaip merged commit 7a2a7b3 into pytorch:main Nov 18, 2025
24 of 27 checks passed

