[CPU] Introduce Int4OpaqueTensor to replace Int4CPULayout in AQT by Xia-Weiwen · Pull Request #2798 · pytorch/ao

Xia-Weiwen · 2025-08-19T03:15:43Z

Summary
This PR adds Int4OpaqueTensor to replace the AffineQuantizedTensor (AQT) with Int4CPULayout, since AQT will be deprecated.

Test plan

pytest -sv test/quantization/quantize_/workflows/int4/test_int4_opaque_tensor.py

pytorch-bot · 2025-08-19T03:15:47Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2798

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e9a5fa7 with merge base 9056c46:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copilot

Pull Request Overview

This PR introduces Int4WoqCpuTensor as a replacement for AQT tensor with Int4CPULayout to support int4 weight-only quantization on CPU with groupwise quantization.

Key changes:

Adds new Int4WoqCpuTensor class for CPU-specific int4 weight-only quantization
Integrates the new tensor type into the quantization API and workflow system
Adds comprehensive test coverage for the new tensor implementation

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py	Implements the core `Int4WoqCpuTensor` class with CPU-optimized int4 quantization
torchao/quantization/quantize_/workflows/__init__.py	Adds export for the new tensor class
torchao/quantization/quantize_/common/packing_format.py	Adds `INT4_WOQ_CPU` packing format enum value
torchao/quantization/quant_api.py	Integrates new tensor into quantization workflow
torchao/quantization/__init__.py	Adds public API export for the tensor class
test/quantization/quantize_/workflows/int4/test_int4_woq_cpu_tensor.py	Comprehensive test suite for the new tensor implementation


jerryzh168 · 2025-08-19T16:05:00Z

+    """
+    int4_woq_cpu is referring to the format used by int4 weight-only quantization on CPU, which is a groupwise quantization format.
+    """
+    INT4_WOQ_CPU = "int4_woq_cpu"


nit: we need to change the name to describe how the quantized data is laid out / packed I think

Sure. I have changed it to int4_tinygemm_cpu
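For readers unfamiliar with the term, the groupwise quantization named in the docstring above can be sketched in plain Python. This is a generic asymmetric int4 scheme with one scale and one minimum per group of `group_size` values along K; the function names are hypothetical, and the exact rounding and zero-point convention of the CPU kernel may differ.

```python
def quantize_groupwise_int4(row, group_size):
    """Quantize one weight row to unsigned int4, one (scale, min) pair per group."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    qvals, scales, mins = [], [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        # 4 bits give 16 levels (0..15); fall back to 1.0 for constant groups.
        scale = (hi - lo) / 15 or 1.0
        scales.append(scale)
        mins.append(lo)
        qvals.extend(min(15, max(0, round((w - lo) / scale))) for w in group)
    return qvals, scales, mins


def dequantize_groupwise_int4(qvals, scales, mins, group_size):
    """Map int4 codes back to floats using the scale and min of their group."""
    return [
        q * scales[i // group_size] + mins[i // group_size]
        for i, q in enumerate(qvals)
    ]
```

Smaller groups track the local weight range more closely at the cost of storing more scales and minimums.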

jerryzh168 · 2025-08-19T16:10:09Z

+    packed_weight = weight_tensor.qdata.contiguous()
+    scale_and_zero = weight_tensor.qscale_and_zero.contiguous()


are these contiguous calls needed?

Removed. Thanks.

jerryzh168 · 2025-08-19T16:11:44Z

+@unittest.skipIf(not torch_version_at_least("2.6.0"), "Need pytorch 2.6+")
+class TestInt4WoqCpuTensor(TestCase):
+    @parametrize("group_size", [32, 64, 128])
+    def test_linear(self, group_size):


can you align with

ao/test/quantization/quantize_/workflows/int4/test_int4_marlin_sparse_tensor.py

Line 52 in 72b35bf

def test_linear(self, config, sizes):

to test more input shapes and also add a compile test as well

Sure. Updated.

jerryzh168 · 2025-08-20T00:01:10Z

+      For data locality, we preshuffle the data in plain layout (N, K/2) to (N/block_n, K, block_n/2), where block_n = 64. And when packing
+      the last dimension, data are shuffled by lanes before packing two int4 to one int8:
+      block_n = 64 = 16 * 4, so we have 4 lanes, each lane has 16 int4s = [lane0, lane1, lane2, lane3]. We pack them as [lane0|lane2, lane1|lane3].
+      See https://github.com/pytorch/pytorch/blob/32eee8ed225d9f10fbbcb38c24b8b44c24c0c97c/aten/src/ATen/native/cpu/int4mm_kernel.cpp#L583 for more details.


also if this is based on hardware at the quantization time, what do we do if users quantize the model in one CPU and want to run the model in another CPU?

Thanks for the question. The packing and compute kernels still work together across different machines, but performance is not guaranteed.
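The lane shuffle described in the docstring above can be sketched in plain Python for a single k index across one block of 64 output channels. The function names are hypothetical, and the nibble order (which lane lands in the low four bits) is an assumption here; the linked kernel source is the authority on that detail.

```python
BLOCK_N = 64   # block size along N, from the docstring
LANE = 16      # 64 = 16 * 4, so four lanes of 16 int4 values


def pack_block_row(vals):
    """Pack 64 int4 values into 32 bytes as [lane0|lane2, lane1|lane3]."""
    if len(vals) != BLOCK_N or any(not 0 <= v < 16 for v in vals):
        raise ValueError("expected 64 values in [0, 15]")
    lane0, lane1, lane2, lane3 = (vals[i * LANE:(i + 1) * LANE] for i in range(4))
    # Assumption: lane0/lane1 go in the low nibble, lane2/lane3 in the high nibble.
    packed02 = [(hi << 4) | lo for lo, hi in zip(lane0, lane2)]
    packed13 = [(hi << 4) | lo for lo, hi in zip(lane1, lane3)]
    return packed02 + packed13


def unpack_block_row(packed):
    """Invert pack_block_row, returning the 64 values in their original order."""
    first, second = packed[:LANE], packed[LANE:]
    lane0 = [b & 0x0F for b in first]
    lane2 = [b >> 4 for b in first]
    lane1 = [b & 0x0F for b in second]
    lane3 = [b >> 4 for b in second]
    return lane0 + lane1 + lane2 + lane3
```

Because the byte order depends on this kernel-specific shuffle, code outside the kernel cannot read individual weights without knowing the scheme.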

jerryzh168 · 2025-08-20T20:49:26Z

we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since packing format is supposed to have a fixed layout, will get back to you

Xia-Weiwen · 2025-08-21T01:18:11Z

> we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since packing format is supposed to have a fixed layout, will get back to you

Ok, sure.

Xia-Weiwen · 2025-08-21T01:25:42Z

> we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since packing format is supposed to have a fixed layout, will get back to you

BTW, how does CUDA handle data layout on different platforms? Does CUDA use the same layout on all platforms? Thanks.
I think we can add restrictions that the layout can only be available on platforms with AVX512 support.

jerryzh168 · 2025-08-21T01:44:13Z

> > we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since packing format is supposed to have a fixed layout, will get back to you
>
> BTW, how does CUDA handle data layout on different platforms? Does CUDA use the same layout on all platforms? Thanks. I think we can add restrictions that the layout can only be available on platforms with AVX512 support.

packing is typically specific to kernel I think, for int4, right now we have tensor core tiled packing (for tinygemm kernel) and preshuffled packing for fbgemm preshuffled kernel, and there is another plain packing for fbgemm non-preshuffled kernel

main thing is it's a fixed format and can be explained and understood in torchao

Xia-Weiwen · 2025-08-21T01:50:00Z

> main thing is it's a fixed format and can be explained and understood in torchao

I see. So how about requiring AVX512?

jerryzh168 · 2025-08-21T02:03:08Z

> > main thing is it's a fixed format and can be explained and understood in torchao
>
> I see. So how about requiring AVX512?

that sounds OK, although I saw you have other hardware situations as well like AVX2 and non-vectorized, so we need one for each packing. please check slack message, need your input on the refactor effort

jerryzh168 · 2025-08-23T00:26:57Z

+    # groupwise int4 quantization
+    groupsize = weight_tensor.block_size[1]
+    y = torch.ops.aten._weight_int4pack_mm_for_cpu(
+        act_mat.contiguous(), packed_weight, groupsize, scale_and_zero


is this contiguous call needed?

I think so. We assume input is contiguous in the kernel.

jerryzh168 · 2025-08-23T00:49:28Z

+        quantized_and_compiled = compiled_linear(input)
+        self.assertTrue(compute_error(original, quantized_and_compiled) > 20)
+
+    @parametrize("dtype", [torch.float, torch.bfloat16, torch.half])


nit: torch.float32, torch.bfloat16, torch.float16 will be clearer I feel

Thanks. Updated.
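The `compute_error(...) > 20` assertion in the snippet above reads as a signal-to-quantization-noise check in decibels. A minimal sketch of that metric, assuming it is the usual 20 * log10 of the norm ratio (the helper name `sqnr_db` is hypothetical and works on plain Python lists rather than tensors):

```python
import math


def sqnr_db(reference, approx):
    """Signal-to-quantization-noise ratio in dB; higher means closer outputs."""
    signal = math.sqrt(sum(r * r for r in reference))
    noise = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, approx)))
    if noise == 0:
        return math.inf  # identical outputs
    return 20 * math.log10(signal / noise)
```

Under this definition, a threshold of 20 dB means the error norm is at most one tenth of the reference output norm.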

jerryzh168 · 2025-08-26T00:03:29Z

Hi @Xia-Weiwen, can you use the Opaque packing format from #2878 for the tensor, since this does not have a fixed format?

Xia-Weiwen · 2025-08-26T01:32:54Z

> Hi @Xia-Weiwen, can you use the Opaque packing format from #2878 for the tensor, since this does not have a fixed format?

Thanks. Shall I use it or add a new one? What if we have more opaque formats in the future?

jerryzh168 · 2025-08-26T01:53:46Z

> > Hi @Xia-Weiwen, can you use the Opaque packing format from #2878 for the tensor, since this does not have a fixed format?
>
> Thanks. Shall I use it or add a new one? What if we have more opaque formats in the future?

just use the same one I think, we can add more Opaque format if more are needed I feel

Xia-Weiwen · 2025-08-26T02:13:15Z

> just use the same one I think, we can add more Opaque format if more are needed I feel

That's OK with me, although the name sounds too general. I will change the names to opaque; please check whether that looks good to you. Thanks.
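The distinction discussed in this thread, fixed packing formats versus an opaque one, can be illustrated with a small sketch. All names below (`PackingFormatSketch`, `PackedWeightSketch`, `can_interpret_layout`) are hypothetical stand-ins and not torchao's actual definitions.

```python
from dataclasses import dataclass
from enum import Enum


class PackingFormatSketch(str, Enum):
    """Hypothetical stand-in for a packing format enum."""
    PLAIN = "plain"              # fixed, documented layout
    PRESHUFFLED = "preshuffled"  # fixed, documented layout
    OPAQUE = "opaque"            # layout chosen by the kernel, not documented


@dataclass
class PackedWeightSketch:
    data: bytes
    packing_format: PackingFormatSketch


def can_interpret_layout(weight):
    """Only fixed formats may be decoded by code outside the kernel."""
    return weight.packing_format is not PackingFormatSketch.OPAQUE
```

With a single opaque member, several kernel-specific layouts can share one enum value, which matches the suggestion to reuse the existing format and add more only if needed.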

Xia-Weiwen · 2025-08-26T02:34:27Z

Hi @jerryzh168 Please review again. Thanks.

jerryzh168 · 2025-08-26T03:05:42Z

sorry, this should be Int4OpaqueTensor, can you update?

It's updated

jerryzh168 · 2025-08-26T03:06:22Z

+    This is an opaque tensor subclass, the packing format is not exposed to the rest of the system. See the note below for more details.
+
+    Tensor Attributes:
+        qdata: preshuffled and packed int4 weight for tinygemm, always viewed as a 2D (N, K/2) tensor, last dimension is packed


also tinygemm is a gpu library I think, is this really related to tinygemm?

We are reusing the "tinygemm" name for CPU in torch core I think. I can change it to the following if that's OK with you:
qdata: preshuffled and packed int4 weight for CPU tinygemm, always viewed as a 2D (N, K/2) tensor, ...

OK, sounds good

jerryzh168

LGTM, thanks!

[CPU] Introduce Int4WoqCpuTensor to replace Int4CPULayout in AQT

e092ec1

meta-cla Bot added the CLA Signed label Aug 19, 2025

Merge branch 'main' into migrate_int4_cpu_layout

79828f0

Xia-Weiwen requested a review from Copilot August 19, 2025 03:19

Xia-Weiwen added the topic: new feature label Aug 19, 2025

refine code

6c16b26

Xia-Weiwen requested a review from Copilot August 19, 2025 03:26

Copilot AI reviewed Aug 19, 2025

Comment thread torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py

Comment thread torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py

refine code

9012a61

Xia-Weiwen marked this pull request as ready for review August 19, 2025 03:30

Xia-Weiwen requested a review from jerryzh168 August 19, 2025 03:30

jerryzh168 reviewed Aug 19, 2025

Comment thread torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py Outdated

jerryzh168 reviewed Aug 19, 2025

Comment thread torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py Outdated

jerryzh168 reviewed Aug 19, 2025

Comment thread torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py Outdated

jerryzh168 reviewed Aug 19, 2025

jerryzh168 reviewed Aug 20, 2025

Refine code

d41e0a8

Xia-Weiwen requested a review from jerryzh168 August 20, 2025 02:43

Xia-Weiwen changed the title from "[CPU] Introduce Int4WoqCpuTensor to replace Int4CPULayout in AQT" to "[CPU] Introduce Int4TinyGemmCpuTensor to replace Int4CPULayout in AQT" Aug 20, 2025

jerryzh168 reviewed Aug 23, 2025

Update UT

969c46a

Xia-Weiwen changed the title from "[CPU] Introduce Int4TinyGemmCpuTensor to replace Int4CPULayout in AQT" to "[CPU] Introduce OpaqueTensor to replace Int4CPULayout in AQT" Aug 26, 2025

Xia-Weiwen requested a review from jerryzh168 August 26, 2025 02:34

jerryzh168 reviewed Aug 26, 2025

Xia-Weiwen requested a review from jerryzh168 August 26, 2025 05:07

Xia-Weiwen changed the title from "[CPU] Introduce OpaqueTensor to replace Int4CPULayout in AQT" to "[CPU] Introduce Int4OpaqueTensor to replace Int4CPULayout in AQT" Aug 26, 2025

Xia-Weiwen added 3 commits August 26, 2025 10:10

Merge branch 'main' into migrate_int4_cpu_layout

80e2e41

Rename tensor & format to opaque

ade2c32

Rename OpaqueTensor -> Int4OpaqueTensor

c81880e

jerryzh168 approved these changes Aug 26, 2025

jerryzh168 requested review from andrewor14, danielvegamyhre, drisspg, jainapurva, liangel-02 and vkuzo August 26, 2025 18:10

Xia-Weiwen merged commit 6f035e8 into pytorch:main Aug 27, 2025
18 checks passed

Merge branch 'main' into migrate_int4_cpu_layout

e9a5fa7

Xia-Weiwen mentioned this pull request Sep 1, 2025

[CPU] Fix AWQ on CPU after refactoring #2688

Closed
