[CPU] Introduce Int4OpaqueTensor to replace Int4CPULayout in AQT#2798
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2798
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit e9a5fa7 with merge base 9056c46.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull Request Overview
This PR introduces Int4WoqCpuTensor as a replacement for AQT tensor with Int4CPULayout to support int4 weight-only quantization on CPU with groupwise quantization.
Key changes:
- Adds new Int4WoqCpuTensor class for CPU-specific int4 weight-only quantization
- Integrates the new tensor type into the quantization API and workflow system
- Adds comprehensive test coverage for the new tensor implementation
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py | Implements the core Int4WoqCpuTensor class with CPU-optimized int4 quantization |
| torchao/quantization/quantize_/workflows/__init__.py | Adds export for the new tensor class |
| torchao/quantization/quantize_/common/packing_format.py | Adds INT4_WOQ_CPU packing format enum value |
| torchao/quantization/quant_api.py | Integrates new tensor into quantization workflow |
| torchao/quantization/__init__.py | Adds public API export for the tensor class |
| test/quantization/quantize_/workflows/int4/test_int4_woq_cpu_tensor.py | Comprehensive test suite for the new tensor implementation |
| """ | ||
| int4_woq_cpu is referring to the format used by int4 weight-only quantization on CPU, which is a groupwise quantization format. | ||
| """ | ||
| INT4_WOQ_CPU = "int4_woq_cpu" |
nit: we need to change the name to describe how the quantized data is laid out / packed I think
Sure. I have changed it to int4_tinygemm_cpu
| packed_weight = weight_tensor.qdata.contiguous() | ||
| scale_and_zero = weight_tensor.qscale_and_zero.contiguous() |
are these contiguous calls needed?
Removed. Thanks.
| @unittest.skipIf(not torch_version_at_least("2.6.0"), "Need pytorch 2.6+") | ||
| class TestInt4WoqCpuTensor(TestCase): | ||
| @parametrize("group_size", [32, 64, 128]) | ||
| def test_linear(self, group_size): |
can you align with
to test more input shapes and also add a compile test as well
| For data locality, we preshuffle the data in plain layout (N, K/2) to (N/block_n, K, block_n/2), where block_n = 64. And when packing | ||
| the last dimension, data are shuffled by lanes before packing two int4 to one int8: | ||
| block_n = 64 = 16 * 4, so we have 4 lanes, each lane has 16 int4s = [lane0, lane1, lane2, lane3]. We pack them as [lane0|lane2, lane1|lane3]. | ||
| See https://github.com/pytorch/pytorch/blob/32eee8ed225d9f10fbbcb38c24b8b44c24c0c97c/aten/src/ATen/native/cpu/int4mm_kernel.cpp#L583 for more details. |
also if this is based on hardware at the quantization time, what do we do if users quantize the model in one CPU and want to run the model in another CPU?
Thanks for the question. The packing and compute kernels work together across different machines, but performance is not guaranteed.
we will discuss what to do with this autopacking stuff. It doesn't fit into the packing format abstraction, since a packing format is supposed to have a fixed layout. Will get back to you.
OK, sure.
BTW, how does CUDA handle data layout on different platforms? Does CUDA use the same layout on all platforms? Thanks.
packing is typically specific to the kernel I think. For int4, right now we have tensor core tiled packing (for the tinygemm kernel), preshuffled packing for the fbgemm preshuffled kernel, and another plain packing for the fbgemm non-preshuffled kernel. The main thing is that it's a fixed format and can be explained and understood in torchao.
I see. So how about requiring AVX512?
that sounds OK, although I saw you have other hardware situations as well, like AVX2 and non-vectorized, so we need one for each packing. Please check the Slack message, need your input on the refactor effort.
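As an illustration of the lane shuffle described in the docstring above (block_n = 64 split into four lanes of 16 int4 values, packed as [lane0|lane2, lane1|lane3]), here is a minimal pure-Python sketch. It is not the ATen kernel: the helper names are hypothetical, and placing lane0/lane1 in the low nibble and lane2/lane3 in the high nibble is an assumption made for the example, not something the thread confirms.

```python
BLOCK_N = 64  # block size named in the docstring
LANE = 16     # 64 = 16 * 4, so four lanes of 16 int4 values


def pack_block(vals):
    """Pack 64 int4 values (0..15) into 32 bytes as [lane0|lane2, lane1|lane3].

    Assumption: the first lane of each pair goes to the low nibble and the
    second to the high nibble. The real kernel may order nibbles differently.
    """
    assert len(vals) == BLOCK_N and all(0 <= v < 16 for v in vals)
    lane0, lane1, lane2, lane3 = (vals[i * LANE:(i + 1) * LANE] for i in range(4))
    first = [lo | (hi << 4) for lo, hi in zip(lane0, lane2)]
    second = [lo | (hi << 4) for lo, hi in zip(lane1, lane3)]
    return first + second


def unpack_block(packed):
    """Inverse of pack_block: recover the 64 int4 values in original order."""
    assert len(packed) == BLOCK_N // 2
    first, second = packed[:LANE], packed[LANE:]
    lane0 = [b & 0xF for b in first]
    lane2 = [b >> 4 for b in first]
    lane1 = [b & 0xF for b in second]
    lane3 = [b >> 4 for b in second]
    return lane0 + lane1 + lane2 + lane3
```

A round trip through `pack_block` and `unpack_block` returns the original values, which is the property a fixed, documented layout would need to guarantee on any machine.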
| # groupwise int4 quantization | ||
| groupsize = weight_tensor.block_size[1] | ||
| y = torch.ops.aten._weight_int4pack_mm_for_cpu( | ||
| act_mat.contiguous(), packed_weight, groupsize, scale_and_zero |
is this contiguous call needed?
I think so. We assume input is contiguous in the kernel.
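For readers unfamiliar with the `groupsize` argument in the snippet above, the following is a generic sketch of groupwise asymmetric int4 quantization in plain Python. It is not torchao's or ATen's implementation: the function names are hypothetical, and the min/max-based scale and offset are a textbook scheme chosen for illustration. The exact scale and zero-point convention used by the CPU kernel is not stated in this thread.

```python
def quantize_groupwise(row, group_size):
    """Quantize one weight row to int4 (0..15), one scale/min per group."""
    assert len(row) % group_size == 0
    qvals, scales, mins = [], [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15
        if scale == 0:
            scale = 1.0  # constant group: any scale works, avoid dividing by zero
        scales.append(scale)
        mins.append(lo)
        for w in group:
            q = round((w - lo) / scale)
            qvals.append(min(15, max(0, q)))  # clamp to the int4 range
    return qvals, scales, mins


def dequantize_groupwise(qvals, scales, mins, group_size):
    """Reconstruct approximate weights from int4 values and per-group params."""
    return [
        q * scales[i // group_size] + mins[i // group_size]
        for i, q in enumerate(qvals)
    ]
```

Smaller groups give each scale fewer values to cover, which usually lowers reconstruction error at the cost of storing more scales; that is the trade-off behind testing several group sizes such as 32, 64, and 128.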
| quantized_and_compiled = compiled_linear(input) | ||
| self.assertTrue(compute_error(original, quantized_and_compiled) > 20) | ||
|
|
||
| @parametrize("dtype", [torch.float, torch.bfloat16, torch.half]) |
nit: torch.float32, torch.bfloat16, torch.float16 will be clearer I feel
Thanks. Updated.
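The `compute_error(...) > 20` assertion in the test snippet above reads as a signal-to-quantization-noise threshold in decibels. A minimal sketch of such a metric in plain Python follows, assuming the common definition 20 * log10(||x|| / ||x - y||). The function name `sqnr_db` is hypothetical, and this is not a copy of the torchao helper.

```python
import math


def sqnr_db(reference, approx):
    """Signal-to-quantization-noise ratio in dB between two flat sequences."""
    signal = math.sqrt(sum(r * r for r in reference))
    noise = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, approx)))
    if noise == 0:
        return math.inf  # identical outputs: no quantization noise
    return 20 * math.log10(signal / noise)
```

Under this definition, a threshold of 20 dB corresponds to the error norm being at most one tenth of the reference norm.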
Hi @Xia-Weiwen can you use the Opaque packing format from #2878 for the tensor, since this does not have a fixed format?
Thanks. Shall I use it or add a new one? What if we have more opaque formats in the future?
just use the same one I think, we can add more Opaque formats if more are needed I feel
It's OK for me. However, the name sounds too general. Anyway, I will change the names to opaque, then please check if that looks good to you. Thanks.
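The distinction drawn in this thread, fixed layouts that can be documented versus an opaque layout that depends on the hardware at packing time, can be sketched as a string-valued enum. Everything below is hypothetical and for illustration only: the class name, member names, and helper are not torchao's actual `packing_format.py` contents.

```python
from enum import Enum


class PackingFormatSketch(str, Enum):
    """Hypothetical stand-in for a packing-format enum."""

    PLAIN = "plain"                        # fixed, documented layout
    PRESHUFFLED = "preshuffled"            # fixed, documented layout
    TENSOR_CORE_TILED = "tensor_core_tiled"  # fixed, documented layout
    OPAQUE = "opaque"                      # layout chosen by the kernel, not exposed


def has_fixed_layout(fmt):
    """Return True when the layout can be explained independently of hardware."""
    return fmt is not PackingFormatSketch.OPAQUE
```

Keeping a single opaque member, as suggested above, means callers can only ask whether a tensor's layout is inspectable, not which opaque variant it is; more members could be added later if that distinction becomes necessary.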
Hi @jerryzh168 Please review again. Thanks.
sorry, this should be Int4OpaqueTensor, can you update?
| This is an opaque tensor subclass, the packing format is not exposed to the rest of the system. See the note below for more details. | ||
|
|
||
| Tensor Attributes: | ||
| qdata: preshuffled and packed int4 weight for tinygemm, always viewed as a 2D (N, K/2) tensor, last dimension is packed |
also tinygemm is a gpu library I think, is this really related to tinygemm?
We are reusing the "tinygemm" name for CPU in torch core I think. I can change it to the following if it's ok to you:
qdata: preshuffled and packed int4 weight for CPU tinygemm, always viewed as a 2D (N, K/2) tensor, ...
Summary
This PR adds Int4OpaqueTensor to replace the AQT tensor with Int4CPULayout since AQT will be deprecated.
Test plan