Align Int4Tensor implementation details with the design of Float8Tensor #2687
jerryzh168 merged 1 commit into main from jerryzh168/stack/16
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2687
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ba62d8e with merge base c086ade.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary:
Int4Tensor is the non-preshuffled version of int4 quantized Tensor, data is [N, K/2], scale/zero_point has shape: [K/group_size, N]

Multiple fixes for Int4Tensor to align with the design of Float8Tensor (only calling fbgemm ops)
* Added VERSION 2 for Int4WeightOnlyConfig
* Migrated op implementation and tests from #2387

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_tensor.py

stack-info: PR: #2687, branch: jerryzh168/stack/16
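For illustration, a minimal sketch (not the torchao implementation) of the shapes described in the summary, assuming a weight of shape [N, K] and a group size that divides K:

```python
import torch

N, K, group_size = 256, 512, 128

# Packed int4 data: two 4-bit values per byte along K -> shape [N, K/2]
qdata = torch.zeros(N, K // 2, dtype=torch.uint8)

# One scale/zero_point per (group, output channel) -> shape [K/group_size, N]
scale = torch.ones(K // group_size, N, dtype=torch.bfloat16)
zero_point = torch.zeros(K // group_size, N, dtype=torch.bfloat16)

print(qdata.shape)       # torch.Size([256, 256])
print(scale.shape)       # torch.Size([4, 256])
print(zero_point.shape)  # torch.Size([4, 256])
```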
res = torch.ops.fbgemm.bf16i4bf16_rowwise(
    input_tensor,
-   weight_tensor._data.contiguous(),
+   weight_tensor.qdata.contiguous(),
is it expected that the tensors are not contiguous? if not, can we assert for this instead of calling contiguous?
the non-contiguity comes from reshape ops like transpose and view I think, but the kernel needs these to be contiguous. I can try changing these to asserts and doing the contiguous call on the user side to see if it works
I would have expected the weights to be stored in a format aligned with what the kernel needs, without any need for just-in-time layout transforms. Does this match how the current code works?
normally it is, but the weights also go through some transformations, like the ones we listed in test_moe_weight_reshape_ops, which make the weight / scale etc. non-contiguous I think. I can try calling contiguous in user code instead, that might be cleaner
turns out contiguous was not implemented properly, just fixed that, and we can remove the contiguous calls in linear/bmm now
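For illustration, a minimal sketch of the "assert instead of transform" pattern discussed here, using the qdata/scale/zero_point attribute names from this PR; the helper name is hypothetical, not part of torchao:

```python
def assert_packed_layout_contiguous(weight_tensor):
    # Hypothetical helper: surface non-contiguous qdata/scale/zero_point to
    # the caller (e.g. after transpose/view on the quantized weight) instead
    # of silently calling .contiguous() right before the fbgemm kernel.
    for name in ("qdata", "scale", "zero_point"):
        t = getattr(weight_tensor, name)
        assert t.is_contiguous(), (
            f"{name} must be contiguous before kernel dispatch, "
            f"got strides {t.stride()}"
        )
```

With a properly working contiguous implementation on the subclass, the caller can make the quantized weight contiguous after the reshape ops, and the linear/bmm dispatch no longer needs just-in-time layout transforms.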
@register_quantize_module_handler(TestOnlyMoEQuantConfig)
def moe_quant_fn(module, config: TestOnlyMoEQuantConfig):
this is really confusing, could you share the result of print(model) after this function has been applied?
if it's going to print model with parameters wrapped in Int4Tensor, can we just wrap the parameters directly without all of these layers of abstraction?
if this is working around the fact that quantize_ needs to work on modules, IMO we should change quantize_ to handle this instead of working around? seems important for MoEs.
yeah the parameters are wrapped in Int4Tensor, this is just applying quantization to each of the moe weights: w1, w2 and w3
I can inline these for now, and can follow up on how to have an API for weights + configs separately
probably not worth changing the API right now since MoE quant is also moving, let me know if the current code looks good
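For illustration, a minimal sketch of what wrapping the expert weights directly could look like for an experts module with w1/w2/w3 parameters; `to_int4` is a placeholder for the actual float-to-Int4Tensor conversion, not a real torchao function:

```python
import torch.nn as nn

def quantize_moe_expert_weights(module: nn.Module, to_int4) -> nn.Module:
    # Wrap each expert weight parameter in the quantized tensor directly,
    # without extra module-level indirection; `to_int4` stands in for the
    # actual float -> Int4Tensor conversion.
    for name in ("w1", "w2", "w3"):
        weight = getattr(module, name)
        quantized = to_int4(weight.detach())
        setattr(module, name, nn.Parameter(quantized, requires_grad=False))
    return module
```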
    return model, input_data


class Experts(nn.Module):
maybe call it something like FeedForwardWithExperts? Experts is ambiguous
this is adapted from https://github.com/meta-llama/llama-models/blob/a9c89c471f793423afd4cc3ca8671d6e56fe64cb/models/llama4/moe.py#L22, how about renaming to LLama4Experts to make it more specific
-   tensor_data_attrs = ["_data", "scale", "zero_point"]
-   tensor_attributes = ["block_size", "shape"]
+   tensor_data_names = ["qdata", "scale", "zero_point"]
+   tensor_attribute_names = ["block_size", "shape"]
tensor_attribute_names is a lil weird to me
are these attributes that are tensors and thus should go in the right unflatten location?
yeah.. this somewhat follows how the flatten and unflatten functions name these things; the "tensor" prefix refers to the tensor subclass instance, i.e. these are attributes of the tensor subclass instance, not "attributes that are tensors"
I could remove the tensor_ prefix as well to make it less confusing? probably better to do that in a separate PR
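For context, a rough sketch (not the TorchAOBaseTensor implementation) of how the two name lists can drive generic flatten/unflatten for a tensor subclass; the constructor argument order is an assumption here:

```python
import torch

class ExampleInt4Tensor(torch.Tensor):
    # Names of attributes that are themselves Tensors (flattened for
    # torch.compile / tracing), and names of plain attributes of the
    # tensor subclass instance that travel along as metadata.
    tensor_data_names = ["qdata", "scale", "zero_point"]
    tensor_attribute_names = ["block_size", "shape"]

    def __tensor_flatten__(self):
        return self.tensor_data_names, [
            getattr(self, name) for name in self.tensor_attribute_names
        ]

    @classmethod
    def __tensor_unflatten__(cls, tensor_data_dict, attributes, outer_size, outer_stride):
        # Assumes the constructor takes qdata, scale, zero_point, block_size, shape in order.
        return cls(
            *[tensor_data_dict[name] for name in cls.tensor_data_names],
            *attributes,
        )
```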
@unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
@unittest.skipIf(not is_sm_at_least_90(), "Nedd sm90+")
-class TestInt4Tensor(TestCase):
+class TestInt4Tensor(TorchAOIntegrationTestCase):
are we keeping both old and new? If we are keeping the old version working, I would expect this test case to not have any changes, as it would test the old version.
old version meaning the version using AQT? we are keeping AQT, but this test does not test the AQT path, it only tests the new Int4Tensor, and we are updating Int4Tensor in this PR
can we update the PR summary with context on this? Migrations are always confusing and clearly laying out what is changing with BC, what is breaking BC, and what is not changing will help get a good review.
OK added, no BC related changes in this PR
Stacked PRs:
* `optional_tensor_names` in TorchAOBaseTensor #2710
* Align Int4Tensor implementation details with the design of Float8Tensor (this PR)
Summary:
Int4Tensor is the non-preshuffled version of int4 quantized Tensor, data is [N, K/2], scale/zero_point has shape: [K/group_size, N]
Multiple fixes for Int4Tensor to align with the design of Float8Tensor (only calling fbgemm ops)
* defined `tensor_data_names` and `tensor_attribute_names` so we can remove some of the implementations from TorchAOBaseTensor
* Migrated op implementation and tests from #2387

Note: This is just refactoring Int4Tensor, no BC related changes in this PR.

The Int4Tensor path is exposed in version 2 of `Int4WeightOnlyConfig` (the default version is still 1, which uses the old AQT path).

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_tensor.py
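For illustration, a hedged usage sketch of opting into the new path; the exact keyword for selecting version 2 is an assumption (the summary only says "version 2 of Int4WeightOnlyConfig"), and group_size=128 is an arbitrary choice:

```python
import torch
import torch.nn as nn
from torchao.quantization import Int4WeightOnlyConfig, quantize_

model = nn.Sequential(nn.Linear(512, 256, dtype=torch.bfloat16, device="cuda"))

# version=2 (assumed argument name) selects the new Int4Tensor path; the
# default (version 1) keeps using the existing AQT path.
quantize_(model, Int4WeightOnlyConfig(group_size=128, version=2))

out = model(torch.randn(4, 512, dtype=torch.bfloat16, device="cuda"))
print(out.shape)  # torch.Size([4, 256])
```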