introduce new int8 quantization API #3241
Conversation
```python
)


@common_utils.parametrize("dtype", [torch.bfloat16, torch.float16])
def test_quantization_shapes(self, dtype):
```
This seems to be a combination of two tests, one for dynamic quant and one for static quant; can you use something like this:
Also, I feel it might be better not to add static quant in this PR, and instead add both the tensor support and the config support for static quant in a separate PR.
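The reviewer's actual snippet is elided above; as a purely illustrative sketch (my names, not the reviewer's), splitting the combined test could look like parametrizing over the quant mode:

```python
import torch
from torch.testing._internal import common_utils

# Illustrative sketch only: parametrize over the quantization mode so the
# dynamic and static paths become separate test cases instead of one test.
class TestInt8Quantization(common_utils.TestCase):
    @common_utils.parametrize("dtype", [torch.bfloat16, torch.float16])
    @common_utils.parametrize("mode", ["dynamic", "static"])
    def test_quantization_shapes(self, dtype, mode):
        linear = torch.nn.Linear(128, 256, dtype=dtype)
        # build the dynamic- or static-quant config for `mode` here and
        # assert the expected qdata/scale shapes
        ...


common_utils.instantiate_parametrized_tests(TestInt8Quantization)
```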
Okay, I wasn't sure about removing the static flags before (although they're not fully implemented), but a smaller PR is always better, I feel. I will remove static_scale and all the related support.
```python
if act_quant_kwargs is not None and act_quant_kwargs.static_scale is not None:
    # INT8 × INT8 (static)
    scale = act_quant_kwargs.static_scale
    zero_point = torch.zeros_like(scale, dtype=torch.int8)
```
I think the user should specify static_zero_point as well. But again, it's better to do this in a separate PR, since the current state is only half of the static quant feature (no config).
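A minimal sketch of that direction, assuming a hypothetical `static_zero_point` field on `act_quant_kwargs` (this mirrors the quoted hunk above, not the PR's actual code):

```python
import torch

# Hypothetical sketch: honor a user-supplied static_zero_point and only
# fall back to all-zeros (symmetric quant) when none is given.
if act_quant_kwargs is not None and act_quant_kwargs.static_scale is not None:
    scale = act_quant_kwargs.static_scale
    zero_point = getattr(act_quant_kwargs, "static_zero_point", None)
    if zero_point is None:
        zero_point = torch.zeros_like(scale, dtype=torch.int8)
```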
jerryzh168 left a comment:
I think we should:
- split the static quant support into a separate PR
- follow what https://github.com/pytorch/ao/blob/main/torchao/dtypes/uintx/plain_layout.py is doing for the quantized linear implementation

This should be a refactor PR, not a refactor + some extra modifications + some feature implementations, I think.
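For reference, the plain-layout int8 weight-only linear boils down to upcasting the int8 weight to the activation dtype, doing the matmul, and rescaling; a loose sketch with illustrative names, not the file's actual code:

```python
import torch
import torch.nn.functional as F

# Loose sketch of the plain-layout style int8 weight-only linear:
# upcast the int8 weight to the activation dtype, matmul, then rescale
# by the per-row weight scales.
def int8_weight_only_linear(
    x: torch.Tensor,        # [..., K] activation (bf16/fp16/fp32)
    w_int8: torch.Tensor,   # [N, K] quantized weight
    w_scale: torch.Tensor,  # [N] per-row scales
    bias: torch.Tensor = None,
) -> torch.Tensor:
    y = F.linear(x, w_int8.to(x.dtype)) * w_scale
    return y + bias if bias is not None else y
```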
```python
aten = torch.ops.aten

# Unsupported case for now, this would be 1 scale per data element
# Per-tensor quantization (scalar scale)
```
is this change related?
It is updated to support more granularities. Without this change, we can't use per-tensor (0-D scale) and per-row (1-D scale).
The above comment is incorrect and this change is unrelated; #3241
So maybe it's better to move this util function to a common place?
this can be moved to torchao/quantization/quantize_/common/utils.py I think
Okay, then I will move this to torchao/quantization/quantize_/common/utils.py after this PR.
Thanks, I think the tensor changes look good, but we need to add linear_variants tests to make sure we cover different aspects of things (e.g. compile); see comments inline.
Can you also do an e2e perf check with https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py to make sure the performance is the same before and after the change for the ViT model?
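A sketch of such a before/after check, assuming `torchao.utils.benchmark_model` (which the ViT tutorial uses) and a stand-in model:

```python
import torch
from torchao.utils import benchmark_model

# Sketch: time the same (stand-in) model before/after the change; the real
# check would load the ViT-B model exactly as run_vit_b_quant.py does.
model = torch.nn.Linear(768, 768).cuda().eval()
inputs = (torch.randn(32, 197, 768, device="cuda"),)
model = torch.compile(model, mode="max-autotune")
benchmark_model(model, 20, inputs)                        # warmup
print(f"elapsed: {benchmark_model(model, 100, inputs):.2f} ms")
```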
Also, adding a kernel check might be useful to make sure we don't regress things:
Updated logs:
Hi @namgyu-youn, do you plan to submit another PR for static quantization? We also need static quantization for SmoothQuant, so we are wondering if you have a plan or whether we should consider adding it ourselves. Thanks. CC @cyxlily
Yeah, static quantization support using static/dynamic flags is planned; I hope to show it to your team in the foreseeable future. Also, in the SmoothQuant case, validating its support for the new quantization APIs (below) has higher priority, I think. Could you look into it?
Thanks. Looking forward to it. If there is anything we can help with, please let us know.
By "validating them", do you mean adding test cases? And are W4A16 and W8A16 (I guess there is a typo in your comment) really needed for SmoothQuant? For W4A16 , it would be much the same as AWQ. And for W8A16, I think accuracy is generally good enough without SmoothQuant. |
Oh yes, it was a typo (W8A16 is right), and W4A16-INT. Because the current AWQ/SmoothQuant test only works with the old APIs (version 1), we can replace it with the new APIs.
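As a sketch of what replacing the old-API test setup could look like (the `version` flag reflects this PR's description and is an assumption, not settled API):

```python
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig

# Hypothetical: opt a module into the new Int8Tensor-based implementation
# via the version flag this PR introduces (version 1 stays the default).
model = torch.nn.Sequential(torch.nn.Linear(512, 256)).to(torch.bfloat16)
quantize_(model, Int8DynamicActivationInt8WeightConfig(version=2))
```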
I see. Thanks. We will evaluate that.

Hi @namgyu-youn May I know if you have a timeline to land this? Thanks.
```python
kernels = {}

# Check for Triton kernels
if "torch.ops.triton" in code[0]:
```
should add some asserts I think
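A minimal sketch of the kind of assert being suggested, using Inductor's `run_and_get_code` helper (the kernel-name substrings are assumptions):

```python
import torch
from torch._inductor.utils import run_and_get_code


def assert_int8_kernel(quantized_model: torch.nn.Module, x: torch.Tensor) -> None:
    """Compile, capture the Inductor-generated code, and assert an int8
    matmul kernel (e.g. _int_mm or a Triton kernel) actually shows up."""
    _, code = run_and_get_code(torch.compile(quantized_model), x)
    assert any("_int_mm" in c or "triton" in c for c in code), "no int8 kernel found"
```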
Updated logs:
```python
self.assertEqual(weight2.qdata, dummy.weight.qdata.narrow(1, 0, slice_sizes[1]))

# Int8DynamicActivationInt8WeightConfig uses per-row (PerRow)
# Int8WeightOnlyConfig uses per-tensor (PerTensor)
```
it should be per row I think? See ao/torchao/quantization/quant_api.py, line 1343 at 6815e57.
```python
)
else:
    assert config.version == 2, f"Unexpected version: {config.version}"
    block_size = [weight.shape[0], weight.shape[1]]
```
why does this default to per tensor? I think it should follow the existing logic from L1376-1378?
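A sketch of the suggested fix, assuming the granularity types from torchao.quantization.granularity (names here are illustrative):

```python
import torch
from torchao.quantization.granularity import Granularity, PerRow

# Hypothetical sketch: derive block_size from the configured granularity
# instead of hard-coding per-tensor in the version-2 branch.
def _infer_block_size(weight: torch.Tensor, granularity: Granularity) -> list[int]:
    if isinstance(granularity, PerRow):
        return [1, weight.shape[1]]                # one scale per output row
    return [weight.shape[0], weight.shape[1]]      # per-tensor fallback
```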
Force-pushed from 2be7344 to 3441751.
@jerryzh168 Added a few TODOs for the unaddressed comments, and I will no longer work on this, as we discussed at #3241 (comment); let me know if there is anything I can help with.
TODO:
- [before landing the PR] revert changes to float8/inference.py; implement the slicing logic in the Int8Tensor itself

The following can be done after landing the PR, since v2 is not used yet:
- fix granularity typing to just use Granularity; follow the float8 opaque tensor's handling of granularity, i.e. have a normalize_and_validate function (see the sketch after this list); we can split out the normalize function in the future
- allow different input activation quantization settings (symmetric / asymmetric) as shown in ao/torchao/quantization/quant_api.py, lines 1566 to 1569 at 3ad4d0a (can include the weight-only decode setting as well, but optional)
- [optional] add back the changes and use it to quantize activation
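A rough sketch of the normalize-and-validate helper from the second bullet (all names assumed, loosely following the float8 pattern):

```python
from torchao.quantization.granularity import Granularity, PerRow, PerTensor

# Assumed sketch of a normalize_and_validate helper, loosely modeled on how
# the float8 opaque tensor handles granularity.
def _normalize_and_validate_granularity(granularity) -> Granularity:
    if granularity is None:
        return PerRow()  # assumed default for int8 dynamic activation quant
    if not isinstance(granularity, (PerRow, PerTensor)):
        raise ValueError(f"Unsupported granularity for Int8Tensor: {granularity}")
    return granularity
```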
Resolved the TODOs for an early land, please take a look.
jerryzh168 left a comment:
looks good I think, thanks @namgyu-youn!
@namgyu-youn somehow the CI is not triggered, can you rebase again so we can trigger CI?
**Summary:** Add PerBlock to safe globals so users don't have to do this themselves when they load config.json with PerBlock.

```
WeightsUnpickler error: Unsupported global: GLOBAL torchao.quantization.granularity.PerBlock was not an allowed global by default. Please use `torch.serialization.add_safe_globals([torchao.quantization.granularity.PerBlock])` or the `torch.serialization.safe_globals([torchao.quantization.granularity.PerBlock])` context manager to allowlist this global if you trust this class/function.
```

**Test Plan:**

```
python test/core/test_config.py -k test_granularity_serialization
```
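For context, this is the manual allowlisting users needed before the change:

```python
import torch
from torchao.quantization.granularity import PerBlock

# Workaround required before PerBlock was added to the default safe globals:
# allowlist it explicitly for weights-only loading.
torch.serialization.add_safe_globals([PerBlock])
state = torch.load("checkpoint.pt", weights_only=True)  # path is illustrative
```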
Force-pushed from b85164f to 9906ff5.
Finished rebase to trigger CI, please take a look.
Somehow I still didn't see an option to run all the CI jobs. @namgyu-youn do you mind closing this one and reopening another PR to see if it helps?
Reland at #3391, please take a look. Also, does adding a trigger (workflow_dispatch) help? For example:

```yaml
name: Test Workflow Dispatch
on:
  workflow_dispatch:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Workflow dispatch works!"
```

```yaml
name: Test CI for push
on:
  push:
    branches:
      - main
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "LGTM"
```
Thanks, that workflow dispatch seems like only a test; it doesn't look real.
OK, I think it's some issue with our configs, let me check.
Summary: Introduce a new tensor subclass API. The main features are:
- Int8Tensor: main API, which handles quantization and dequantization operations
- Utility operation functions: tensor slice, index selection

This API is integrated into the global variants (Int8WeightOnlyConfig, Int8DynamicActivationInt8WeightConfig) using version, and is not defined as the default.

Related Issue/PR: #3241 (reland)

Test plan: pytest -sv test/quantization/quantize_/workflows/int8/test_int8_tensor.py

PERF Test: https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py with a batch size of 32:

| API | With torch.compile | Without torch.compile |
|-----|--------------------|------------------------|
| Old | 65.47 ms | 234.39 ms |
| New | 63.30 ms | 239.30 ms |

Future Plan: #3241 (review)


Summary:

Introduce a new tensor subclass API. The main feature is Int8Tensor: the main API, which handles quantization and dequantization operations. This API is integrated into the global variants (Int8WeightOnlyConfig, Int8DynamicActivationInt8WeightConfig) using version, and is not defined as the default.

Related Issue/PR: #3038 (reland)

Test plan: test/quantization/quantize_/workflows/int8/test_int8_tensor.py

Performance:

The following are the results of https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py with a batch size of 32:

| API | With torch.compile | Without torch.compile |
|-----|--------------------|------------------------|
| Old | 65.47 ms | 234.39 ms |
| New | 63.30 ms | 239.30 ms |
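A usage sketch of the new path described above (the `version` argument reflects this PR's description; it is opt-in, not the default):

```python
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# Sketch: quantize a toy model with the new Int8Tensor-based weight-only
# config (version=2 per the PR description; version 1 remains the default).
model = torch.nn.Sequential(torch.nn.Linear(128, 256)).to(torch.bfloat16)
quantize_(model, Int8WeightOnlyConfig(version=2))
x = torch.randn(32, 128, dtype=torch.bfloat16)
out = model(x)
```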