introduce new int8 quantization API #3241

Closed
namgyu-youn wants to merge 27 commits into pytorch:main from namgyu-youn:int8-quant-api

Conversation

@namgyu-youn (Contributor) commented Oct 24, 2025

Summary:
Introduce a new tensor subclass API. The main features are

  • Int8Tensor: Main API, which handles quantization and dequantization operations
  • Utility operation functions: Tensor slice, index selection

This API is integrated into the global config variants (Int8WeightOnlyConfig, Int8DynamicActivationInt8WeightConfig) behind a version flag (version=2) and is not enabled by default.
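
A minimal usage sketch (assuming the config names above and that the version flag lands as described in this PR; the model is illustrative):

```
import torch
from torchao.quantization import Int8WeightOnlyConfig, quantize_

# Toy module; version=2 opts into the new Int8Tensor subclass path
model = torch.nn.Sequential(torch.nn.Linear(128, 128)).to(torch.bfloat16).cuda()
quantize_(model, Int8WeightOnlyConfig(version=2))
```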

Related Issue/PR: #3038 (reland)

Test plan:
test/quantization/quantize_/workflows/int8/test_int8_tensor.py

Performance:
The following are the results of https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py with a batch size of 32:

| API | With torch.compile | Without torch.compile |
| --- | --- | --- |
| Old | 65.47 ms | 234.39 ms |
| New | 63.30 ms | 239.30 ms |

@pytorch-bot Bot commented Oct 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3241

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

The meta-cla Bot added the CLA Signed label on Oct 24, 2025.
Comment thread on test/quantization/quantize_/workflows/int8/test_int8_tensor.py (outdated):

```
@common_utils.parametrize("dtype", [torch.bfloat16, torch.float16])
def test_quantization_shapes(self, dtype):
```
@jerryzh168 (Contributor) commented Oct 24, 2025

This seems to be a combination of two tests, one for dynamic quant and one for static quant. Can you use something like this:

```
@common_utils.parametrize("mode", ["dynamic", "weight-only"])
```

Also, I feel it might be better not to add static quant in this PR, and instead add both the tensor support and the config support for static quant in a separate PR.
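
A hedged sketch of how the suggested split might look (the test body, config names, and shapes are assumptions layered on this PR, not the final test):

```
@common_utils.parametrize("mode", ["dynamic", "weight-only"])
@common_utils.parametrize("dtype", [torch.bfloat16, torch.float16])
def test_quantization_shapes(self, mode, dtype):
    # Pick the config variant under test from the parametrized mode
    config = (
        Int8DynamicActivationInt8WeightConfig(version=2)
        if mode == "dynamic"
        else Int8WeightOnlyConfig(version=2)
    )
    linear = torch.nn.Linear(128, 64, dtype=dtype, device="cuda")
    quantize_(linear, config)
    out = linear(torch.randn(2, 128, dtype=dtype, device="cuda"))
    self.assertEqual(out.shape, (2, 64))
```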

@namgyu-youn (Contributor, Author) replied:

Okay, I wasn't sure about removing the static flags before (although static quant isn't fully implemented), but a smaller PR is always better, I feel. I will remove static_scale and all related support.

```
if act_quant_kwargs is not None and act_quant_kwargs.static_scale is not None:
    # INT8 × INT8 (static)
    scale = act_quant_kwargs.static_scale
    zero_point = torch.zeros_like(scale, dtype=torch.int8)
```
@jerryzh168 (Contributor) commented Oct 24, 2025

I think the user should specify static_zero_point as well.

But again, it's better to do this in a separate PR, since the current state is only half of the static quant feature (no config support).
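
A sketch of the shape that suggestion might take (static_zero_point is a hypothetical field here, not something this PR defines):

```
if act_quant_kwargs is not None and act_quant_kwargs.static_scale is not None:
    scale = act_quant_kwargs.static_scale
    # Prefer a caller-provided zero point; fall back to all-zeros otherwise
    zero_point = getattr(act_quant_kwargs, "static_zero_point", None)
    if zero_point is None:
        zero_point = torch.zeros_like(scale, dtype=torch.int8)
```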

Comment thread on torchao/quantization/quantize_/workflows/int8/int8_tensor.py (outdated)
Comment thread on torchao/quantization/quantize_/workflows/int8/int8_tensor.py
@jerryzh168 (Contributor) left a comment:

I think we should

  1. split the static quant support into a separate PR
  2. follow what https://github.com/pytorch/ao/blob/main/torchao/dtypes/uintx/plain_layout.py is doing for the quantized linear implementation

This should be a refactor PR, not a refactor plus extra modifications plus feature implementations, I think.

Comment thread on torchao/float8/inference.py (outdated):

```
aten = torch.ops.aten

# Unsupported case for now, this would be 1 scale per data element
# Per-tensor quantization (scalar scale)
```
A contributor commented:

is this change related?

@namgyu-youn (Contributor, Author) commented Oct 31, 2025

It was updated to support more granularities. Without this change, we can't use per-tensor (0D scale) or per-row (1D scale).

(The above comment is incorrect and this change is unrelated; see #3241.)

A collaborator commented:

So maybe it's better to move this util function to a common place?

A contributor commented:

this can be moved to torchao/quantization/quantize_/common/utils.py I think

@namgyu-youn (Contributor, Author) replied:

Okay, then I will move this to torchao/quantization/quantize_/common/utils.py after this PR.

Two outdated comment threads on test/quantization/quantize_/workflows/int8/test_int8_tensor.py
@jerryzh168 (Contributor) left a comment:

Thanks, I think the tensor changes look good, but we need linear_variants tests to make sure we cover different aspects of things (e.g. compile); see comments inline.

Can you also do an e2e perf check with https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py to make sure the performance is the same before and after the change for the ViT model?

Also, adding a kernel check might be useful to make sure we don't regress things:

```
def test_expected_gpu_kernel_fbgemm(self):
```
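
A minimal sketch of such a kernel check (run_and_get_code is from torch._inductor.utils; the kernel-name substrings are assumptions, not the exact strings the final test asserts on):

```
from torch._inductor.utils import run_and_get_code

def test_expected_gpu_kernel(self):
    m = torch.nn.Sequential(
        torch.nn.Linear(128, 128, dtype=torch.bfloat16, device="cuda")
    )
    quantize_(m, Int8DynamicActivationInt8WeightConfig(version=2))
    m = torch.compile(m)
    x = torch.randn(32, 128, dtype=torch.bfloat16, device="cuda")
    _, code = run_and_get_code(m, x)
    # The dynamic int8 path should lower to an int8 matmul kernel
    self.assertTrue("triton" in code[0] or "_int_mm" in code[0])
```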

@namgyu-youn (Contributor, Author) commented Oct 31, 2025

Updated logs:

@Xia-Weiwen (Collaborator) commented Nov 3, 2025

Hi @namgyu-youn Do you plan to submit another PR for static quantization? We also need static quantization for SmoothQuant. So, we are wondering if you have a plan or we should consider adding it ourselves. Thanks. CC @cyxlily

@namgyu-youn (Contributor, Author) replied:

> Hi @namgyu-youn Do you plan to submit another PR for static quantization? We also need static quantization for SmoothQuant. So, we are wondering if you have a plan or we should consider adding it ourselves. Thanks. CC @cyxlily

Yeah, static quantization support using static/dynamic flags is planned; I hope to show it to your team in the foreseeable future.

Also, in the SmoothQuant case, validating its support for the new quantization APIs (below) has higher priority, I think. Could you look into it?

  • W4A16-INT: Int4WeightOnlyConfig(group_size=32, version=2)
  • W4A16-FP: Float8WeightOnlyConfig(version=2)
  • W8A8-FP-dynamic: Float8DynamicActivationFloat8WeightConfig(version=2)

@Xia-Weiwen (Collaborator) replied:

> Yeah, static quantization support using static/dynamic flags is planned; I hope to show it to your team in the foreseeable future.

Thanks. Looking forward to it. If there is anything we can help with, please let us know.

> Also, in the SmoothQuant case, validating its support for the new quantization APIs (below) has higher priority, I think. Could you look into it?
>
>   • W4A16-INT: Int4WeightOnlyConfig(group_size=32, version=2)
>   • W4A16-FP: Float8WeightOnlyConfig(version=2)
>   • W8A8-FP-dynamic: Float8DynamicActivationFloat8WeightConfig(version=2)

By "validating them", do you mean adding test cases? And are W4A16 and W8A16 (I guess there is a typo in your comment) really needed for SmoothQuant? For W4A16 , it would be much the same as AWQ. And for W8A16, I think accuracy is generally good enough without SmoothQuant.

@namgyu-youn (Contributor, Author) replied:

By "validating them", do you mean adding test cases? And are W4A16 and W8A16 (I guess there is a typo in your comment) really needed for SmoothQuant? For W4A16 , it would be much the same as AWQ. And for W8A16, I think accuracy is generally good enough without SmoothQuant.

Oh yes, it was a typo (W8A16 is right), and W4A16-INT (Int4WeightOnlyConfig(group_size=32, version=2)) is of interest. In my experience, and per https://arxiv.org/html/2411.02355v3, W4A16-INT is the most efficient choice for synchronous deployments, while W8A8-INT maximizes throughput in asynchronous settings.

Because the current AWQ/SmoothQuant test only works with the old APIs (version 1), we could replace it with new APIs like Int4WeightOnlyConfig(group_size=32, version=2), I guess.

@Xia-Weiwen (Collaborator) replied:

By "validating them", do you mean adding test cases? And are W4A16 and W8A16 (I guess there is a typo in your comment) really needed for SmoothQuant? For W4A16 , it would be much the same as AWQ. And for W8A16, I think accuracy is generally good enough without SmoothQuant.

Oh yes, it was a typo (W8A16 is right), and W4A16-INT (Int4WeightOnlyConfig(group_size=32, version=2)) is of interest. In my last experience and https://arxiv.org/html/2411.02355v3, W4A16-INT is the most efficient choice for synchronous deployments, while W8A8-INT maximize throughput in asynchronous settings.

Because current AWQ/SmoothQuant test is only working with old APIs (version 1), we can replace it with new APIs like Int4WeightOnlyConfig(group_size=32, version=2) I guess.

I see. Thanks. We will evaluate that.

@Xia-Weiwen (Collaborator) commented:

Hi @namgyu-youn May I know if you have a timeline to land this? Thanks.

Two outdated comment threads on test/quantization/quantize_/workflows/int8/test_int8_tensor.py:

```
kernels = {}

# Check for Triton kernels
if "torch.ops.triton" in code[0]:
```
A contributor commented:

should add some asserts I think
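
One possible shape for those asserts, assuming code comes from torch._inductor.utils.run_and_get_code as in the snippet above:

```
# Fail loudly when no Triton kernel appears in the compiled output,
# rather than silently recording an empty kernels dict
self.assertTrue(
    any("triton" in c for c in code),
    "expected at least one Triton kernel in the compiled code",
)
```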

@namgyu-youn (Contributor, Author) commented Nov 4, 2025

Updated logs:

```
self.assertEqual(weight2.qdata, dummy.weight.qdata.narrow(1, 0, slice_sizes[1]))

# Int8DynamicActivationInt8WeightConfig uses per-row (PerRow)
# Int8WeightOnlyConfig uses per-tensor (PerTensor)
```
@jerryzh168 (Contributor) commented Nov 6, 2025

It should be per-row, I think?

```
group_size = weight.shape[-1]
```

Comment thread on torchao/quantization/quant_api.py (outdated):

```
)
else:
    assert config.version == 2, f"Unexpected version: {config.version}"
    block_size = [weight.shape[0], weight.shape[1]]
```
@jerryzh168 (Contributor) commented Nov 6, 2025

why does this default to per tensor? I think it should follow the existing logic from L1376-1378?
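
For context, a hedged sketch of the distinction being raised, using block_size semantics as commonly used in torchao (illustrative, not the final fix):

```
# Per-tensor: one block covers the entire weight -> a single scale
block_size = [weight.shape[0], weight.shape[1]]
# Per-row: one block per output row -> one scale per row
block_size = [1, weight.shape[1]]
```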

@namgyu-youn (Contributor, Author) commented Nov 24, 2025

@jerryzh168 Added a few TODOs for unaddressed comments, and I will no longer work on this as we discussed at #3241 (comment); let me know if there is anything I can help with.

@jerryzh168 (Contributor) left a comment:

TODO:

  1. [before landing the PR] revert changes to float8/inference.py and implement the slicing logic in Int8Tensor itself

The following can be done after landing the PR, since v2 is not used yet:

  1. Fix the granularity typing to just use Granularity, and follow the float8 opaque tensor's handling of granularity, i.e. have a normalize-and-validate function (a hedged sketch follows this list); we can split out the normalize function in the future:

     def _normalize_and_check_granularity(

  2. Allow different input activation quantization settings (symmetric / asymmetric), as shown in:

     if act_mapping_type == MappingType.SYMMETRIC:
         input_quant_func = _int8_symm_per_token_reduced_range_quant
     else:
         input_quant_func = _int8_asymm_per_token_quant

     (can include the weight-only decode setting as well, but optional)

  3. [optional] Add back the

     def _choose_quant_func_and_quantize_tensor(

     changes and use it to quantize activations
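
A hedged sketch of what that normalize-and-validate helper might look like (only the name comes from the TODO above; the signature and body are assumptions):

```
from typing import Optional

from torchao.quantization.granularity import Granularity, PerRow, PerTensor

def _normalize_and_check_granularity(
    granularity: Optional[Granularity],
) -> Granularity:
    # Assumed behavior: default to per-row and reject unsupported variants
    if granularity is None:
        granularity = PerRow()
    if not isinstance(granularity, (PerTensor, PerRow)):
        raise ValueError(f"Unsupported granularity: {granularity}")
    return granularity
```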

@namgyu-youn (Contributor, Author) commented Nov 25, 2025

> TODO:
>
>   1. [before landing the PR] revert changes to float8/inference.py and implement the slicing logic in Int8Tensor itself

Resolved this TODO for an early land; please take a look.

@jerryzh168 (Contributor) left a comment:

looks good I think, thanks @namgyu-youn!

@jerryzh168 (Contributor) commented:

@namgyu-youn somehow the CI is not triggered, can you rebase again so we can trigger CI?

**Summary:** Add PerBlock to safe globals so users don't have
to do this themselves when they load config.json with PerBlock.

```
WeightsUnpickler error: Unsupported global: GLOBAL torchao.quantization.granularity.PerBlock was not an allowed global by default. Please use `torch.serialization.add_safe_globals([torchao.quantization.granularity.PerBlock])` or the `torch.serialization.safe_globals([torchao.quantization.granularity.PerBlock])` context manager to allowlist this global if you trust this class/function.
```

**Test Plan:**
```
python test/core/test_config.py -k test_granularity_serialization
```
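
For reference, the manual allowlisting the error message asks for (a sketch of the user-side workaround this commit removes the need for):

```
import torch
from torchao.quantization.granularity import PerBlock

# Allowlist PerBlock so weights-only loading of config.json succeeds
torch.serialization.add_safe_globals([PerBlock])
```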
@namgyu-youn (Contributor, Author) replied:

> @namgyu-youn somehow the CI is not triggered, can you rebase again so we can trigger CI?

Finished rebasing to trigger CI; please take a look.

Comment thread on torchao/quantization/quantize_/workflows/__init__.py (outdated)
Comment thread on torchao/float8/inference.py
@jerryzh168 (Contributor) commented:

Somehow I still don't see an option to run all the CI jobs. @namgyu-youn, do you mind closing this one and reopening another PR to see if it helps?

@namgyu-youn (Contributor, Author) commented Nov 26, 2025

> Somehow I still don't see an option to run all the CI jobs. @namgyu-youn, do you mind closing this one and reopening another PR to see if it helps?

Relanded at #3391, please take a look. Also, does adding a trigger (workflow_dispatch) to the CI make sense? https://github.com/namgyu-youn/ci_test/blob/main/.github/workflows/.test_dispatch.yml showed:

  1. With dispatch (a trigger button is added):

```
name: Test Workflow Dispatch

on:
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Workflow dispatch works!"
```

  2. Without dispatch (TorchAO; no trigger button):

```
name: Test CI for push

on:
  push:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "LGTM"
```

@jerryzh168 (Contributor) replied:

Thanks, but that workflow dispatch seems to be only a test; it doesn't look real.

@jerryzh168 (Contributor) commented:

OK I think it's some issue with our configs, let me check

@namgyu-youn deleted the int8-quant-api branch November 26, 2025 03:44
jcaip pushed a commit that referenced this pull request Dec 2, 2025
Summary:
Introduce a new tensor subclass API. The main features are

  • Int8Tensor: Main API, which handles quantization and dequantization operations
  • Utility operation functions: Tensor slice, index selection

This API is integrated into the global config variants (Int8WeightOnlyConfig, Int8DynamicActivationInt8WeightConfig) behind a version flag and is not enabled by default.

Related Issue/PR: #3241 (reland)

Test plan: pytest -sv test/quantization/quantize_/workflows/int8/test_int8_tensor.py

Perf test: https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py with a batch size of 32:

| API | With torch.compile | Without torch.compile |
| --- | --- | --- |
| Old | 65.47 ms | 234.39 ms |
| New | 63.30 ms | 239.30 ms |

Future Plan: #3241 (review)
namgyu-youn added a commit to namgyu-youn/ao that referenced this pull request Dec 19, 2025

Labels: CLA Signed, enhancement

4 participants