introduce new int8 quantization API #3241

Closed
namgyu-youn wants to merge 27 commits into pytorch:main from namgyu-youn:int8-quant-api

Conversation

@namgyu-youn (Contributor) commented Oct 24, 2025

Summary:
Introduce a new tensor subclass API. The main features are

  • Int8Tensor: Main API, which handles quantization and dequantization operations
  • Utility operation functions: Tensor slice, index selection

This API is integrated into the global config variants (Int8WeightOnlyConfig, Int8DynamicActivationInt8WeightConfig) behind a version flag (version=2) and is not enabled by default.
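
A minimal usage sketch (assuming the config names above and that the version flag lands as described in this PR; the model is illustrative):

```
import torch
from torchao.quantization import Int8WeightOnlyConfig, quantize_

# Toy module; version=2 opts into the new Int8Tensor subclass path
model = torch.nn.Sequential(torch.nn.Linear(128, 128)).to(torch.bfloat16).cuda()
quantize_(model, Int8WeightOnlyConfig(version=2))
```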

Related Issue/PR: #3038 (reland)

Test plan:
test/quantization/quantize_/workflows/int8/test_int8_tensor.py

Performance:
The following are the results of https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py with a batch size of 32:

| API | With torch.compile | Without torch.compile |
| --- | --- | --- |
| Old | 65.47 ms | 234.39 ms |
| New | 63.30 ms | 239.30 ms |

@pytorch-bot Bot commented Oct 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3241

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

The meta-cla Bot added the CLA Signed label on Oct 24, 2025.
Comment thread on test/quantization/quantize_/workflows/int8/test_int8_tensor.py (outdated):

```
@common_utils.parametrize("dtype", [torch.bfloat16, torch.float16])
def test_quantization_shapes(self, dtype):
```
@jerryzh168 (Contributor) commented Oct 24, 2025

This seems to be a combination of two tests, one for dynamic quant and one for static quant. Can you use something like this:

```
@common_utils.parametrize("mode", ["dynamic", "weight-only"])
```

Also, I feel it might be better not to add static quant in this PR, and instead add both the tensor support and the config support for static quant in a separate PR.
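
A hedged sketch of how the suggested split might look (the test body, config names, and shapes are assumptions layered on this PR, not the final test):

```
@common_utils.parametrize("mode", ["dynamic", "weight-only"])
@common_utils.parametrize("dtype", [torch.bfloat16, torch.float16])
def test_quantization_shapes(self, mode, dtype):
    # Pick the config variant under test from the parametrized mode
    config = (
        Int8DynamicActivationInt8WeightConfig(version=2)
        if mode == "dynamic"
        else Int8WeightOnlyConfig(version=2)
    )
    linear = torch.nn.Linear(128, 64, dtype=dtype, device="cuda")
    quantize_(linear, config)
    out = linear(torch.randn(2, 128, dtype=dtype, device="cuda"))
    self.assertEqual(out.shape, (2, 64))
```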

@namgyu-youn (Contributor, Author) replied:

Okay, I wasn't sure about removing the static flags before (although static quant isn't fully implemented), but a smaller PR is always better, I feel. I will remove static_scale and all related support.

```
if act_quant_kwargs is not None and act_quant_kwargs.static_scale is not None:
    # INT8 × INT8 (static)
    scale = act_quant_kwargs.static_scale
    zero_point = torch.zeros_like(scale, dtype=torch.int8)
```
@jerryzh168 (Contributor) commented Oct 24, 2025

I think the user should specify static_zero_point as well.

But again, it's better to do this in a separate PR, since the current state is only half of the static quant feature (no config support).
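
A sketch of the shape that suggestion might take (static_zero_point is a hypothetical field here, not something this PR defines):

```
if act_quant_kwargs is not None and act_quant_kwargs.static_scale is not None:
    scale = act_quant_kwargs.static_scale
    # Prefer a caller-provided zero point; fall back to all-zeros otherwise
    zero_point = getattr(act_quant_kwargs, "static_zero_point", None)
    if zero_point is None:
        zero_point = torch.zeros_like(scale, dtype=torch.int8)
```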

Comment thread on torchao/quantization/quantize_/workflows/int8/int8_tensor.py (outdated)
Comment thread on torchao/quantization/quantize_/workflows/int8/int8_tensor.py
@jerryzh168 (Contributor) left a comment:

I think we should

  1. split the static quant support into a separate PR
  2. follow what https://github.com/pytorch/ao/blob/main/torchao/dtypes/uintx/plain_layout.py is doing for the quantized linear implementation

This should be a refactor PR, not a refactor plus extra modifications plus feature implementations, I think.

Comment thread on torchao/float8/inference.py (outdated):

```
aten = torch.ops.aten

# Unsupported case for now, this would be 1 scale per data element
# Per-tensor quantization (scalar scale)
```
A contributor commented:

is this change related?

@namgyu-youn (Contributor, Author) commented Oct 31, 2025

It was updated to support more granularities. Without this change, we can't use per-tensor (0D scale) or per-row (1D scale).

(The above comment is incorrect and this change is unrelated; see #3241.)

A collaborator commented:

So maybe it's better to move this util function to a common place?

A contributor commented:

this can be moved to torchao/quantization/quantize_/common/utils.py I think

@namgyu-youn (Contributor, Author) replied:

Okay, then I will move this to torchao/quantization/quantize_/common/utils.py after this PR.

Two outdated comment threads on test/quantization/quantize_/workflows/int8/test_int8_tensor.py
@jerryzh168 (Contributor) left a comment:

Thanks, I think the tensor changes look good, but we need linear_variants tests to make sure we cover different aspects of things (e.g. compile); see comments inline.

Can you also do an e2e perf check with https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py to make sure the performance is the same before and after the change for the ViT model?

Also, adding a kernel check might be useful to make sure we don't regress things:

```
def test_expected_gpu_kernel_fbgemm(self):
```
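
A minimal sketch of such a kernel check (run_and_get_code is from torch._inductor.utils; the kernel-name substrings are assumptions, not the exact strings the final test asserts on):

```
from torch._inductor.utils import run_and_get_code

def test_expected_gpu_kernel(self):
    m = torch.nn.Sequential(
        torch.nn.Linear(128, 128, dtype=torch.bfloat16, device="cuda")
    )
    quantize_(m, Int8DynamicActivationInt8WeightConfig(version=2))
    m = torch.compile(m)
    x = torch.randn(32, 128, dtype=torch.bfloat16, device="cuda")
    _, code = run_and_get_code(m, x)
    # The dynamic int8 path should lower to an int8 matmul kernel
    self.assertTrue("triton" in code[0] or "_int_mm" in code[0])
```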

@namgyu-youn (Contributor, Author) commented Oct 31, 2025

Updated logs:

@Xia-Weiwen (Collaborator) commented Nov 3, 2025

Hi @namgyu-youn Do you plan to submit another PR for static quantization? We also need static quantization for SmoothQuant. So, we are wondering if you have a plan or we should consider adding it ourselves. Thanks. CC @cyxlily

@namgyu-youn (Contributor, Author) replied:

> Hi @namgyu-youn Do you plan to submit another PR for static quantization? We also need static quantization for SmoothQuant. So, we are wondering if you have a plan or we should consider adding it ourselves. Thanks. CC @cyxlily

Yeah, static quantization support using static/dynamic flags is planned; I hope to show it to your team in the foreseeable future.

Also, in the SmoothQuant case, validating its support for the new quantization APIs (below) has higher priority, I think. Could you look into it?

  • W4A16-INT: Int4WeightOnlyConfig(group_size=32, version=2)
  • W4A16-FP: Float8WeightOnlyConfig(version=2)
  • W8A8-FP-dynamic: Float8DynamicActivationFloat8WeightConfig(version=2)

@Xia-Weiwen (Collaborator) replied:

> Yeah, static quantization support using static/dynamic flags is planned; I hope to show it to your team in the foreseeable future.

Thanks. Looking forward to it. If there is anything we can help with, please let us know.

> Also, in the SmoothQuant case, validating its support for the new quantization APIs (below) has higher priority, I think. Could you look into it?
>
>   • W4A16-INT: Int4WeightOnlyConfig(group_size=32, version=2)
>   • W4A16-FP: Float8WeightOnlyConfig(version=2)
>   • W8A8-FP-dynamic: Float8DynamicActivationFloat8WeightConfig(version=2)

By "validating them", do you mean adding test cases? And are W4A16 and W8A16 (I guess there is a typo in your comment) really needed for SmoothQuant? For W4A16 , it would be much the same as AWQ. And for W8A16, I think accuracy is generally good enough without SmoothQuant.

@namgyu-youn (Contributor, Author) replied:

By "validating them", do you mean adding test cases? And are W4A16 and W8A16 (I guess there is a typo in your comment) really needed for SmoothQuant? For W4A16 , it would be much the same as AWQ. And for W8A16, I think accuracy is generally good enough without SmoothQuant.

Oh yes, it was a typo (W8A16 is right), and W4A16-INT (Int4WeightOnlyConfig(group_size=32, version=2)) is of interest. In my experience, and per https://arxiv.org/html/2411.02355v3, W4A16-INT is the most efficient choice for synchronous deployments, while W8A8-INT maximizes throughput in asynchronous settings.

Because the current AWQ/SmoothQuant test only works with the old APIs (version 1), we could replace it with new APIs like Int4WeightOnlyConfig(group_size=32, version=2), I guess.

@Xia-Weiwen (Collaborator) replied:

By "validating them", do you mean adding test cases? And are W4A16 and W8A16 (I guess there is a typo in your comment) really needed for SmoothQuant? For W4A16 , it would be much the same as AWQ. And for W8A16, I think accuracy is generally good enough without SmoothQuant.

Oh yes, it was a typo (W8A16 is right), and W4A16-INT (Int4WeightOnlyConfig(group_size=32, version=2)) is of interest. In my last experience and https://arxiv.org/html/2411.02355v3, W4A16-INT is the most efficient choice for synchronous deployments, while W8A8-INT maximize throughput in asynchronous settings.

Because current AWQ/SmoothQuant test is only working with old APIs (version 1), we can replace it with new APIs like Int4WeightOnlyConfig(group_size=32, version=2) I guess.

I see. Thanks. We will evaluate that.

@Xia-Weiwen (Collaborator) commented:

Hi @namgyu-youn May I know if you have a timeline to land this? Thanks.

Two outdated comment threads on test/quantization/quantize_/workflows/int8/test_int8_tensor.py:

```
kernels = {}

# Check for Triton kernels
if "torch.ops.triton" in code[0]:
```
A contributor commented:

should add some asserts I think
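
One possible shape for those asserts, assuming code comes from torch._inductor.utils.run_and_get_code as in the snippet above:

```
# Fail loudly when no Triton kernel appears in the compiled output,
# rather than silently recording an empty kernels dict
self.assertTrue(
    any("triton" in c for c in code),
    "expected at least one Triton kernel in the compiled code",
)
```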

@namgyu-youn (Contributor, Author) commented Nov 4, 2025

Updated logs:

```
self.assertEqual(weight2.qdata, dummy.weight.qdata.narrow(1, 0, slice_sizes[1]))

# Int8DynamicActivationInt8WeightConfig uses per-row (PerRow)
# Int8WeightOnlyConfig uses per-tensor (PerTensor)
```
@jerryzh168 (Contributor) commented Nov 6, 2025

It should be per-row, I think?

```
group_size = weight.shape[-1]
```

Comment thread on torchao/quantization/quant_api.py (outdated):

```
)
else:
    assert config.version == 2, f"Unexpected version: {config.version}"
    block_size = [weight.shape[0], weight.shape[1]]
```
@jerryzh168 (Contributor) commented Nov 6, 2025

why does this default to per tensor? I think it should follow the existing logic from L1376-1378?
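
For context, a hedged sketch of the distinction being raised, using block_size semantics as commonly used in torchao (illustrative, not the final fix):

```
# Per-tensor: one block covers the entire weight -> a single scale
block_size = [weight.shape[0], weight.shape[1]]
# Per-row: one block per output row -> one scale per row
block_size = [1, weight.shape[1]]
```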

@namgyu-youn (Contributor, Author) commented Nov 24, 2025

@jerryzh168 Added a few TODOs for unaddressed comments, and I will no longer work on this as we discussed at #3241 (comment); let me know if there is anything I can help with.

@jerryzh168 (Contributor) left a comment:

TODO:

  1. [before landing the PR] revert changes to float8/inference.py and implement the slicing logic in Int8Tensor itself

The following can be done after landing the PR, since v2 is not used yet:

  1. Fix the granularity typing to just use Granularity, and follow the float8 opaque tensor's handling of granularity, i.e. have a normalize-and-validate function (a hedged sketch follows this list); we can split out the normalize function in the future:

     def _normalize_and_check_granularity(

  2. Allow different input activation quantization settings (symmetric / asymmetric), as shown in:

     if act_mapping_type == MappingType.SYMMETRIC:
         input_quant_func = _int8_symm_per_token_reduced_range_quant
     else:
         input_quant_func = _int8_asymm_per_token_quant

     (can include the weight-only decode setting as well, but optional)

  3. [optional] Add back the

     def _choose_quant_func_and_quantize_tensor(

     changes and use it to quantize activations
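
A hedged sketch of what that normalize-and-validate helper might look like (only the name comes from the TODO above; the signature and body are assumptions):

```
from typing import Optional

from torchao.quantization.granularity import Granularity, PerRow, PerTensor

def _normalize_and_check_granularity(
    granularity: Optional[Granularity],
) -> Granularity:
    # Assumed behavior: default to per-row and reject unsupported variants
    if granularity is None:
        granularity = PerRow()
    if not isinstance(granularity, (PerTensor, PerRow)):
        raise ValueError(f"Unsupported granularity: {granularity}")
    return granularity
```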

@namgyu-youn (Contributor, Author) commented Nov 25, 2025

> TODO:
>
>   1. [before landing the PR] revert changes to float8/inference.py and implement the slicing logic in Int8Tensor itself

Resolved this TODO for an early land; please take a look.

@jerryzh168 (Contributor) left a comment:

looks good I think, thanks @namgyu-youn!

@jerryzh168 (Contributor) commented:

@namgyu-youn somehow the CI is not triggered, can you rebase again so we can trigger CI?

**Summary:** Add PerBlock to safe globals so users don't have
to do this themselves when they load config.json with PerBlock.

```
WeightsUnpickler error: Unsupported global: GLOBAL torchao.quantization.granularity.PerBlock was not an allowed global by default. Please use `torch.serialization.add_safe_globals([torchao.quantization.granularity.PerBlock])` or the `torch.serialization.safe_globals([torchao.quantization.granularity.PerBlock])` context manager to allowlist this global if you trust this class/function.
```

**Test Plan:**
```
python test/core/test_config.py -k test_granularity_serialization
```
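
For reference, the manual allowlisting the error message asks for (a sketch of the user-side workaround this commit removes the need for):

```
import torch
from torchao.quantization.granularity import PerBlock

# Allowlist PerBlock so weights-only loading of config.json succeeds
torch.serialization.add_safe_globals([PerBlock])
```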
@namgyu-youn (Contributor, Author) replied:

> @namgyu-youn somehow the CI is not triggered, can you rebase again so we can trigger CI?

Finished rebasing to trigger CI; please take a look.

Comment thread on torchao/quantization/quantize_/workflows/__init__.py (outdated)
Comment thread on torchao/float8/inference.py
@jerryzh168 (Contributor) commented:

Somehow I still don't see an option to run all the CI jobs. @namgyu-youn, do you mind closing this one and reopening another PR to see if it helps?

@namgyu-youn (Contributor, Author) commented Nov 26, 2025

> Somehow I still don't see an option to run all the CI jobs. @namgyu-youn, do you mind closing this one and reopening another PR to see if it helps?

Relanded at #3391, please take a look. Also, does adding a trigger (workflow_dispatch) to the CI make sense? https://github.com/namgyu-youn/ci_test/blob/main/.github/workflows/.test_dispatch.yml showed:

  1. With dispatch (a trigger button is added):

```
name: Test Workflow Dispatch

on:
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Workflow dispatch works!"
```

  2. Without dispatch (TorchAO; no trigger button):

```
name: Test CI for push

on:
  push:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "LGTM"
```

@jerryzh168 (Contributor) replied:

Thanks, but that workflow dispatch seems to be only a test; it doesn't look real.

@jerryzh168 (Contributor) commented:

OK I think it's some issue with our configs, let me check

@namgyu-youn deleted the int8-quant-api branch November 26, 2025 03:44
jcaip pushed a commit that referenced this pull request Dec 2, 2025
Summary:
Introduce a new tensor subclass API. The main features are

  • Int8Tensor: Main API, which handles quantization and dequantization operations
  • Utility operation functions: Tensor slice, index selection

This API is integrated into the global config variants (Int8WeightOnlyConfig, Int8DynamicActivationInt8WeightConfig) behind a version flag and is not enabled by default.

Related Issue/PR: #3241 (reland)

Test plan: pytest -sv test/quantization/quantize_/workflows/int8/test_int8_tensor.py

Perf test: https://github.com/pytorch/ao/blob/main/tutorials/quantize_vit/run_vit_b_quant.py with a batch size of 32:

| API | With torch.compile | Without torch.compile |
| --- | --- | --- |
| Old | 65.47 ms | 234.39 ms |
| New | 63.30 ms | 239.30 ms |

Future Plan: #3241 (review)
namgyu-youn added a commit to namgyu-youn/ao that referenced this pull request Dec 19, 2025

Labels: CLA Signed, enhancement

4 participants