Enable pt2e quantization path for arm by choudhary-devang · Pull Request #146690 · pytorch/pytorch

choudhary-devang · 2025-02-07T09:23:48Z

Title: Enable PyTorch 2 Export Quantization path for ARM CPUs.

Description:

This PR extends the PyTorch 2 Export Quantization (PT2E Quantization) Inductor path, previously available only on x86 CPUs, to ARM platforms. PT2E Quantization is an automated, full-graph quantization solution in PyTorch that improves on Eager Mode Quantization by adding support for functionals and automating the overall process. It is part of the torch.ao module, and the quantized models it produces can be compiled with torch.compile (the Inductor backend).

Key Changes:

Introduces ARM-specific support by leveraging oneDNN kernels for matmuls and convolution.
Adds pre-defined configurations so that the quantization settings are chosen automatically based on the selected quantization method.

Provides customization options via two flags:

qat_state: Selects Quantization Aware Training (if True) or Post Training Quantization (if False). Defaults to False.
dynamic_state: Selects dynamic quantization (if True) or static quantization (if False). Also defaults to False.

These options allow users to tailor the quantization process for their specific workload requirements (e.g., using QAT for fine-tuning or PTQ for calibration-based quantization).
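A minimal, torch-free sketch of how two boolean flags can map onto four pre-defined configurations. The helper select_arm_config and the ArmQuantConfigSketch fields are hypothetical and are not the PR's API (the example script below calls get_default_arm_inductor_quantization_config instead). The int8-activation and per-tensor-weight choices follow the Arm configuration @fadara01 describes later in this thread.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ArmQuantConfigSketch:
    """Hypothetical summary of a quantization config (not a real PyTorch class)."""
    workflow: str            # "QAT" or "PTQ"
    activation_qparams: str  # "dynamic" (computed at runtime) or "static" (from calibration)
    activation_dtype: str    # signed int8 activations on Arm
    weight_granularity: str  # per-tensor weights on Arm


def select_arm_config(qat_state: bool = False, dynamic_state: bool = False) -> ArmQuantConfigSketch:
    """Hypothetical selector: both flags default to False, i.e. static PTQ."""
    return ArmQuantConfigSketch(
        workflow="QAT" if qat_state else "PTQ",
        activation_qparams="dynamic" if dynamic_state else "static",
        activation_dtype="int8",
        weight_granularity="per_tensor",
    )


if __name__ == "__main__":
    # Enumerate every flag combination, mirroring an "all combinations" test matrix
    for qat, dyn in product([False, True], repeat=2):
        cfg = select_arm_config(qat_state=qat, dynamic_state=dyn)
        print(f"qat_state={qat!s:5} dynamic_state={dyn!s:5} -> {cfg.workflow}, {cfg.activation_qparams}")
```

Keeping the config frozen makes each flag combination map to one immutable, comparable value, which is convenient when parametrizing tests over all four combinations.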

Testing and Validation:

The new ARM flow has been tested across a range of models with all four combinations of these two flags:
NLP: Models such as BERT and T5.
Vision: Models like ResNet and ViT.
Custom models: user-defined models with various operators.

Example script:

import torch
import torchvision.models as models
import torch._inductor.config as inductor_config
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
import torch.ao.quantization.quantizer.arm_inductor_quantizer as armiq
from torch.profiler import profile, record_function, ProfilerActivity

model_name = "resnet50"
# weights="DEFAULT" replaces the deprecated pretrained=True argument
model = models.__dict__[model_name](weights="DEFAULT")

# Set the model to eval mode
model = model.eval()

# Create the data, using dummy data here as an example
traced_bs = 500
x = torch.randn(traced_bs, 3, 224, 224).contiguous(memory_format=torch.channels_last)
example_inputs = (x,)

with torch.no_grad():
    # Export the full graph and annotate it with the default ARM static-quantization config
    exported_model = torch.export.export_for_training(model, example_inputs).module()
    quantizer = armiq.ArmInductorQuantizer()
    quantizer.set_global(armiq.get_default_arm_inductor_quantization_config(is_dynamic=False))
    prepared_model = prepare_pt2e(exported_model, quantizer)

    # Static PTQ needs a calibration pass before convert; use representative data in practice
    prepared_model(*example_inputs)
    converted_model = convert_pt2e(prepared_model)

    # Lower the quantized graph through Inductor so the oneDNN kernels are used
    inductor_config.freezing = True
    optimized_model = torch.compile(converted_model)

    for _ in range(50):
        optimized_model(*example_inputs)  # Warmup (the first call also triggers compilation)
    print("Warmup over")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                optimized_model(*example_inputs)

    print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01

pytorch-bot · 2025-02-07T09:23:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146690

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 36325da with merge base 4854926:

NEW FAILURE - The following job has failed:

pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) (gh)
REGRESSION: benchmark ('symint_sum_loop', 'compile_time_instruction_count') failed, actual result 11160201125 is 169.44% higher than expected 4142000000 ±+1.50% if this is an expected regression, please update the expected results.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

maajidkhann · 2025-02-07T09:28:48Z

@pytorchbot label "module: arm"

maajidkhann · 2025-02-07T09:31:02Z

@pytorchbot label "module: cpu"

choudhary-devang · 2025-02-07T09:31:37Z

@jerryzh168 can you please review this PR? Thank you.

maajidkhann · 2025-02-07T09:34:29Z

@pytorchbot label "ciflow/linux-aarch64"

pytorch-bot · 2025-02-07T09:34:37Z

To add these label(s) (ciflow/linux-aarch64) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

maajidkhann · 2025-02-07T09:38:20Z

@jerryzh168 can you please review this PR? Thank you.

cc @digantdesai @jianyuh @malfet

jerryzh168

Thanks, I think the quantizer can be owned by ARM, so LGTM. Can you add some tests, similar to https://github.com/pytorch/pytorch/blob/main/test/quantization/pt2e/test_x86inductor_quantizer.py ?

choudhary-devang · 2025-02-12T04:50:58Z

Hi @jerryzh168, thanks for the quick response. I have added tests for the arm_inductor_quantizer config. Can you add the "ciflow/linux-aarch64" label and trigger the CI pipelines?

pytorch-bot · 2025-02-12T08:54:27Z

To add the ciflow label ciflow/linux-aarch64 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

jerryzh168 · 2025-02-12T08:55:52Z

Are these needed for the PT2E quant stack? I feel these are only needed for the older FX stack. cc @Xia-Weiwen

Yes, I think it's only needed for old stacks.

Same for qconfig.

leslie-fang-intel · 2025-02-12T10:42:23Z

Why are these code changes needed for PT2E quantization?

choudhary-devang · 2025-02-14T08:26:21Z

Hi @huydhn, when I try to reply to a comment, it shows me a "Pending" label, so the other reviewers are not able to see my comments. Can you help me with this?

choudhary-devang · 2025-02-14T08:32:42Z

Hi @jerryzh168, @Xia-Weiwen, reply to this (#146690 (comment)):

To integrate the skipIfNoArm decorator into the test file, I defined it in a way similar to skipIfNoX86. I then added ARM as a qengine in torch/backends/quantized/__init__.py. In that file, I found a note stating, "This function should correspond to the enums present in c10/core/QEngine.h," so I updated c10/core/QEngine.h accordingly.

Additionally, without the qconfig change the system defaults to the "x86" configuration, which leads to an error when we use the ARM configuration.
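For readers unfamiliar with these test helpers, here is a minimal sketch of an Arm skip decorator following the skipIfNoX86 pattern mentioned above. Assumption: the PR's version relied on registering ARM as a qengine (a change this PR later dropped), whereas this sketch detects Arm from platform.machine(), so it needs no QEngine change. The names is_arm and skip_if_no_arm are hypothetical.

```python
import functools
import platform
import unittest
from typing import Optional

ARM_MACHINES = {"aarch64", "arm64"}


def is_arm(machine: Optional[str] = None) -> bool:
    """True if the given (or current) machine string names a 64-bit Arm CPU."""
    return (machine or platform.machine()).lower() in ARM_MACHINES


def skip_if_no_arm(fn):
    """Hypothetical skipIfNoArm-style decorator for test classes and test methods."""
    reason = "Quantized operations require an Arm (AArch64) CPU."
    if isinstance(fn, type):
        # Decorating a TestCase class: mark the whole class as skipped on non-Arm hosts
        if not is_arm():
            fn.__unittest_skip__ = True
            fn.__unittest_skip_why__ = reason
        return fn

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Decorating a single test: skip at call time on non-Arm hosts
        if not is_arm():
            raise unittest.SkipTest(reason)
        return fn(*args, **kwargs)

    return wrapper


class TestArmInductorQuantizerSketch(unittest.TestCase):
    @skip_if_no_arm
    def test_runs_only_on_arm(self):
        self.assertTrue(is_arm())
```

Raising unittest.SkipTest at call time lets the same test file be collected on every platform while reporting the Arm-only tests as skipped rather than failed.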

choudhary-devang · 2025-02-14T08:34:29Z

Hi @leslie-fang-intel, reply to this comment (#146690 (comment)):

This sets the default qconfig to ARM, but only on ARM platforms. If we don't set it, and no backend argument is passed to get_default_qconfig(), the function selects x86 as the default config.
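In PyTorch, torch.ao.quantization.get_default_qconfig uses backend="x86" when no backend is passed, which is the fallback described above. The sketch below shows the platform-aware default the original change aimed for: an explicitly passed backend always wins; otherwise Arm machines get "arm" and everything else keeps "x86". The helper resolve_default_backend is hypothetical (it does not exist in PyTorch), and the "arm" backend name is an assumption.

```python
import platform
from typing import Optional

ARM_MACHINES = {"aarch64", "arm64"}


def resolve_default_backend(backend: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Hypothetical helper: pick the qconfig backend name when the caller gives none."""
    if backend is not None:
        # An explicit backend argument always takes precedence
        return backend
    host = (machine or platform.machine()).lower()
    # Assumption: "arm" is the backend name used for the Arm qconfig
    return "arm" if host in ARM_MACHINES else "x86"


if __name__ == "__main__":
    print(resolve_default_backend())  # depends on the host CPU
```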

milpuz01 · 2025-04-03T22:02:56Z

@choudhary-devang @jerryzh168 In ExecuTorch there is already an Arm quantiser (https://github.com/pytorch/executorch/blob/main/backends/arm/quantizer/arm_quantizer.py) that uses TOSA as the backend for quantization in order to target devices such as Ethos-U. I was wondering whether we can rename this quantiser to onednn_inductor_quantizer.py, as there is a lot of commonality with the x86 quantiser that targets CPU via the inductor path, and that path leverages oneDNN for efficient code?

(cc: @digantdesai @freddan80)

jerryzh168 · 2025-04-03T22:13:10Z

Are ARM ops just (1) a different implementation of oneDNN ops, or (2) will they use different hardware instructions and target different hardware? I think we can merge into oneDNN if it's (1), but we should have a separate quantizer if it's (2).

Even with (2), you can compose it with the oneDNN quantizer via composable_quantizer: https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/quantizer/composable_quantizer.py, using one quantizer to quantize one part of the model and the other quantizer to quantize the other part.
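A toy, torch-free model of the composition idea described above: quantizers are applied in list order, each annotates only the operators it handles, and nodes that an earlier quantizer already annotated are left untouched, so order decides who owns a shared operator type. This illustrates the semantics only and is not the ComposableQuantizer implementation; the dict-based graph and the make_toy_quantizer/compose helpers are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Toy graph: node name -> {"op": operator type, "annotation": owning quantizer or None}
Graph = Dict[str, Dict[str, Optional[str]]]
ToyQuantizer = Callable[[Graph], None]


def make_toy_quantizer(name: str, handled_ops: set) -> ToyQuantizer:
    """Hypothetical quantizer that claims every node whose op it handles."""
    def annotate(graph: Graph) -> None:
        for node in graph.values():
            # Skip nodes an earlier quantizer already annotated
            if node["op"] in handled_ops and node["annotation"] is None:
                node["annotation"] = name
    return annotate


def compose(quantizers: List[ToyQuantizer]) -> ToyQuantizer:
    """Apply quantizers in list order, as a composed quantizer would."""
    def annotate(graph: Graph) -> None:
        for quantizer in quantizers:
            quantizer(graph)
    return annotate


if __name__ == "__main__":
    graph: Graph = {
        "emb": {"op": "embedding", "annotation": None},
        "conv": {"op": "conv2d", "annotation": None},
        "fc": {"op": "linear", "annotation": None},
    }
    embedding_q = make_toy_quantizer("embedding_quantizer", {"embedding"})
    cpu_q = make_toy_quantizer("cpu_quantizer", {"conv2d", "linear"})
    compose([embedding_q, cpu_q])(graph)
    print({name: node["annotation"] for name, node in graph.items()})
```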

freddan80 · 2025-04-04T07:30:53Z

is ARM ops just (1) a different implementation of onednn ops, or (2) will they be using different hardware instructions and target different hardwares?

@jerryzh168 Hello, good to e-meet you! 2. This quantizer is for Arm NPUs.

I agree with @milpuz01 that we should consider changing the name. Having both an ArmQuantizer and an ArmInductorQuantizer will, I think, confuse people in the community. There's also XNNPACK, which has its own XNNPACKQuantizer that also supports Cortex-A CPUs... Hence having oneDNN in the quantizer name, path, etc. would make the most sense to me.

Perhaps there should be a naming convention for quantizers :)

@digantdesai your thought on this?

fadara01 · 2025-04-04T11:27:04Z

is ARM ops just (1) a different implementation of onednn ops

@jerryzh168, for this path, Arm and Intel basically share the same high-level API, which is oneDNN. The same mkldnn/onednn lowerings in Inductor are shared between AArch64 and x86.
The main difference between the x86 and Arm quantizers in this PR is that they use different quantization configs (e.g. s8 instead of u8 activations, and per_tensor rather than per_channel weights for Arm), because these are the configs we have optimised implementations for through oneDNN/ACL.

Having said that, I think we fall into case (1)
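To see what these two config differences mean numerically, here is a small pure-Python sketch of the underlying arithmetic. It does not call PyTorch observers; the min/max affine formula and the max_abs / 127 symmetric scale are simplifying assumptions. Switching activations from u8 to s8 keeps the scale and shifts the zero point by 128, while per-channel weight scales give small-magnitude channels finer resolution than a single per-tensor scale.

```python
from typing import List, Tuple


def affine_qparams(x_min: float, x_max: float, qmin: int, qmax: int) -> Tuple[float, int]:
    """Asymmetric (affine) scale and zero point from an observed activation range."""
    # Extend the range to include 0.0 so that zero is exactly representable
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = qmin - round(x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def symmetric_scale(values: List[float], qmax: int = 127) -> float:
    """Symmetric int8 scale (simplified max_abs / 127 convention)."""
    return max(abs(v) for v in values) / qmax


def max_roundtrip_error(values: List[float], scale: float) -> float:
    """Largest |x - dequant(quant(x))| for symmetric int8 quantization."""
    return max(abs(v - max(-128, min(127, round(v / scale))) * scale) for v in values)


# Activations observed in [-1.0, 3.0]
u8 = affine_qparams(-1.0, 3.0, 0, 255)     # x86-style unsigned activations
s8 = affine_qparams(-1.0, 3.0, -128, 127)  # Arm-style signed activations

# Weights of a layer with two output channels of very different magnitude
weights = [[0.5, -0.25], [2.0, -1.0]]
per_tensor = symmetric_scale([w for row in weights for w in row])
per_channel = [symmetric_scale(row) for row in weights]

print("u8 activations:", u8, "s8 activations:", s8)
print("channel 0 error, per-tensor:", max_roundtrip_error(weights[0], per_tensor))
print("channel 0 error, per-channel:", max_roundtrip_error(weights[0], per_channel[0]))
```

The trade-off in the second half is why per-tensor weights are a deliberate choice tied to available kernels: they cost some precision on small channels in exchange for a single scale per weight tensor.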

freddan80 · 2025-04-04T12:27:01Z

Perhaps I read the question wrong. To clarify:

ArmQuantizer: for NPUs.
arm_inductor_quantizer (this PR): as @fadara01 points out, for Arm CPU ops behind the oneDNN APIs, IIUC.

jerryzh168 · 2025-04-04T17:33:39Z

@freddan80 nice to meet you as well, also thanks for clarifications @fadara01.

I thought oneDNN was just for Intel CPUs. In that case, I think it would be better to merge into the existing X86InductorQuantizer (and probably rename it to OnednnQuantizer); in general, quantizers could be organized per backend library, like fbgemm, onednn, etc.

Xia-Weiwen · 2025-04-07T02:40:47Z

cc @leslie-fang-intel about the renaming suggestion (X86InductorQuantizer -> OnednnQuantizer)

leslie-fang-intel · 2025-04-07T02:59:43Z

Since the backend optimization for X86InductorQuantizer leverages oneDNN primitives, a GEMM template with x86 intrinsics, and Inductor C++ backend codegen, OnednnQuantizer feels less intuitive as a name than X86InductorQuantizer.

jerryzh168 · 2025-04-10T00:40:35Z

@leslie-fang-intel so what should the name be if we add ARM CPU support on top of x86 CPU?

maybe ServerCPU?

choudhary-devang · 2025-04-21T05:53:54Z

Hi @jerryzh168
As @fadara01 already mentioned above, we use a few different quantization configs on ARM
compared to x86 because these configs have optimised implementations for ARM using oneDNN/ACL.
I also plan to introduce further new configs and patterns for INT8 specifically for ARM
in future PRs, and these might not be applicable to x86.

So, I was thinking we can have a separate ARM quantizer (arm_inductor_quantizer.py), as in
this current PR, instead of merging it into a common one, for ease of maintainability and also
for the reasons mentioned above by @leslie-fang-intel.

jerryzh168 · 2025-04-21T17:35:31Z

@choudhary-devang OK, that sounds good. We just copy-pasted the PT2E quant code to torchao; could you reopen this PR in torchao instead? https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e

freddan80 · 2025-04-22T07:33:21Z

My only concern is about naming. I think oneDNN should be in the name somehow, or there'll be confusion. For example, XNNPACK has its XNNPACKQuantizer, which runs on Arm CPUs. To align with that naming convention, oneDNN should be in the name, imo. Naming is hard, but I do think arm_inductor_quantizer and arm_quantizer will be mixed up and cause confusion.

choudhary-devang · 2025-04-28T09:07:30Z

Hi @jerryzh168, I have created a new PR in torchao as requested:
pytorch/ao#2139

github-actions · 2025-06-27T09:37:22Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

pytorch-bot added the "release notes: quantization" and "release notes: AO frontend" labels on Feb 7, 2025

choudhary-devang mentioned this pull request on Feb 7, 2025: Enable fx_quantization for arm #143740 (Closed)

pytorch-bot added the "module: arm" label on Feb 7, 2025

pytorch-bot added the "module: cpu" label on Feb 7, 2025

pytorchbot added the "open source" label on Feb 7, 2025

mikaylagawarecki requested review from XuehaiPan and jerryzh168 and removed request for XuehaiPan February 7, 2025 20:05

mikaylagawarecki added the "triaged" label on Feb 7, 2025

jerryzh168 approved these changes Feb 8, 2025

choudhary-devang force-pushed the devang/pt2e_quantization_arm branch from 0774646 to 13176d6 on February 12, 2025 04:46

jerryzh168 added the "ciflow/linux-aarch64" label on Feb 12, 2025

pytorch-bot removed the "ciflow/linux-aarch64" label on Feb 12, 2025

jerryzh168 reviewed Feb 12, 2025

jerryzh168 requested review from Xia-Weiwen and leslie-fang-intel February 12, 2025 08:56

leslie-fang-intel reviewed Feb 12, 2025

fadara01 approved these changes Apr 2, 2025

choudhary-devang force-pushed the devang/pt2e_quantization_arm branch from 23338ae to 2f26b4d on April 2, 2025 09:31

choudhary-devang added 10 commits April 4, 2025 14:35

Enable pt2e quantization path for arm (07f57b9)
added the test_setup (f969b4e)
removed the QEngine and Qconfig changes (6ad7267)
removed trailing line (3e046f4)
added onednn_inductor_quantizer to remove redundant code in arm_inductor_quantizer and x86_inductor_quantizer (d31d0a6)
changed function names accroding the test_public_binding conventions (1f8de39)
removed onednn_inductor_quantizer insted use x86_inductor_quantizer (95b1391)
add the docs/source changes and adjusted the testcases for per_tensor qspec. (8b12f93)
removed unrequired docs changes (4db960a)
Revert "removed unrequired docs changes" (36325da): This reverts commit 4db960a.

choudhary-devang force-pushed the devang/pt2e_quantization_arm branch from 2f26b4d to 36325da on April 4, 2025 09:47

choudhary-devang mentioned this pull request on Apr 28, 2025: Arm_inductor_quantizer for Pt2e quantization pytorch/ao#2139 (Merged)

github-actions added the "Stale" label on Jun 27, 2025

github-actions closed this on Jul 27, 2025
