Enable pt2e quantization path for arm #146690

Closed
choudhary-devang wants to merge 10 commits into pytorch:main from choudhary-devang:devang/pt2e_quantization_arm

Conversation

@choudhary-devang

@choudhary-devang choudhary-devang commented Feb 7, 2025


Title: Enable PyTorch 2 Export Quantization path for ARM CPUs.

Description:

  • This PR extends the PyTorch 2 Export Quantization (PT2E Quantization) workflow—originally available only on x86 CPUs—to support ARM platforms. PT2E Quantization is an automated, full-graph quantization solution in PyTorch that improves on Eager Mode Quantization by adding support for functionals and automating the overall process. It is part of the torch.ao module and fully supports quantization when using the compile mode.

Key Changes:

  • Introduces ARM-specific support by leveraging oneDNN kernels for matmuls and convolution.

  • Integrates pre-defined configuration selection to automatically choose the best quantization settings based on the selected quantization method.

Provides customization options via two flags:

  • qat_state: Indicates whether to use Quantization Aware Training (if set to True) or Post Training Quantization (if set to False). The default remains False.
  • dynamic_state: Selects between dynamic quantization (if True) and static quantization (if False). The default is also set to False.

These options allow users to tailor the quantization process for their specific workload requirements (e.g., using QAT for fine-tuning or PTQ for calibration-based quantization).
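
As a rough illustration of how the two flags combine, the sketch below enumerates the four resulting modes. The names `QuantMode` and `describe_quant_mode` are hypothetical and are not part of this PR or of torch.ao; only the flag names and their defaults are taken from the description above.

```python
from typing import NamedTuple


class QuantMode(NamedTuple):
    """Hypothetical summary of what the two flags select."""
    training: str    # "QAT" or "PTQ"
    activation: str  # "dynamic" or "static"


def describe_quant_mode(qat_state: bool = False, dynamic_state: bool = False) -> QuantMode:
    """Map the two boolean flags onto the workflow they select."""
    return QuantMode(
        training="QAT" if qat_state else "PTQ",
        activation="dynamic" if dynamic_state else "static",
    )


# Enumerate all four combinations; the defaults select static PTQ.
for qat in (False, True):
    for dyn in (False, True):
        print(qat, dyn, describe_quant_mode(qat, dyn))
```

With both flags left at their defaults, this corresponds to static post-training quantization, which is the mode the example script below exercises.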

Testing and Validation:

The new ARM flow has been thoroughly tested across a range of models with all combinations:

  • NLP: models such as BERT and T5.
  • Vision: models such as ResNet and ViT.
  • Custom models: user-defined models with various operators.

Example script:

import torch
import torchvision.models as models
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
# Quantizer module added by this PR
import torch.ao.quantization.quantizer.arm_inductor_quantizer as armiq
from torch.profiler import profile, record_function, ProfilerActivity

model_name = "resnet50"
model = models.__dict__[model_name](pretrained=True)

# Set the model to eval mode
model = model.eval()

# Create the data, using dummy data here as an example
traced_bs = 500
x = torch.randn(traced_bs, 3, 224, 224).contiguous(memory_format=torch.channels_last)
example_inputs = (x,)

with torch.no_grad():
    # Export the model, then annotate it with the ARM quantizer (static PTQ)
    exported_model = torch.export.export_for_training(model, example_inputs).module()
    quantizer = armiq.ArmInductorQuantizer()
    quantizer.set_global(armiq.get_default_arm_inductor_quantization_config(is_dynamic=False))
    prepared_model = prepare_pt2e(exported_model, quantizer)

    # Calibrate the observers (dummy data here; use representative data in practice)
    prepared_model(*example_inputs)
    converted_model = convert_pt2e(prepared_model)

    for _ in range(50):
        converted_model(*example_inputs)  # Warmup
    print("Warmup over")

    # Profile 100 inference iterations on CPU
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                converted_model(*example_inputs)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01

@pytorch-bot

pytorch-bot Bot commented Feb 7, 2025


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146690

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 36325da with merge base 4854926 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@maajidkhann

Contributor

@pytorchbot label "module: arm"

@pytorch-bot pytorch-bot Bot added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Feb 7, 2025
@maajidkhann

Contributor

@pytorchbot label "module: cpu"

@pytorch-bot pytorch-bot Bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Feb 7, 2025
@choudhary-devang

Author

@jerryzh168 can you please review this PR? Thank you.

@maajidkhann

Contributor

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot

pytorch-bot Bot commented Feb 7, 2025


To add these label(s) (ciflow/linux-aarch64) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@maajidkhann

Contributor

@jerryzh168 can you please review this pr, thankyou.

cc @digantdesai @jianyuh @malfet

@mikaylagawarecki mikaylagawarecki requested review from XuehaiPan and jerryzh168 and removed request for XuehaiPan February 7, 2025 20:05
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Feb 7, 2025

@jerryzh168 jerryzh168 left a comment

Contributor

thanks, the quantizer can be owned by ARM I think, so LGTM. can you add some tests, similar to https://github.com/pytorch/pytorch/blob/main/test/quantization/pt2e/test_x86inductor_quantizer.py ?

@choudhary-devang choudhary-devang force-pushed the devang/pt2e_quantization_arm branch from 0774646 to 13176d6 Compare February 12, 2025 04:46
@choudhary-devang

choudhary-devang commented Feb 12, 2025

Author

Hi @jerryzh168, thanks for the quick response. I added tests for the arm_inductor_quantizer config. Can you add the label "ciflow/linux-aarch64" and trigger the CI pipelines?

@jerryzh168 jerryzh168 added the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 12, 2025
@pytorch-bot

pytorch-bot Bot commented Feb 12, 2025


To add the ciflow label ciflow/linux-aarch64 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot Bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 12, 2025
Comment thread c10/core/QEngine.h Outdated

@jerryzh168 jerryzh168 Feb 12, 2025

Contributor

are these needed for pt2e quant stack? I feel these are only needed for the older fx stack. cc @Xia-Weiwen

Collaborator

Yes, I think it's only needed for old stacks.

Collaborator

Same for qconfig.

Comment thread torch/ao/quantization/qconfig.py Outdated

Collaborator

Why are these code changes needed for PT2E quantization?

@choudhary-devang

Author

Hi @huydhn, when I try to reply to a review comment, my reply is shown with a "Pending" label, so the other reviewers are not able to see my comments. Can you help me with this?

@choudhary-devang

Author

Hi @jerryzh168, @Xia-Weiwen, replying to this comment (#146690 (comment)):

To integrate the skipIfNoArm decorator into the test file, I defined it in a way similar to skipIfNoX86. I then added ARM as a qengine in torch/backends/quantized/__init__.py. In that file, I found a note stating, "This function should correspond to the enums present in c10/core/QEngine.h," so I updated c10/core/QEngine.h accordingly.

Additionally, without the qconfig change, the system defaults to the "x86" configuration, which leads to an error when the ARM configuration is used.
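
For readers unfamiliar with this kind of test decorator, here is a minimal sketch of what a skipIfNoArm-style helper could look like. It is an assumption, not the PR's code: the host check via `platform.machine()` is a stand-in (the actual decorator may instead consult the list of supported quantized engines), and `ExampleArmTest` is a hypothetical test case.

```python
import platform
import unittest


def skipIfNoArm(fn):
    """Skip the decorated test unless the host is a 64-bit Arm machine.

    Assumption: platform.machine() is used for detection; the real
    decorator in the PR may use a different check.
    """
    is_arm = platform.machine().lower() in ("aarch64", "arm64")
    return unittest.skipUnless(is_arm, "requires an Arm (aarch64) CPU")(fn)


class ExampleArmTest(unittest.TestCase):
    @skipIfNoArm
    def test_runs_only_on_arm(self):
        # Only reached on Arm hosts; reported as skipped elsewhere.
        self.assertIn(platform.machine().lower(), ("aarch64", "arm64"))
```

On a non-Arm host the test is reported as skipped rather than failed, which keeps the x86 CI jobs green while still exercising the test on aarch64 runners.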

@choudhary-devang

Author

Hi @leslie-fang-intel, replying to this comment (#146690 (comment)):

The change sets the default qconfig to arm only on Arm platforms. Without it, if the backend argument is not passed to get_default_qconfig(), the function selects x86 as the default config.
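
The behaviour described here can be sketched as a platform-dependent default. This is only an illustration under stated assumptions: `default_backend_for` and `resolve_backend` are hypothetical helpers, not functions in torch.ao, and the machine strings checked are a guess at how an Arm host would be detected.

```python
import platform
from typing import Optional


def default_backend_for(machine: Optional[str] = None) -> str:
    """Pick a default backend name from the machine string (hypothetical helper)."""
    machine = (machine or platform.machine()).lower()
    if machine in ("aarch64", "arm64"):
        return "arm"
    return "x86"


def resolve_backend(backend: Optional[str] = None) -> str:
    """Use the caller's backend if given, else fall back to the platform default."""
    return backend if backend is not None else default_backend_for()


print(default_backend_for("aarch64"))  # platform default on an Arm host
print(default_backend_for("x86_64"))   # platform default on an x86 host
print(resolve_backend("x86"))          # an explicit backend always wins
```

An explicitly passed backend is never overridden; only the implicit default changes with the host architecture.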

@choudhary-devang choudhary-devang force-pushed the devang/pt2e_quantization_arm branch from 23338ae to 2f26b4d Compare April 2, 2025 09:31
@milpuz01

milpuz01 commented Apr 3, 2025

Contributor

@choudhary-devang @jerryzh168 In ExecuTorch there is already an Arm quantiser (https://github.com/pytorch/executorch/blob/main/backends/arm/quantizer/arm_quantizer.py) that uses TOSA as the backend for quantization in order to target devices such as Ethos-U. I was wondering whether we can rename this quantiser to onednn_inductor_quantizer.py, as there is a lot of commonality with the x86 quantiser, which targets CPU via the inductor path, and that path leverages oneDNN for efficient code?

(cc: @digantdesai @freddan80)

@jerryzh168

Contributor

> @choudhary-devang @jerryzh168 In ExecuTorch there is already Arm quantiser (pytorch/executorch@main/backends/arm/quantizer/arm_quantizer.py) that is using TOSA as backend for quantization in order to target devices such as Ethos-U. I was wondering whether we can rename this quantiser to be onednn_inductor_quantizer.py as there is lot of commonality with x86 quantiser that is targeting CPU via inductor path and that path is leveraging oneDNN for efficient code?
>
> (cc: @digantdesai @freddan80)

Are the ARM ops just (1) a different implementation of the onednn ops, or (2) will they use different hardware instructions and target different hardware? I think we can merge into onednn if it's (1), but we should have a separate quantizer if it's (2).

Even with (2), you can compose it with the onednn quantizer using composable_quantizer: https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/quantizer/composable_quantizer.py, using one quantizer to quantize one part of the model and the other quantizer to quantize the other part.
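
The composition idea can be illustrated with a toy, torch-free sketch. `ToyQuantizer` and `ToyComposableQuantizer` are hypothetical stand-ins and do not reproduce the real ComposableQuantizer in torch.ao (which, as far as I know, is constructed from a list of quantizers and applies them in order); the "first quantizer to claim an op wins" rule is an assumption made for the example.

```python
from typing import Callable, Dict, List


class ToyQuantizer:
    """Toy quantizer: tags the ops it claims with its own name."""

    def __init__(self, name: str, handles: Callable[[str], bool]):
        self.name = name
        self.handles = handles

    def annotate(self, ops: List[str], tags: Dict[str, str]) -> None:
        for op in ops:
            # Leave ops alone if an earlier quantizer already claimed them.
            if op not in tags and self.handles(op):
                tags[op] = self.name


class ToyComposableQuantizer:
    """Applies several toy quantizers in order over the same list of ops."""

    def __init__(self, quantizers: List[ToyQuantizer]):
        self.quantizers = quantizers

    def annotate(self, ops: List[str]) -> Dict[str, str]:
        tags: Dict[str, str] = {}
        for quantizer in self.quantizers:
            quantizer.annotate(ops, tags)
        return tags


ops = ["conv2d", "linear", "embedding"]
composed = ToyComposableQuantizer([
    ToyQuantizer("arm", lambda op: op in ("conv2d", "linear")),
    ToyQuantizer("other", lambda op: True),
])
print(composed.annotate(ops))
```

Here the first quantizer takes the convolution and linear ops and the second picks up whatever is left, which mirrors the "one quantizer for one part of the model, another for the rest" arrangement suggested above.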

@freddan80


> is ARM ops just (1) a different implementation of onednn ops, or (2) will they be using different hardware instructions and target different hardwares?

@jerryzh168 Hello, good to e-meet you! 2. This quantizer is for Arm NPUs.

I agree with @milpuz01, we should consider changing the name. Having an ArmQuantizer and an ArmInductorQuantizer will confuse people in the community, I think. There's also XNN-pack, which has its own XNNpackQuantizer, which also supports Cortex-A CPUs... Hence having OneDNN in the quantizer name, path, etc. would make most sense to me.

Perhaps there should be a naming convention for quantizers :)

@digantdesai your thoughts on this?

@choudhary-devang choudhary-devang force-pushed the devang/pt2e_quantization_arm branch from 2f26b4d to 36325da Compare April 4, 2025 09:47
@fadara01

fadara01 commented Apr 4, 2025

Collaborator

> is ARM ops just (1) a different implementation of onednn ops

@jerryzh168, for this path, Arm and Intel basically share the same high-level API, which is oneDNN. The same mkldnn/onednn lowerings in inductor are shared between aarch64 and x86.
The main difference between the x86 and Arm quantizers in this PR is that they use different quantization configs (e.g. s8 instead of u8 activations, and per_tensor rather than per_channel weights for Arm, because these are the configs we have optimised implementations for through oneDNN/ACL).

Having said that, I think we fall into case (1).
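
To make the config difference concrete, the following torch-free sketch contrasts per-tensor and per-channel symmetric int8 weight scales and lists the s8 and u8 integer ranges. The helper names and the sample weight matrix are made up for illustration; this is not the observer code used by either quantizer.

```python
from typing import List


def symmetric_scale(values: List[float], qmax: int = 127) -> float:
    """Scale that maps the largest magnitude in `values` onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def per_tensor_scale(weight: List[List[float]]) -> float:
    """One scale shared by every output channel (row)."""
    return symmetric_scale([v for row in weight for v in row])


def per_channel_scales(weight: List[List[float]]) -> List[float]:
    """One scale per output channel (row)."""
    return [symmetric_scale(row) for row in weight]


# Integer ranges of the two activation dtypes mentioned above.
S8_RANGE = (-128, 127)  # signed 8-bit
U8_RANGE = (0, 255)     # unsigned 8-bit

# Made-up 2x2 weight: the second row has much smaller magnitudes.
weight = [[0.5, -1.27], [0.01, 0.0254]]
print("per-tensor scale:", per_tensor_scale(weight))
print("per-channel scales:", per_channel_scales(weight))
```

With per-tensor scaling the small-magnitude row shares the scale dictated by the largest weight in the whole tensor, so it is represented more coarsely than with per-channel scaling; the trade-off accepted here is that the per-tensor form is the one with optimised kernels on Arm, according to the comment above.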

@freddan80


Perhaps I read the question wrong. To clarify:

ArmQuantizer: for NPUs.
arm_inductor_quantizer (this PR): as @fadara01 points out, it is for Arm CPU ops behind the OneDNN APIs, IIUC.

@jerryzh168

jerryzh168 commented Apr 4, 2025

Contributor

@freddan80 nice to meet you as well, also thanks for clarifications @fadara01.

I thought oneDNN is just for intel cpu, in that case I think it will be better to merge into the existing X86InductorQuantizer (and should probably rename this to OnednnQuantizer), in general it can be per backend library I think, like fbgemm, onednn etc.

@Xia-Weiwen

Collaborator

> @freddan80 nice to meet you as well, also thanks for clarifications @fadara01.
>
> I thought oneDNN is just for intel cpu, in that case I think it will be better to merge into the existing X86InductorQuantizer (and should probably rename this to OnednnQuantizer), in general it can be per backend library I think, like fbgemm, onednn etc.

cc @leslie-fang-intel about the renaming suggestion (X86InductorQuantizer -> OnednnQuantizer)

@leslie-fang-intel

Collaborator

> @freddan80 nice to meet you as well, also thanks for clarifications @fadara01.
> I thought oneDNN is just for intel cpu, in that case I think it will be better to merge into the existing X86InductorQuantizer (and should probably rename this to OnednnQuantizer), in general it can be per backend library I think, like fbgemm, onednn etc.
>
> cc @leslie-fang-intel about the renaming suggestion (X86InductorQuantizer -> OnednnQuantizer)

Since the backend optimization for X86InductorQuantizer leverages oneDNN primitives, GEMM templates with x86 intrinsics, and Inductor CPP backend codegen, it feels like OnednnQuantizer may not be as intuitive as X86InductorQuantizer.

@jerryzh168

jerryzh168 commented Apr 10, 2025

Contributor

> @freddan80 nice to meet you as well, also thanks for clarifications @fadara01.
> I thought oneDNN is just for intel cpu, in that case I think it will be better to merge into the existing X86InductorQuantizer (and should probably rename this to OnednnQuantizer), in general it can be per backend library I think, like fbgemm, onednn etc.
>
> cc @leslie-fang-intel about the renaming suggestion (X86InductorQuantizer -> OnednnQuantizer)
>
> Since for the backend optimization of X86InductorQuantizer, we will leverage both oneDNN primitive, GEMM Template with X86 intrinsic and Inductor CPP Backend codegen, feels like OnednnQuantizer may not be as intuitive as X86InductorQuantizer.

@leslie-fang-intel so what should the name be if we add ARM CPU support on top of x86 CPU?

Maybe ServerCPU?

@choudhary-devang

Author

> @freddan80 nice to meet you as well, also thanks for clarifications @fadara01.
> I thought oneDNN is just for intel cpu, in that case I think it will be better to merge into the existing X86InductorQuantizer (and should probably rename this to OnednnQuantizer), in general it can be per backend library I think, like fbgemm, onednn etc.
>
> cc @leslie-fang-intel about the renaming suggestion (X86InductorQuantizer -> OnednnQuantizer)
>
> Since for the backend optimization of X86InductorQuantizer, we will leverage both oneDNN primitive, GEMM Template with X86 intrinsic and Inductor CPP Backend codegen, feels like OnednnQuantizer may not be as intuitive as X86InductorQuantizer.
>
> @leslie-fang-intel so what should name be if we add ARM CPU support on top of x86 CPU?
>
> maybe ServerCPU?

Hi @jerryzh168,
As @fadara01 already mentioned above, we use a few different quantization configs on ARM
compared to x86, because these configs have optimised implementations for ARM using oneDNN/ACL.
I also plan to introduce further new INT8 configs and patterns specifically for ARM
in future PRs, and these might not be applicable to x86.

So I was thinking we can have a separate ARM quantizer (arm_inductor_quantizer.py), as in
the current PR, instead of merging it into a common one, for ease of maintainability and also
for the reasons mentioned above by @leslie-fang-intel.

@jerryzh168

Contributor

@choudhary-devang OK that sounds good, we just copy pasted the pt2e quant code to torchao, could you reopen this PR in torchao instead? https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e

@freddan80


> So, I was thinking we can have a seperate ARM quantizer (arm_inductor_quantizer.py) like how
> it is this currrent PR instead of merging it into common one for ease of maintainibility and also
> for reasons mentioned above by @leslie-fang-intel

My only concern is about naming. I think OneDNN should be there in the name somehow, or there'll be confusion. For example, XNNpack has its XNNpackQuantizer, which runs on Arm CPUs. To align with that naming convention, OneDNN should be in the name, imo. Naming is hard, and I do think arm_inductor_quantizer and arm_quantizer will be mixed up and cause confusion.

@choudhary-devang

Author

> @choudhary-devang OK that sounds good, we just copy pasted the pt2e quant code to torchao, could you reopen this PR in torchao instead? https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e

Hi @jerryzh168, I have created a new PR in torchao as requested: pytorch/ao#2139

@github-actions

Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions Bot added the Stale label Jun 27, 2025
@github-actions github-actions Bot closed this Jul 27, 2025

Labels

  • module: arm - Related to ARM architectures builds of PyTorch. Includes Apple M1
  • module: cpu - CPU specific problem (e.g., perf, algorithm)
  • open source
  • release notes: AO frontend
  • release notes: quantization - release notes category
  • Stale
  • triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module
