[QDQ Optimizer] Update WeightBiasQuantization to skip Conv/Gemm if downstream node is not QuantizeLinear #24537

adrianlizarraga · 2025-04-24T20:24:59Z

Description

Updates the WeightBiasQuantization optimizer to skip processing on Conv/Gemm nodes if the downstream child node is not a QuantizeLinear.

Before this PR

Original graph:

input_0 -> DQ -> Conv -> graph_output (or non-Q node)
                 ^  ^
                 |  |
weights_f32------+
                    |
bias_f32------------+

Becomes:

input_0 -> DQ ------> Conv -> graph_output (or non-Q node)
                      ^  ^
                      |  |
weights_quant -> DQ --+
                         |
bias_quant -> DQ --------+

The above is NOT a valid QDQ node unit for Conv because the Conv's output is not consumed by a QuantizeLinear node.

With this PR

The above example graph remains unchanged after L1 optimizations:

input_0 -> DQ -> Conv -> graph_output (or non-Q node)
                 ^  ^
                 |  |
weights_f32------+
                    |
bias_f32------------+
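
The skip rule described above can be sketched in a few lines of Python. This is an illustrative model only, not the optimizer's actual C++ code: `Node`, `Graph`, and `should_quantize_weight_bias` are hypothetical names, and the exact condition (every consumer of the Conv/Gemm output must be a QuantizeLinear, and the output must not be a graph output) is an assumption drawn from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a graph node."""
    name: str
    op_type: str
    inputs: list[str]
    outputs: list[str]


@dataclass
class Graph:
    """Hypothetical, minimal stand-in for a graph."""
    nodes: list[Node]
    graph_outputs: set[str] = field(default_factory=set)

    def consumers(self, tensor: str) -> list[Node]:
        # All nodes that read the given tensor.
        return [n for n in self.nodes if tensor in n.inputs]


def should_quantize_weight_bias(graph: Graph, node: Node) -> bool:
    """Return True only if the node could end up in a valid QDQ node unit."""
    if node.op_type not in {"Conv", "Gemm"}:
        return False
    output = node.outputs[0]
    # A Conv/Gemm that feeds a graph output directly stays in float.
    if output in graph.graph_outputs:
        return False
    consumers = graph.consumers(output)
    # Assumption: every downstream consumer must be a QuantizeLinear.
    return bool(consumers) and all(c.op_type == "QuantizeLinear" for c in consumers)


dq = Node("dq", "DequantizeLinear", ["input_0", "x_scale", "x_zp"], ["x_dq"])
conv = Node("conv", "Conv", ["x_dq", "weights_f32", "bias_f32"], ["y"])
q = Node("q", "QuantizeLinear", ["y", "y_scale", "y_zp"], ["y_q"])

# Conv output is a graph output: the weights and bias are left alone.
float_graph = Graph([dq, conv], graph_outputs={"y"})
# Conv output is consumed by QuantizeLinear: quantizing weights and bias is useful.
qdq_graph = Graph([dq, conv, q], graph_outputs={"y_q"})

print(should_quantize_weight_bias(float_graph, conv))
print(should_quantize_weight_bias(qdq_graph, conv))
```

In this sketch the first graph corresponds to the diagram above and is skipped, while the second graph has the trailing QuantizeLinear that makes the group a valid QDQ node unit.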

Motivation and Context

The previous behavior caused inaccuracy for a customer model. Automatically quantizing the weights and biases of a Conv/Gemm is detrimental if the output of the Conv/Gemm is not consumed by a QuantizeLinear node. In this scenario, the whole node group is not considered a valid QDQ node unit, so the EP has to run the Conv/Gemm as float32/float16 anyway. If the Conv/Gemm runs as float32/float16, then quantizing the weights and biases introduces inaccuracy for no gain.
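
To make the "inaccuracy for no gain" point concrete, here is a small pure-Python sketch of a float dot product computed once with the original weights and once with weights that were quantized and then dequantized. The symmetric per-tensor int8 scheme, the helper names, and the sample values are all assumptions chosen for illustration; they are not necessarily what the optimizer uses.

```python
def quantize_symmetric_int8(values):
    """Quantize floats to int8 with one symmetric scale (illustrative scheme)."""
    scale = max(abs(v) for v in values) / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to floats, as a DQ node would."""
    return [q * scale for q in quantized]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up sample data standing in for one output element of a Conv/Gemm.
weights = [0.8131, -0.2517, 0.0423, 0.5079]
inputs = [1.0, 2.0, -3.0, 0.5]

q_weights, scale = quantize_symmetric_int8(weights)
weights_roundtrip = dequantize(q_weights, scale)

exact = dot(inputs, weights)              # float weights, float compute
approx = dot(inputs, weights_roundtrip)   # Q -> DQ weights, still float compute
error = abs(exact - approx)

# Each weight moves by at most scale / 2, which bounds the output error.
bound = 0.5 * scale * sum(abs(x) for x in inputs)

print(f"exact={exact:.6f} approx={approx:.6f} error={error:.6f} bound={bound:.6f}")
```

Both results are computed in floating point, so the round trip through int8 only adds rounding error; no integer kernel is ever used that could repay it.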

PR that originally added this optimizer: #22969


adrianlizarraga · 2025-04-24T20:31:44Z

Hi @vraspar FYI, to be cherry-picked for ORT 1.22.0



### Description

Cherry pick the following into [rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0)

- (#24487)
- (#24466)
- (#24493)
- (#24484)
- (#24494)
- (#24489)
- (#24504)
- (#24510)
- (#24456)
- (#24537)
- (#24501)
- (#24519)
- (#24513)
- (#24539)
- (#24514)
- (#24542)
- (#24585)

Not added:

Planning to cherry pick Cuda Matmulnbits PRs once the fix for failing cuda pipeline is ready

- (#24491)
- (#24509)
- (#24564)

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: minfhong-quic <quic_minfhong@quicinc.com>
Co-authored-by: minfhong-quic <minfhong-quic@quicinc.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ankan Banerjee <ankan.ban@gmail.com>
Co-authored-by: Maximilian Müller <maximilianm@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: iraut <iraut@nvidia.com>
Co-authored-by: Hrishikesh Manohar <hrishikeshm@nvidia.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>


snnn · 2025-09-05T20:48:14Z

This PR has been included in the rel-1.22.0 branch. Removing the release:1.22.0 label.

[QDQ Optimizer] Update WeightBiasQuantization to skip nodes with unquantized output

d15eda0

adrianlizarraga requested review from HectorSVC, Lafi7e and jywu-msft April 24, 2025 20:25

adrianlizarraga added the release:1.22.0 label Apr 24, 2025

HectorSVC approved these changes Apr 24, 2025


adrianlizarraga merged commit 173a11a into main Apr 24, 2025
87 of 89 checks passed

adrianlizarraga deleted the adrianl/optimizer-weight-bias-quant-fix-regression branch April 24, 2025 22:50

vraspar mentioned this pull request Apr 28, 2025

Cherry-picks into rel-1.22.0 #24580

Merged

snnn removed the release:1.22.0 label Sep 5, 2025
