[Transformations] Apply SDPA scale after MatMul(Q, K^T) per specification #34177

mryzhov merged 22 commits into openvinotoolkit:master from
Conversation
There might be a case where SDPA is not fused and will not be fused at the conversion stage either, which breaks the SDPAToPA case, since it requires the fused SDPA. There may be other problematic cases as well.
What do you mean? Is the order of operations between the fused and unfused SDPA implementations different? How so? This is a strict formula with matrix multiplication, where order is crucial.
I updated the approach with a new description.
Pull request overview
Updates the ScaledDotProductAttentionDecomposition transformation to preserve the original scaling order in cases where the SDPA query input is already pre-scaled, aiming to reduce FP32 rounding divergence after decomposition (notably for CPU paths that decompose SDPA to MatMul+Softmax+MatMul).
Changes:
- Add an `is_query_prescaled()` heuristic and, when it matches, apply `scale` to `K^T` instead of applying it on `Q` or after the first MatMul.
- Extend the decomposition unit test suite with a regression-style graph-structure test for the pre-scaled query case.
- Update the test helper to optionally build a reference graph that applies scale on `K^T`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp` | Adds pre-scaled-query detection and switches scale placement to `K^T` in that case. |
| `src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp` | Adds a new test validating the expected decomposition structure for pre-scaled query + explicit scale. |
`SDPAFusion` was registered inside MOCTransformations, which runs both during `ov.convert_model()` and during `compile_model()`. The CPU plugin explicitly disables `SDPAFusion` because it does not use SDPA nodes, but when `ov.convert_model()` is used, SDPA nodes are already created before `compile_model()` is called. The CPU plugin then decomposes them back via `ScaledDotProductAttentionDecomposition`, but the decomposed graph has a different FP32 computation order, causing accuracy loss that amplifies through transformer layers (0.91 max_diff on RFDetr with 14 attention blocks).

Move `SDPAFusion` and `SDPAScaleFusion` to CommonOptimizations so they only run during `compile_model()`, where each plugin controls whether SDPA nodes are created and kept.

Tickets: CVS-180477
When `SDPAFusion` absorbs a K-side scale into the SDPA node, the query input may already be pre-scaled (e.g. `Q * 0.353`). During decomposition, the scale was applied to Q again or moved after the MatMul, changing the FP32 computation order from the original `(Q * s) @ (K^T * s)` to `((Q * s) * s) @ K^T`. While mathematically equivalent, this produces different intermediate rounding and accumulates ~0.91 max_diff over 14 transformer layers in models like RFDetr.

Add an `is_query_prescaled()` check in `ScaledDotProductAttentionDecomposition`: if Q is already a `Multiply(input, scalar_constant)`, apply the scale to `K^T` instead, restoring the original computation order.

Fixes CVS-180477 (Bug 2a).
Remove the `is_query_prescaled()` heuristic and the `can_move_scale_after_matmul()` optimization. Always apply the SDPA scale to `K^T` during decomposition, since the scale logically belongs to `K^T` (absorbed from the K side by `SDPAFusion`). This is mathematically equivalent for all cases and preserves the original computation order for models with symmetric Q/K pre-scaling (e.g. PyTorch `scaled_dot_product_attention` export), fixing the FP32 rounding divergence that accumulated through transformer layers (CVS-180477, Bug 2a).

Tickets: CVS-181409, CVS-180477
…fallback

Address PR openvinotoolkit#34177 review comments:
- [HIGH] Restore the `can_move_scale_after_matmul()` size-based heuristic as a performance fallback for non-prescaled query cases (e.g. decode, S_q=1)
- [LOW] Reword comments so they do not imply `SDPAFusion` is always involved

Three-way scale placement logic:
1. Q pre-scaled (`Multiply(Q, scalar_const)`) -> scale `K^T` (precision fix)
2. `can_move_scale_after_matmul` -> scale after the MatMul (perf optimization)
3. Default -> scale Q
5e8c833 to fa89e73
Pull request overview
Updates the ScaledDotProductAttention decomposition to apply the SDPA scale factor on the K^T operand unconditionally, aligning the decomposition’s computation order with graphs produced by SDPAFusion (notably for PyTorch-exported pre-scaled Q/K) to reduce FP32 rounding drift.
Changes:
- Changed SDPA decomposition to always multiply `K^T` by `scale` before the Q·K^T MatMul (removing the previous heuristic that sometimes scaled Q or the post-MatMul output).
- Simplified the unit-test reference decomposition helper to match the new behavior.
- Renamed/updated scale-related tests and added coverage for the "pre-scaled query" case to validate the intended computation order.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp` | Applies scale on `K^T` unconditionally in the decomposition (removes size-based conditional scaling). |
| `src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp` | Updates the reference graph builder and test expectations; adds a regression test for pre-scaled query to ensure scaling is applied on `K^T`. |
Pull request overview
This PR updates the ScaledDotProductAttentionDecomposition transformation to apply the SDPA scale unconditionally after MatMul(Q, K^T), aligning the decomposition with the SDPA operation specification and avoiding FP32 reordering differences introduced by scaling Q before the matmul in some cases.
Changes:
- Removed the conditional logic that sometimes applied `scale` before `MatMul` (i.e., on `Q`) and now always computes `(Q @ K^T) * scale`.
- Updated/renamed existing unit tests to reflect the new, unconditional "multiply after matmul" behavior.
- Added a regression-style test covering the case where `Q` is already pre-scaled upstream, ensuring decomposition still applies the SDPA `scale` after `MatMul`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp` | Simplifies the decomposition to always apply scale after `MatMul(Q, K^T)` per the SDPA spec. |
| `src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp` | Aligns references/tests with the new decomposition behavior and adds coverage for pre-scaled Q. |
Co-authored-by: Mikhail Ryzhov <mikhail.ryzhov@intel.com>
Co-authored-by: Andrii Staikov <andrii.staikov@intel.com>
Pull request overview
This PR updates the ScaledDotProductAttentionDecomposition transformation to apply the SDPA scale factor unconditionally after MatMul(Q, K^T), matching the SDPA operation specification and avoiding numerical drift caused by alternative (pre-MatMul) scaling orders.
Changes:
- Remove the `can_move_scale_after_matmul` heuristic and always compute `(Q @ K^T) * scale` in the decomposition pass.
- Update decomposition unit tests to reflect the new unconditional post-MatMul scaling behavior.
- Add a regression test covering the "pre-scaled Q + SDPA scale" scenario to ensure scale is still applied after `MatMul(Q, K^T)`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp` | Simplifies the SDPA decomposition to always apply scale after `MatMul(Q, K^T)` per the spec. |
| `src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp` | Adjusts existing references to the new behavior and adds a test for pre-scaled query inputs. |
Summary
Apply the SDPA scale factor unconditionally after `MatMul(Q, K^T)`, per the SDPA specification: `attn_weight = Q @ K^T * scale`.

This replaces the previous conditional logic (`can_move_scale_after_matmul`) that applied scale either after the MatMul or before it on Q, depending on shape analysis.

Details
When `ov.convert_model()` processes transformer models, `SDPAFusion` creates SDPA nodes during MOC transformations. The CPU plugin then decomposes these back to `MatMul + Softmax + MatMul` via `ScaledDotProductAttentionDecomposition`. The previous decomposition sometimes applied scale to Q before the MatMul (`(Q * scale) @ K^T`), which changed the FP32 computation order compared to the original graph. While mathematically equivalent, the rounding differences accumulate through residual connections across transformer layers (up to 0.91 max_diff on RFDetr with 14 attention blocks).

The fix simplifies the decomposition to always compute `(Q @ K^T) * scale`, which:

Tickets: