[Transformations] Apply SDPA scale after MatMul(Q, K^T) per specification #34177

mryzhov merged 22 commits into openvinotoolkit:master from
Conversation
There might be a case where SDPA is not fused and will not be fused at the conversion stage either, which breaks the SDPAToPA case, since it requires the fused SDPA. There may be other problematic cases as well.
What do you mean? Is the order of operations between the fused and unfused SDPA implementations different? How so? This is a strict formula with matrix multiplication, where order is crucial.
I updated the approach with a new description.
Pull request overview
Updates the ScaledDotProductAttentionDecomposition transformation to preserve the original scaling order in cases where the SDPA query input is already pre-scaled, aiming to reduce FP32 rounding divergence after decomposition (notably for CPU paths that decompose SDPA to MatMul+Softmax+MatMul).
Changes:
- Add an `is_query_prescaled()` heuristic and, when it matches, apply `scale` to `K^T` instead of applying it on `Q` or after the first MatMul.
- Extend the decomposition unit test suite with a regression-style graph-structure test for the pre-scaled query case.
- Update the test helper to optionally build a reference graph that applies scale on `K^T`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp` | Adds pre-scaled-query detection and switches scale placement to `K^T` in that case. |
| `src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp` | Adds a new test validating the expected decomposition structure for pre-scaled query + explicit scale. |
`SDPAFusion` was registered inside MOCTransformations, which runs both during `ov.convert_model()` and during `compile_model()`. The CPU plugin explicitly disables `SDPAFusion` because it does not use SDPA nodes, but when `ov.convert_model()` is used, SDPA nodes are already created before `compile_model()` is called. The CPU plugin then decomposes them back via `ScaledDotProductAttentionDecomposition`, but the decomposed graph has a different FP32 computation order, causing accuracy loss that amplifies through transformer layers (0.91 max_diff on RFDetr with 14 attention blocks).

Move `SDPAFusion` and `SDPAScaleFusion` to CommonOptimizations so they only run during `compile_model()`, where each plugin controls whether SDPA nodes are created and kept.

Tickets: CVS-180477
When `SDPAFusion` absorbs a K-side scale into the SDPA node, the query input may already be pre-scaled (e.g. `Q * 0.353`). During decomposition, the scale was applied to Q again or moved after the MatMul, changing the FP32 computation order from the original `(Q * s) @ (K^T * s)` to `((Q * s) * s) @ K^T`. While mathematically equivalent, this produces different intermediate rounding and accumulates ~0.91 max_diff over 14 transformer layers in models like RFDetr.

Add an `is_query_prescaled()` check in `ScaledDotProductAttentionDecomposition`: if Q is already a `Multiply(input, scalar_constant)`, apply the scale to `K^T` instead, restoring the original computation order.

Fixes CVS-180477 (Bug 2a).
Remove the `is_query_prescaled()` heuristic and the `can_move_scale_after_matmul()` optimization. Always apply the SDPA scale to `K^T` during decomposition, since the scale logically belongs to `K^T` (absorbed from the K side by `SDPAFusion`). This is mathematically equivalent for all cases and preserves the original computation order for models with symmetric Q/K pre-scaling (e.g. PyTorch `scaled_dot_product_attention` export), fixing the FP32 rounding divergence that accumulated through transformer layers (CVS-180477, Bug 2a).

Tickets: CVS-181409, CVS-180477
…fallback

Address PR openvinotoolkit#34177 review comments:
- [HIGH] Restore the `can_move_scale_after_matmul()` size-based heuristic as a performance fallback for non-prescaled query cases (e.g. decode, S_q=1)
- [LOW] Reword comments so they do not imply `SDPAFusion` is always involved

Three-way scale placement logic:
1. Q pre-scaled (`Multiply(Q, scalar_const)`) -> scale `K^T` (precision fix)
2. `can_move_scale_after_matmul` -> scale after the MatMul (perf optimization)
3. Default -> scale Q
5e8c833 to fa89e73
Pull request overview
Updates the ScaledDotProductAttention decomposition to apply the SDPA scale factor on the K^T operand unconditionally, aligning the decomposition’s computation order with graphs produced by SDPAFusion (notably for PyTorch-exported pre-scaled Q/K) to reduce FP32 rounding drift.
Changes:
- Changed SDPA decomposition to always multiply `K^T` by `scale` before the Q·K^T MatMul (removing the previous heuristic that sometimes scaled Q or the post-MatMul output).
- Simplified the unit-test reference decomposition helper to match the new behavior.
- Renamed/updated scale-related tests and added coverage for the "pre-scaled query" case to validate the intended computation order.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp` | Applies scale on `K^T` unconditionally in the decomposition (removes size-based conditional scaling). |
| `src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp` | Updates the reference graph builder and test expectations; adds a regression test for pre-scaled query to ensure scaling is applied on `K^T`. |
Pull request overview
This PR updates the ScaledDotProductAttentionDecomposition transformation to apply the SDPA scale unconditionally after MatMul(Q, K^T), aligning the decomposition with the SDPA operation specification and avoiding FP32 reordering differences introduced by scaling Q before the matmul in some cases.
Changes:
- Removed the conditional logic that sometimes applied `scale` before `MatMul` (i.e., on `Q`) and now always computes `(Q @ K^T) * scale`.
- Updated/renamed existing unit tests to reflect the new, unconditional "multiply after matmul" behavior.
- Added a regression-style test covering the case where `Q` is already pre-scaled upstream, ensuring decomposition still applies the SDPA `scale` after `MatMul`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp` | Simplifies the decomposition to always apply scale after `MatMul(Q, K^T)` per the SDPA spec. |
| `src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp` | Aligns references/tests with the new decomposition behavior and adds coverage for pre-scaled Q. |
Co-authored-by: Mikhail Ryzhov <mikhail.ryzhov@intel.com>
Co-authored-by: Andrii Staikov <andrii.staikov@intel.com>
Pull request overview
This PR updates the ScaledDotProductAttentionDecomposition transformation to apply the SDPA scale factor unconditionally after MatMul(Q, K^T), matching the SDPA operation specification and avoiding numerical drift caused by alternative (pre-MatMul) scaling orders.
Changes:
- Remove the `can_move_scale_after_matmul` heuristic and always compute `(Q @ K^T) * scale` in the decomposition pass.
- Update decomposition unit tests to reflect the new unconditional post-MatMul scaling behavior.
- Add a regression test covering the "pre-scaled Q + SDPA scale" scenario to ensure scale is still applied after `MatMul(Q, K^T)`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp` | Simplifies the SDPA decomposition to always apply scale after `MatMul(Q, K^T)` per the spec. |
| `src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp` | Adjusts existing references to the new behavior and adds a test for pre-scaled query inputs. |
Summary
Apply the SDPA scale factor unconditionally after `MatMul(Q, K^T)`, per the SDPA specification: `attn_weight = Q @ K^T * scale`.

This replaces the previous conditional logic (`can_move_scale_after_matmul`) that applied scale either after the MatMul or before it on Q, depending on shape analysis.

Details
When `ov.convert_model()` processes transformer models, `SDPAFusion` creates SDPA nodes during MOC transformations. The CPU plugin then decomposes these back to `MatMul + Softmax + MatMul` via `ScaledDotProductAttentionDecomposition`. The previous decomposition sometimes applied scale to Q before the MatMul (`(Q * scale) @ K^T`), which changed the FP32 computation order compared to the original graph. While mathematically equivalent, the rounding differences accumulate through residual connections across transformer layers (up to 0.91 max_diff on RFDetr with 14 attention blocks).

The fix simplifies the decomposition to always compute `(Q @ K^T) * scale`, which:

Tickets: