Reduce LLC loads, stores and multiplies on MulTransposed - 8% faster#16375

Merged
alalek merged 4 commits into opencv:3.4 from ChipKerchner:vectorizeMultTranspose on Jan 24, 2020

ChipKerchner (Contributor) commented Jan 17, 2020

Reduce LLC loads, stores and multiplies by 2x on MulTransposed - 8% faster
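
For readers unfamiliar with the operation, here is a minimal scalar sketch of the A^T * A product that mulTransposed computes in its aTa mode (ignoring the optional delta and scale arguments). It is not the code from this PR: the name mulTransposedRef and the row-major std::vector layout are assumptions made for illustration. It only shows why the work can be roughly halved, since the result is symmetric and each off-diagonal entry needs to be accumulated once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not OpenCV code): returns A^T * A for a
// row-major rows x cols matrix stored in a flat vector.
std::vector<double> mulTransposedRef(const std::vector<double>& a,
                                     std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols);
    std::vector<double> dst(cols * cols, 0.0);
    for (std::size_t i = 0; i < cols; ++i)
    {
        // Start at j = i: the product is symmetric, so only the upper
        // triangle is accumulated.
        for (std::size_t j = i; j < cols; ++j)
        {
            double s = 0.0;
            for (std::size_t k = 0; k < rows; ++k)
                s += a[k * cols + i] * a[k * cols + j];
            dst[i * cols + j] = s;
            dst[j * cols + i] = s; // mirror into the lower triangle
        }
    }
    return dst;
}
```

For the 2x2 input {1, 2, 3, 4} this yields {10, 14, 14, 20}.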

ChipKerchner (Contributor, Author) commented:

The last failure doesn't seem to be related to my check-in.

double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
const sT *tsrc = src + j;
#if CV_SIMD_64F
if (is_same<sT, double>::value && is_same<dT, double>::value)

Member commented:

Please try type traits from OpenCV:

DataType<sT>::depth == CV_64F && DataType<dT>::depth == CV_64F

Contributor, Author replied:

Done
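
The two styles of check discussed in this thread can be compared in a small standalone sketch. It does not include OpenCV: DepthOf is a hypothetical stand-in for a DataType-style trait, and kDepth32F / kDepth64F are local constants whose values mirror CV_32F and CV_64F. Both predicates are compile-time constants, so either form lets the compiler drop the untaken branch. The depth form simply stays within the library's own type vocabulary.

```cpp
#include <cassert>
#include <type_traits>

// Local stand-ins for the depth codes (assumed to mirror CV_32F and CV_64F).
constexpr int kDepth32F = 5;
constexpr int kDepth64F = 6;

// Hypothetical miniature of a DataType-style trait: C++ type -> depth code.
template <typename T> struct DepthOf;
template <> struct DepthOf<float>  { static const int depth = kDepth32F; };
template <> struct DepthOf<double> { static const int depth = kDepth64F; };

// Check in the style of the original patch: compare the C++ types directly.
template <typename sT, typename dT>
constexpr bool useDoublePathIsSame()
{
    return std::is_same<sT, double>::value && std::is_same<dT, double>::value;
}

// Check in the style the reviewer suggested: compare depth constants.
template <typename sT, typename dT>
constexpr bool useDoublePathDepth()
{
    return DepthOf<sT>::depth == kDepth64F && DepthOf<dT>::depth == kDepth64F;
}

// Both predicates should agree for every pairing of the two sketch types.
inline void checkPredicatesAgree()
{
    assert((useDoublePathIsSame<double, double>() == useDoublePathDepth<double, double>()));
    assert((useDoublePathIsSame<float, double>() == useDoublePathDepth<float, double>()));
    assert((useDoublePathIsSame<double, float>() == useDoublePathDepth<double, float>()));
    assert((useDoublePathIsSame<float, float>() == useDoublePathDepth<float, float>()));
}
```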

alalek (Member) commented Jan 20, 2020

/cc @terfendail BTW, dispatching is blocked by CV_MULTRANSPOSED_BASELINE_ONLY for this function.

terfendail (Contributor) commented:

The change looks fine to me.
However, I can't reproduce the performance gain. There is no dedicated performance test for the function, so I used the accuracy test's evaluation time instead. With an SSE3 baseline I measured a gain of about 3 percent; with an SSE4_2 baseline I measured a degradation of 2-3 percent. Both results look like random fluctuation rather than a stable performance change.
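
Since differences of a few percent are easily lost in run-to-run noise, one common way to steady such a comparison is to time the call repeatedly and report the median. The helper below is a generic sketch, not part of OpenCV's performance test framework. The name medianMillis and its signature are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs fn() `runs` times and returns the median
// wall-clock time of a single call, in milliseconds.
template <typename Fn>
double medianMillis(Fn&& fn, int runs)
{
    assert(runs > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    // Middle sample (the upper of the two when the count is even).
    return samples[samples.size() / 2];
}
```

A caller would wrap the function under test in a lambda, for example medianMillis([&] { /* call under test */ }, 51), and compare the medians from the two builds.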

ChipKerchner (Contributor, Author) commented:

The performance gains were measured on a Power9 VSX system. It is possible that the gains only show up on a non-Intel platform since there are different stalls related to memory access for this platform.

alalek merged commit 4d2da2d into opencv:3.4 on Jan 24, 2020
ChipKerchner deleted the vectorizeMultTranspose branch on January 27, 2020 15:53
alalek mentioned this pull request on Jan 28, 2020