Unroll multiply and add instructions in dotProd_32f #15136

Merged
alalek merged 2 commits into opencv:3.4 from ChipKerchner:dotProd_unroll on Jul 25, 2019
@ChipKerchner
Contributor

Unroll the multiply and add instructions in dotProd_32f into separate accumulators so that their latencies overlap - 35% faster.
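The idea can be illustrated with a minimal scalar sketch. This is not the code from the PR, which uses OpenCV's vector intrinsics; the function name `dot_unrolled` and the unroll factor of four are assumptions made for illustration only. Each accumulator forms its own dependency chain, so a multiply-add on one chain does not have to wait for the previous multiply-add on another.

```cpp
#include <cstddef>

// Hypothetical scalar stand-in for the vectorized loop (not the OpenCV source).
// The unroll factor of 4 is an assumption chosen for illustration.
double dot_unrolled(const float* a, const float* b, std::size_t n)
{
    // Four independent accumulators: each one depends only on its own
    // previous value, so the CPU can overlap the multiply-add latencies.
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    // Combine the partial sums once, after the main loop.
    double r = (double(s0) + s1) + (double(s2) + s3);

    // Scalar tail for the elements that do not fill a full group of four.
    for (; i < n; ++i)
        r += double(a[i]) * b[i];

    return r;
}
```

Splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ in the last bits from a single-accumulator loop.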

@ChipKerchner changed the base branch from master to 3.4 on July 24, 2019 18:58
@alalek
Member

alalek commented Jul 24, 2019

@terfendail Could you collect performance changes on IA?

vx_load(src2 + j + (cWidth * 3)), v_sum3);
}

r += v_reduce_sum(v_sum1) + v_reduce_sum(v_sum2) + v_reduce_sum(v_sum3);
Contributor

v_reduce_sum is not really fast, especially if the hardware doesn't have a horizontal-sum instruction.
It's better to reduce the number of calls to v_reduce_sum:

r += v_reduce_sum(v_sum1 + v_sum2 + v_sum3)

Member

I believe we can eliminate v_reduce_sum() call from here completely:

v_sum += v_sum1 + v_sum2 + v_sum3;
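The two suggestions above can be compared in a small self-contained sketch. It does not use OpenCV: `Vec`, `add`, `reduce_sum`, and the lane count of four are hypothetical stand-ins for a SIMD register, a lane-wise add, and `v_reduce_sum`, assumed here purely for illustration.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model of a 4-lane SIMD register (lane count is an assumption).
constexpr std::size_t kLanes = 4;
using Vec = std::array<float, kLanes>;

// Lane-wise addition: a single cheap vector instruction on real hardware.
Vec add(const Vec& x, const Vec& y)
{
    Vec r{};
    for (std::size_t i = 0; i < kLanes; ++i)
        r[i] = x[i] + y[i];
    return r;
}

// Horizontal sum across lanes: the comparatively expensive step.
float reduce_sum(const Vec& v)
{
    float s = 0.f;
    for (float lane : v)
        s += lane;
    return s;
}

// Original form: three horizontal reductions.
float reduce_each(const Vec& s1, const Vec& s2, const Vec& s3)
{
    return reduce_sum(s1) + reduce_sum(s2) + reduce_sum(s3);
}

// First suggestion: two lane-wise adds, then one horizontal reduction.
float reduce_once(const Vec& s1, const Vec& s2, const Vec& s3)
{
    return reduce_sum(add(add(s1, s2), s3));
}

// Second suggestion: fold the partial sums into a running vector accumulator
// and leave the horizontal reduction to whoever consumes that accumulator.
void fold_into(Vec& total, const Vec& s1, const Vec& s2, const Vec& s3)
{
    total = add(total, add(add(s1, s2), s3));
}
```

All three forms add the same lanes, but in a different order, so with floating-point data the results may differ slightly in rounding.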

@alalek merged commit 0db4fb1 into opencv:3.4 on Jul 25, 2019
@alalek mentioned this pull request on Jul 25, 2019
@terfendail
Contributor

[58/1943] Performance for SSE2 baseline

| Performance test | Reference time | PR time | Speedup |
| --- | --- | --- | --- |
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.00 |
| dot::MatType_Length::(8UC1, 64) | 0.000 | 0.000 | 1.00 |
| dot::MatType_Length::(8UC1, 128) | 0.001 | 0.001 | 1.00 |
| dot::MatType_Length::(8UC1, 256) | 0.004 | 0.004 | 1.01 |
| dot::MatType_Length::(8UC1, 512) | 0.018 | 0.018 | 0.97 |
| dot::MatType_Length::(8UC1, 1024) | 0.082 | 0.083 | 0.98 |
| dot::MatType_Length::(32SC1, 32) | 0.001 | 0.001 | 1.03 |
| dot::MatType_Length::(32SC1, 64) | 0.002 | 0.002 | 1.00 |
| dot::MatType_Length::(32SC1, 128) | 0.008 | 0.008 | 1.00 |
| dot::MatType_Length::(32SC1, 256) | 0.033 | 0.033 | 1.00 |
| dot::MatType_Length::(32SC1, 512) | 0.129 | 0.129 | 1.00 |
| dot::MatType_Length::(32SC1, 1024) | 0.535 | 0.536 | 1.00 |
| dot::MatType_Length::(32FC1, 32) | 0.000 | 0.000 | 2.21 |
| dot::MatType_Length::(32FC1, 64) | 0.001 | 0.001 | 1.92 |
| dot::MatType_Length::(32FC1, 128) | 0.004 | 0.002 | 1.80 |
| dot::MatType_Length::(32FC1, 256) | 0.016 | 0.009 | 1.78 |
| dot::MatType_Length::(32FC1, 512) | 0.076 | 0.077 | 0.98 |
| dot::MatType_Length::(32FC1, 1024) | 0.322 | 0.332 | 0.97 |

[34/1943] Performance for SSE3 baseline

| Performance test | Reference time | PR time | Speedup |
| --- | --- | --- | --- |
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 0.93 |
| dot::MatType_Length::(8UC1, 64) | 0.000 | 0.000 | 1.13 |
| dot::MatType_Length::(8UC1, 128) | 0.001 | 0.001 | 1.03 |
| dot::MatType_Length::(8UC1, 256) | 0.004 | 0.004 | 1.02 |
| dot::MatType_Length::(8UC1, 512) | 0.018 | 0.017 | 1.03 |
| dot::MatType_Length::(8UC1, 1024) | 0.083 | 0.083 | 1.00 |
| dot::MatType_Length::(32SC1, 32) | 0.001 | 0.001 | 0.99 |
| dot::MatType_Length::(32SC1, 64) | 0.002 | 0.002 | 1.02 |
| dot::MatType_Length::(32SC1, 128) | 0.008 | 0.008 | 1.03 |
| dot::MatType_Length::(32SC1, 256) | 0.033 | 0.032 | 1.03 |
| dot::MatType_Length::(32SC1, 512) | 0.129 | 0.129 | 1.00 |
| dot::MatType_Length::(32SC1, 1024) | 0.536 | 0.511 | 1.05 |
| dot::MatType_Length::(32FC1, 32) | 0.000 | 0.000 | 1.78 |
| dot::MatType_Length::(32FC1, 64) | 0.001 | 0.001 | 1.92 |
| dot::MatType_Length::(32FC1, 128) | 0.004 | 0.002 | 1.85 |
| dot::MatType_Length::(32FC1, 256) | 0.016 | 0.009 | 1.84 |
| dot::MatType_Length::(32FC1, 512) | 0.078 | 0.077 | 1.01 |
| dot::MatType_Length::(32FC1, 1024) | 0.329 | 0.330 | 1.00 |

[10/1943] Performance for SSE4_2 baseline

| Performance test | Reference time | PR time | Speedup |
| --- | --- | --- | --- |
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.02 |
| dot::MatType_Length::(8UC1, 64) | 0.000 | 0.000 | 1.00 |
| dot::MatType_Length::(8UC1, 128) | 0.001 | 0.001 | 1.00 |
| dot::MatType_Length::(8UC1, 256) | 0.004 | 0.004 | 1.00 |
| dot::MatType_Length::(8UC1, 512) | 0.017 | 0.017 | 1.00 |
| dot::MatType_Length::(8UC1, 1024) | 0.090 | 0.081 | 1.11 |
| dot::MatType_Length::(32SC1, 32) | 0.001 | 0.001 | 1.00 |
| dot::MatType_Length::(32SC1, 64) | 0.002 | 0.002 | 1.00 |
| dot::MatType_Length::(32SC1, 128) | 0.008 | 0.008 | 1.00 |
| dot::MatType_Length::(32SC1, 256) | 0.033 | 0.033 | 1.00 |
| dot::MatType_Length::(32SC1, 512) | 0.129 | 0.130 | 1.00 |
| dot::MatType_Length::(32SC1, 1024) | 0.541 | 0.535 | 1.01 |
| dot::MatType_Length::(32FC1, 32) | 0.000 | 0.000 | 2.21 |
| dot::MatType_Length::(32FC1, 64) | 0.001 | 0.001 | 1.95 |
| dot::MatType_Length::(32FC1, 128) | 0.004 | 0.002 | 1.79 |
| dot::MatType_Length::(32FC1, 256) | 0.016 | 0.009 | 1.79 |
| dot::MatType_Length::(32FC1, 512) | 0.075 | 0.073 | 1.03 |
| dot::MatType_Length::(32FC1, 1024) | 0.340 | 0.331 | 1.03 |

Performance for AVX2 baseline

| Performance test | Reference time | PR time | Speedup |
| --- | --- | --- | --- |
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.00 |
| dot::MatType_Length::(8UC1, 64) | 0.000 | 0.000 | 1.00 |
| dot::MatType_Length::(8UC1, 128) | 0.001 | 0.001 | 1.00 |
| dot::MatType_Length::(8UC1, 256) | 0.003 | 0.003 | 1.00 |
| dot::MatType_Length::(8UC1, 512) | 0.013 | 0.013 | 1.00 |
| dot::MatType_Length::(8UC1, 1024) | 0.078 | 0.078 | 1.01 |
| dot::MatType_Length::(32SC1, 32) | 0.000 | 0.000 | 1.02 |
| dot::MatType_Length::(32SC1, 64) | 0.002 | 0.002 | 1.02 |
| dot::MatType_Length::(32SC1, 128) | 0.007 | 0.007 | 1.01 |
| dot::MatType_Length::(32SC1, 256) | 0.027 | 0.027 | 1.00 |
| dot::MatType_Length::(32SC1, 512) | 0.109 | 0.108 | 1.01 |
| dot::MatType_Length::(32SC1, 1024) | 0.448 | 0.453 | 0.99 |
| dot::MatType_Length::(32FC1, 32) | 0.000 | 0.000 | 1.36 |
| dot::MatType_Length::(32FC1, 64) | 0.001 | 0.001 | 1.18 |
| dot::MatType_Length::(32FC1, 128) | 0.002 | 0.002 | 1.18 |
| dot::MatType_Length::(32FC1, 256) | 0.009 | 0.008 | 1.22 |
| dot::MatType_Length::(32FC1, 512) | 0.078 | 0.079 | 0.99 |
| dot::MatType_Length::(32FC1, 1024) | 0.326 | 0.323 | 1.01 |

dvd42 pushed a commit to dvd42/opencv that referenced this pull request Aug 6, 2019
* Unroll multiply and add instructions in dotProd_32f - 35% faster.

* Eliminate unnecessary v_reduce_sum instructions.