Unroll multiply and add instructions in dotProd_32f by ChipKerchner · Pull Request #15136 · opencv/opencv

ChipKerchner · 2019-07-24T16:00:38Z

Unroll multiply and add instructions (absorb latencies) into separate accumulators in dotProd_32f - 35% faster.
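The idea of splitting the sum across separate accumulators can be illustrated with a minimal scalar sketch. It is not the code from this PR, which uses OpenCV's universal intrinsics on vector registers; the function name `dot_unrolled` and the unroll factor of four are assumptions chosen for illustration. Each accumulator forms its own dependency chain, so a multiply-add does not have to wait for the previous one to finish.

```cpp
#include <cstddef>

// Hypothetical scalar stand-in for the vectorized loop: four independent
// accumulators, so consecutive multiply-adds do not depend on one another.
float dot_unrolled(const float* a, const float* b, std::size_t n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    // Main loop: process four elements per iteration, one per accumulator.
    for (; i + 4 <= n; i += 4)
    {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Combine the partial sums once, after the loop.
    float r = (s0 + s1) + (s2 + s3);
    // Tail: handle the remaining 0..3 elements.
    for (; i < n; ++i)
        r += a[i] * b[i];
    return r;
}
```

Because the additions happen in a different order than in a single-accumulator loop, floating-point results may differ slightly in the last bits.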

alalek · 2019-07-24T20:43:29Z

@terfendail Could you collect performance changes on IA?

tomoaki0705 · 2019-07-24T21:38:46Z

modules/core/src/matmul.simd.hpp

+                              vx_load(src2 + j + (cWidth * 3)), v_sum3);
+        }
+
+        r += v_reduce_sum(v_sum1) + v_reduce_sum(v_sum2) + v_reduce_sum(v_sum3);


v_reduce_sum is not particularly fast, especially when the hardware lacks a horizontal-sum instruction.
It is better to reduce the number of calls to v_reduce_sum:

r += v_reduce_sum(v_sum1 + v_sum2 + v_sum3);

I believe we can eliminate the v_reduce_sum() call here completely:

v_sum += v_sum1 + v_sum2 + v_sum3;
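The review suggestions above can be sketched without OpenCV by emulating a vector register. In the sketch below, `Vec4`, `add`, `reduce_sum`, and `combine_once` are hypothetical names standing in for a 4-lane float register, the lane-wise `+`, and `v_reduce_sum`; they are not OpenCV APIs. It shows that adding the partial accumulators lane-wise and reducing once gives the same total as reducing each accumulator separately, up to floating-point rounding.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4-lane float vector standing in for a SIMD register.
using Vec4 = std::array<float, 4>;

// Lane-wise addition: a single SIMD add on real hardware.
Vec4 add(const Vec4& x, const Vec4& y)
{
    return {x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]};
}

// Horizontal sum across lanes: the comparatively expensive step.
float reduce_sum(const Vec4& x)
{
    return (x[0] + x[1]) + (x[2] + x[3]);
}

// Combine the partial accumulators lane-wise, then reduce only once.
float combine_once(const Vec4& s1, const Vec4& s2, const Vec4& s3)
{
    return reduce_sum(add(add(s1, s2), s3));
}
```

The second suggestion goes one step further: if the surrounding function already keeps a vector accumulator that is reduced later, the partial sums can be folded into it with lane-wise adds alone, leaving no horizontal reduction at this point.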

terfendail · 2019-08-01T14:05:12Z


Performance for SSE2 baseline

Performance test	Reference time	PR time	Speedup
dot::MatType_Length::(8UC1, 32)	0.000	0.000	1.00
dot::MatType_Length::(8UC1, 64)	0.000	0.000	1.00
dot::MatType_Length::(8UC1, 128)	0.001	0.001	1.00
dot::MatType_Length::(8UC1, 256)	0.004	0.004	1.01
dot::MatType_Length::(8UC1, 512)	0.018	0.018	0.97
dot::MatType_Length::(8UC1, 1024)	0.082	0.083	0.98
dot::MatType_Length::(32SC1, 32)	0.001	0.001	1.03
dot::MatType_Length::(32SC1, 64)	0.002	0.002	1.00
dot::MatType_Length::(32SC1, 128)	0.008	0.008	1.00
dot::MatType_Length::(32SC1, 256)	0.033	0.033	1.00
dot::MatType_Length::(32SC1, 512)	0.129	0.129	1.00
dot::MatType_Length::(32SC1, 1024)	0.535	0.536	1.00
dot::MatType_Length::(32FC1, 32)	0.000	0.000	2.21
dot::MatType_Length::(32FC1, 64)	0.001	0.001	1.92
dot::MatType_Length::(32FC1, 128)	0.004	0.002	1.80
dot::MatType_Length::(32FC1, 256)	0.016	0.009	1.78
dot::MatType_Length::(32FC1, 512)	0.076	0.077	0.98
dot::MatType_Length::(32FC1, 1024)	0.322	0.332	0.97


Performance for SSE3 baseline

Performance test	Reference time	PR time	Speedup
dot::MatType_Length::(8UC1, 32)	0.000	0.000	0.93
dot::MatType_Length::(8UC1, 64)	0.000	0.000	1.13
dot::MatType_Length::(8UC1, 128)	0.001	0.001	1.03
dot::MatType_Length::(8UC1, 256)	0.004	0.004	1.02
dot::MatType_Length::(8UC1, 512)	0.018	0.017	1.03
dot::MatType_Length::(8UC1, 1024)	0.083	0.083	1.00
dot::MatType_Length::(32SC1, 32)	0.001	0.001	0.99
dot::MatType_Length::(32SC1, 64)	0.002	0.002	1.02
dot::MatType_Length::(32SC1, 128)	0.008	0.008	1.03
dot::MatType_Length::(32SC1, 256)	0.033	0.032	1.03
dot::MatType_Length::(32SC1, 512)	0.129	0.129	1.00
dot::MatType_Length::(32SC1, 1024)	0.536	0.511	1.05
dot::MatType_Length::(32FC1, 32)	0.000	0.000	1.78
dot::MatType_Length::(32FC1, 64)	0.001	0.001	1.92
dot::MatType_Length::(32FC1, 128)	0.004	0.002	1.85
dot::MatType_Length::(32FC1, 256)	0.016	0.009	1.84
dot::MatType_Length::(32FC1, 512)	0.078	0.077	1.01
dot::MatType_Length::(32FC1, 1024)	0.329	0.330	1.00


Performance for SSE4_2 baseline

Performance test	Reference time	PR time	Speedup
dot::MatType_Length::(8UC1, 32)	0.000	0.000	1.02
dot::MatType_Length::(8UC1, 64)	0.000	0.000	1.00
dot::MatType_Length::(8UC1, 128)	0.001	0.001	1.00
dot::MatType_Length::(8UC1, 256)	0.004	0.004	1.00
dot::MatType_Length::(8UC1, 512)	0.017	0.017	1.00
dot::MatType_Length::(8UC1, 1024)	0.090	0.081	1.11
dot::MatType_Length::(32SC1, 32)	0.001	0.001	1.00
dot::MatType_Length::(32SC1, 64)	0.002	0.002	1.00
dot::MatType_Length::(32SC1, 128)	0.008	0.008	1.00
dot::MatType_Length::(32SC1, 256)	0.033	0.033	1.00
dot::MatType_Length::(32SC1, 512)	0.129	0.130	1.00
dot::MatType_Length::(32SC1, 1024)	0.541	0.535	1.01
dot::MatType_Length::(32FC1, 32)	0.000	0.000	2.21
dot::MatType_Length::(32FC1, 64)	0.001	0.001	1.95
dot::MatType_Length::(32FC1, 128)	0.004	0.002	1.79
dot::MatType_Length::(32FC1, 256)	0.016	0.009	1.79
dot::MatType_Length::(32FC1, 512)	0.075	0.073	1.03
dot::MatType_Length::(32FC1, 1024)	0.340	0.331	1.03

Performance for AVX2 baseline

Performance test	Reference time	PR time	Speedup
dot::MatType_Length::(8UC1, 32)	0.000	0.000	1.00
dot::MatType_Length::(8UC1, 64)	0.000	0.000	1.00
dot::MatType_Length::(8UC1, 128)	0.001	0.001	1.00
dot::MatType_Length::(8UC1, 256)	0.003	0.003	1.00
dot::MatType_Length::(8UC1, 512)	0.013	0.013	1.00
dot::MatType_Length::(8UC1, 1024)	0.078	0.078	1.01
dot::MatType_Length::(32SC1, 32)	0.000	0.000	1.02
dot::MatType_Length::(32SC1, 64)	0.002	0.002	1.02
dot::MatType_Length::(32SC1, 128)	0.007	0.007	1.01
dot::MatType_Length::(32SC1, 256)	0.027	0.027	1.00
dot::MatType_Length::(32SC1, 512)	0.109	0.108	1.01
dot::MatType_Length::(32SC1, 1024)	0.448	0.453	0.99
dot::MatType_Length::(32FC1, 32)	0.000	0.000	1.36
dot::MatType_Length::(32FC1, 64)	0.001	0.001	1.18
dot::MatType_Length::(32FC1, 128)	0.002	0.002	1.18
dot::MatType_Length::(32FC1, 256)	0.009	0.008	1.22
dot::MatType_Length::(32FC1, 512)	0.078	0.079	0.99
dot::MatType_Length::(32FC1, 1024)	0.326	0.323	1.01

* Unroll multiply and add instructions in dotProd_32f - 35% faster.
* Eliminate unnecessary v_reduce_sum instructions.

Unroll multiply and add instructions in dotProd_32f - 35% faster.

20ebb8e

ChipKerchner changed the base branch from master to 3.4 July 24, 2019 18:58

tomoaki0705 reviewed Jul 24, 2019


Eliminate unnecessary v_reduce_sum instructions.

74cf21a

alalek approved these changes Jul 25, 2019


alalek merged commit 0db4fb1 into opencv:3.4 Jul 25, 2019

alalek mentioned this pull request Jul 25, 2019

Merge 3.4 #15152

Merged

dvd42 pushed a commit to dvd42/opencv that referenced this pull request Aug 6, 2019

Merge pull request opencv#15136 from ChipKerchner:dotProd_unroll

0276ca0

* Unroll multiply and add instructions in dotProd_32f - 35% faster.
* Eliminate unnecessary v_reduce_sum instructions.

alalek merged 2 commits into opencv:3.4 from ChipKerchner:dotProd_unroll
