core: vectorize countNonZero64f by pmur · Pull Request #15685 · opencv/opencv

pmur · 2019-10-10T20:01:53Z

Improves performance: about 2.2x on POWER9 (P9) and 2-3x on Coffee Lake
x86-64.

This pull request changes

Update countNonZero64f to use SIMD as available on the target platform.

force_builders=Linux AVX2,Custom
buildworker:Custom=linux-3
build_image:Custom=ubuntu:18.04
CPU_BASELINE:Custom=AVX512_SKX
disable_ipp=ON

ChipKerchner · 2019-10-11T12:49:56Z

modules/core/src/count_non_zero.simd.hpp

+
+    for(i = 0; i < len0; i += step )
+        {
+        sum1 += v_reinterpret_as_s64(vx_load(&src[i]) == zero);


Can't these loads and comparisons be done with s64 instead? Comparing to zero is the same and should be faster.

alalek · 2019-10-11

Floating-point numbers are tricky.
There are two possible zeros, +0 and -0, so reinterpreting the values as int64/uint64 before comparing with zero is not valid.
Details: https://en.wikipedia.org/wiki/Signed_zero

~~BTW, The same bug is in countNonZero32f() optimization.~~
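
To make the signed-zero point concrete, here is a minimal scalar sketch (not code from this PR). The helper names `bits_of`, `is_zero_fp` and `is_zero_int` are hypothetical and exist only for illustration; the sketch assumes `double` is an IEEE 754 binary64 value.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the raw bit pattern of a double.
static std::uint64_t bits_of(double x)
{
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);  // well-defined way to reinterpret the bits
    return u;
}

// Floating-point zero test: true for both +0.0 and -0.0.
static bool is_zero_fp(double x) { return x == 0.0; }

// Integer zero test on the raw bits: misses -0.0, whose sign bit is set.
static bool is_zero_int(double x) { return bits_of(x) == 0; }
```

With these helpers, `is_zero_fp(-0.0)` is true while `is_zero_int(-0.0)` is false, because -0.0 is stored as `0x8000000000000000`.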

ChipKerchner · 2019-10-11

How about using a 63-bit mask on top of that?

v_zero_mask = 0x7fffffffffffffff;

((vx_load(&src[i]) & v_zero_mask) == zero)

alalek · 2019-10-11

Right!
A mask should work (on known platforms).

pmur · 2019-10-11

This is a subtle IEEE nuance. Assuming a compliant IEEE 754-2008 implementation, comparisons ignore the sign of zero (section 5.11).

alalek · 2019-10-11

Yes, the floating-point comparison works correctly,
but it is not equivalent to the integer-based zero check suggested earlier:

"comparisons be done with s64 instead"

The f32 optimization is fine (v_reinterpret is applied after the comparison).
The check in the current implementation is fine too (because it is a floating-point comparison).
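
The loop under review adds reinterpreted comparison results into `sum1`. A scalar model of that mask-accumulation technique might look like the sketch below. It assumes a true comparison yields an all-ones lane (-1 as a signed 64-bit value); the function name `count_non_zero_64f_model` and the final `len + sum` step are illustrative assumptions, not the PR's actual code.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of mask accumulation: each floating-point comparison with
// zero contributes -1 for a zero element (+0.0 or -0.0) and 0 otherwise.
static int count_non_zero_64f_model(const double* src, std::size_t len)
{
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < len; ++i)
    {
        const std::int64_t mask = (src[i] == 0.0) ? -1 : 0;
        sum += mask;  // after the loop, sum == -(number of zero elements)
    }
    // Non-zero count = total elements minus zero elements.
    return static_cast<int>(static_cast<std::int64_t>(len) + sum);
}
```

Because the comparison is done in floating point before any reinterpretation, both +0.0 and -0.0 are counted as zeros.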

terfendail · 2019-10-18T13:04:32Z

Performance for SSE2 baseline

Performance test	Reference time	PR time	Speedup
countNonZero::Size_MatType::(127x61, 8UC1)	0.000	0.000	0.94
countNonZero::Size_MatType::(127x61, 8SC1)	0.000	0.000	0.97
countNonZero::Size_MatType::(127x61, 16UC1)	0.000	0.000	1.00
countNonZero::Size_MatType::(127x61, 16SC1)	0.000	0.000	1.00
countNonZero::Size_MatType::(127x61, 32SC1)	0.001	0.001	1.01
countNonZero::Size_MatType::(127x61, 32FC1)	0.001	0.001	1.00
countNonZero::Size_MatType::(127x61, 64FC1)	0.004	0.001	2.66
countNonZero::Size_MatType::(640x480, 8UC1)	0.011	0.011	1.01
countNonZero::Size_MatType::(640x480, 8SC1)	0.011	0.011	1.01
countNonZero::Size_MatType::(640x480, 16UC1)	0.015	0.015	1.02
countNonZero::Size_MatType::(640x480, 16SC1)	0.015	0.015	1.00
countNonZero::Size_MatType::(640x480, 32SC1)	0.038	0.035	1.10
countNonZero::Size_MatType::(640x480, 32FC1)	0.038	0.035	1.09
countNonZero::Size_MatType::(640x480, 64FC1)	0.134	0.088	1.52
countNonZero::Size_MatType::(1280x720, 8UC1)	0.034	0.033	1.01
countNonZero::Size_MatType::(1280x720, 8SC1)	0.034	0.033	1.02
countNonZero::Size_MatType::(1280x720, 16UC1)	0.071	0.064	1.11
countNonZero::Size_MatType::(1280x720, 16SC1)	0.072	0.065	1.11
countNonZero::Size_MatType::(1280x720, 32SC1)	0.140	0.133	1.06
countNonZero::Size_MatType::(1280x720, 32FC1)	0.140	0.131	1.07
countNonZero::Size_MatType::(1280x720, 64FC1)	0.405	0.278	1.46
countNonZero::Size_MatType::(1920x1080, 8UC1)	0.089	0.079	1.13
countNonZero::Size_MatType::(1920x1080, 8SC1)	0.089	0.079	1.12
countNonZero::Size_MatType::(1920x1080, 16UC1)	0.160	0.149	1.07
countNonZero::Size_MatType::(1920x1080, 16SC1)	0.160	0.151	1.07
countNonZero::Size_MatType::(1920x1080, 32SC1)	0.333	0.317	1.05
countNonZero::Size_MatType::(1920x1080, 32FC1)	0.337	0.321	1.05
countNonZero::Size_MatType::(1920x1080, 64FC1)	1.240	0.878	1.41

Performance for SSE3 baseline

Performance test	Reference time	PR time	Speedup
countNonZero::Size_MatType::(127x61, 8UC1)	0.000	0.000	1.00
countNonZero::Size_MatType::(127x61, 8SC1)	0.000	0.000	0.98
countNonZero::Size_MatType::(127x61, 16UC1)	0.000	0.000	1.00
countNonZero::Size_MatType::(127x61, 16SC1)	0.000	0.000	1.00
countNonZero::Size_MatType::(127x61, 32SC1)	0.001	0.001	1.02
countNonZero::Size_MatType::(127x61, 32FC1)	0.001	0.001	0.98
countNonZero::Size_MatType::(127x61, 64FC1)	0.003	0.001	2.63
countNonZero::Size_MatType::(640x480, 8UC1)	0.012	0.012	1.01
countNonZero::Size_MatType::(640x480, 8SC1)	0.012	0.012	1.00
countNonZero::Size_MatType::(640x480, 16UC1)	0.015	0.015	0.98
countNonZero::Size_MatType::(640x480, 16SC1)	0.015	0.015	1.00
countNonZero::Size_MatType::(640x480, 32SC1)	0.034	0.038	0.88
countNonZero::Size_MatType::(640x480, 32FC1)	0.032	0.038	0.84
countNonZero::Size_MatType::(640x480, 64FC1)	0.130	0.094	1.39
countNonZero::Size_MatType::(1280x720, 8UC1)	0.037	0.037	1.00
countNonZero::Size_MatType::(1280x720, 8SC1)	0.037	0.037	0.99
countNonZero::Size_MatType::(1280x720, 16UC1)	0.067	0.070	0.95
countNonZero::Size_MatType::(1280x720, 16SC1)	0.067	0.070	0.95
countNonZero::Size_MatType::(1280x720, 32SC1)	0.135	0.142	0.95
countNonZero::Size_MatType::(1280x720, 32FC1)	0.136	0.141	0.96
countNonZero::Size_MatType::(1280x720, 64FC1)	0.404	0.299	1.35
countNonZero::Size_MatType::(1920x1080, 8UC1)	0.083	0.098	0.84
countNonZero::Size_MatType::(1920x1080, 8SC1)	0.085	0.098	0.86
countNonZero::Size_MatType::(1920x1080, 16UC1)	0.156	0.164	0.95
countNonZero::Size_MatType::(1920x1080, 16SC1)	0.154	0.163	0.94
countNonZero::Size_MatType::(1920x1080, 32SC1)	0.328	0.349	0.94
countNonZero::Size_MatType::(1920x1080, 32FC1)	0.325	0.348	0.93
countNonZero::Size_MatType::(1920x1080, 64FC1)	1.254	0.931	1.35

Performance for SSE4_2 baseline

Performance test	Reference time	PR time	Speedup
countNonZero::Size_MatType::(127x61, 8UC1)	0.000	0.000	0.97
countNonZero::Size_MatType::(127x61, 8SC1)	0.000	0.000	0.99
countNonZero::Size_MatType::(127x61, 16UC1)	0.000	0.000	1.00
countNonZero::Size_MatType::(127x61, 16SC1)	0.000	0.000	1.00
countNonZero::Size_MatType::(127x61, 32SC1)	0.001	0.001	1.02
countNonZero::Size_MatType::(127x61, 32FC1)	0.001	0.001	1.00
countNonZero::Size_MatType::(127x61, 64FC1)	0.002	0.001	1.87
countNonZero::Size_MatType::(640x480, 8UC1)	0.012	0.012	1.05
countNonZero::Size_MatType::(640x480, 8SC1)	0.011	0.012	0.95
countNonZero::Size_MatType::(640x480, 16UC1)	0.014	0.015	0.95
countNonZero::Size_MatType::(640x480, 16SC1)	0.015	0.014	1.03
countNonZero::Size_MatType::(640x480, 32SC1)	0.033	0.033	1.00
countNonZero::Size_MatType::(640x480, 32FC1)	0.033	0.034	0.97
countNonZero::Size_MatType::(640x480, 64FC1)	0.100	0.088	1.13
countNonZero::Size_MatType::(1280x720, 8UC1)	0.038	0.038	1.00
countNonZero::Size_MatType::(1280x720, 8SC1)	0.037	0.038	0.98
countNonZero::Size_MatType::(1280x720, 16UC1)	0.066	0.065	1.02
countNonZero::Size_MatType::(1280x720, 16SC1)	0.066	0.065	1.02
countNonZero::Size_MatType::(1280x720, 32SC1)	0.130	0.133	0.97
countNonZero::Size_MatType::(1280x720, 32FC1)	0.130	0.133	0.98
countNonZero::Size_MatType::(1280x720, 64FC1)	0.306	0.278	1.10
countNonZero::Size_MatType::(1920x1080, 8UC1)	0.086	0.086	1.00
countNonZero::Size_MatType::(1920x1080, 8SC1)	0.084	0.085	0.98
countNonZero::Size_MatType::(1920x1080, 16UC1)	0.150	0.149	1.00
countNonZero::Size_MatType::(1920x1080, 16SC1)	0.149	0.149	1.00
countNonZero::Size_MatType::(1920x1080, 32SC1)	0.316	0.318	1.00
countNonZero::Size_MatType::(1920x1080, 32FC1)	0.317	0.317	1.00
countNonZero::Size_MatType::(1920x1080, 64FC1)	1.054	0.880	1.20

Performance for AVX2 baseline

Performance test	Reference time	PR time	Speedup
countNonZero::Size_MatType::(127x61, 8UC1)	0.000	0.000	0.99
countNonZero::Size_MatType::(127x61, 8SC1)	0.000	0.000	1.02
countNonZero::Size_MatType::(127x61, 16UC1)	0.000	0.000	1.03
countNonZero::Size_MatType::(127x61, 16SC1)	0.000	0.000	1.03
countNonZero::Size_MatType::(127x61, 32SC1)	0.001	0.001	1.00
countNonZero::Size_MatType::(127x61, 32FC1)	0.001	0.001	1.02
countNonZero::Size_MatType::(127x61, 64FC1)	0.003	0.001	3.08
countNonZero::Size_MatType::(640x480, 8UC1)	0.007	0.007	1.01
countNonZero::Size_MatType::(640x480, 8SC1)	0.007	0.007	1.01
countNonZero::Size_MatType::(640x480, 16UC1)	0.012	0.011	1.04
countNonZero::Size_MatType::(640x480, 16SC1)	0.012	0.011	1.02
countNonZero::Size_MatType::(640x480, 32SC1)	0.033	0.036	0.91
countNonZero::Size_MatType::(640x480, 32FC1)	0.033	0.037	0.89
countNonZero::Size_MatType::(640x480, 64FC1)	0.123	0.091	1.35
countNonZero::Size_MatType::(1280x720, 8UC1)	0.022	0.022	0.99
countNonZero::Size_MatType::(1280x720, 8SC1)	0.022	0.022	0.99
countNonZero::Size_MatType::(1280x720, 16UC1)	0.066	0.069	0.95
countNonZero::Size_MatType::(1280x720, 16SC1)	0.065	0.070	0.94
countNonZero::Size_MatType::(1280x720, 32SC1)	0.131	0.138	0.95
countNonZero::Size_MatType::(1280x720, 32FC1)	0.132	0.141	0.94
countNonZero::Size_MatType::(1280x720, 64FC1)	0.384	0.289	1.33
countNonZero::Size_MatType::(1920x1080, 8UC1)	0.074	0.079	0.93
countNonZero::Size_MatType::(1920x1080, 8SC1)	0.073	0.079	0.93
countNonZero::Size_MatType::(1920x1080, 16UC1)	0.148	0.160	0.93
countNonZero::Size_MatType::(1920x1080, 16SC1)	0.148	0.158	0.93
countNonZero::Size_MatType::(1920x1080, 32SC1)	0.317	0.339	0.94
countNonZero::Size_MatType::(1920x1080, 32FC1)	0.322	0.340	0.95
countNonZero::Size_MatType::(1920x1080, 64FC1)	1.199	0.908	1.32

ChipKerchner reviewed Oct 11, 2019

core: vectorize countNonZero64f

ec91a3d

Improves performance a bit. 2.2x on P9 and 2 - 3x on coffee lake x86-64.

pmur force-pushed the cnz64f-simd branch from 0a0b550 to ec91a3d Compare October 11, 2019 14:03

terfendail approved these changes Oct 18, 2019

alalek assigned terfendail Oct 18, 2019

opencv-pushbot pushed a commit that referenced this pull request Oct 18, 2019

Merge pull request #15685 from pmur:cnz64f-simd

938d8dc

opencv-pushbot merged commit ec91a3d into opencv:3.4 Oct 18, 2019

alalek mentioned this pull request Oct 24, 2019

Merge 3.4 #15771

Merged
