Vectorize calculating integral for line for single and multiple channels #16556
alalek merged 6 commits into opencv:3.4
| prev = vx_setall_f64(v_extract_n<v_float64::nlanes - 1>(el4hh)); | ||
| // prev = v_broadcast_element<v_float64::nlanes - 1>(el4hh); |
Why was v_broadcast_element() removed?
v_broadcast_element for v_float64 is not available on all platforms. I left the commented-out call in for when support is added.
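To illustrate why the last lane is extracted and broadcast here, the following is a minimal sketch in plain C++ with no OpenCV dependency. It assumes a hypothetical 4-lane register modeled by `std::array`; the names `kLanes`, `Vec`, `broadcast_last` and `row_prefix_sum` are illustrative and do not come from the PR. It only computes a running sum over one row, not the full integral image.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;        // hypothetical lane count, not tied to any ISA
using Vec = std::array<double, kLanes>;  // stands in for a v_float64 register

// Models vx_setall_f64(v_extract_n<nlanes - 1>(v)): copy the last lane to every lane.
Vec broadcast_last(const Vec& v) {
    Vec out;
    out.fill(v[kLanes - 1]);
    return out;
}

// Inclusive running sum of one row, kLanes elements per iteration.
std::vector<double> row_prefix_sum(const std::vector<double>& src) {
    std::vector<double> dst(src.size());
    Vec prev{};  // running total of everything before the current block
    std::size_t j = 0;
    for (; j + kLanes <= src.size(); j += kLanes) {
        Vec v;
        for (std::size_t i = 0; i < kLanes; ++i) v[i] = src[j + i];
        // In-register scan: add copies of v shifted by 1, 2, 4, ... lanes.
        for (std::size_t shift = 1; shift < kLanes; shift *= 2) {
            Vec shifted{};
            for (std::size_t i = shift; i < kLanes; ++i) shifted[i] = v[i - shift];
            for (std::size_t i = 0; i < kLanes; ++i) v[i] += shifted[i];
        }
        for (std::size_t i = 0; i < kLanes; ++i) {
            v[i] += prev[i];
            dst[j + i] = v[i];
        }
        prev = broadcast_last(v);  // carry the total into the next block
    }
    // Scalar tail for the elements that do not fill a whole block.
    double carry = prev[0];
    for (; j < src.size(); ++j) {
        carry += src[j];
        dst[j] = carry;
    }
    return dst;
}
```

After each block, the last lane holds the running total, so broadcasting it gives the value every lane of the next block needs to add.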
| } | ||
| }; | ||
|
|
||
| #if CV_SIMD128_64F && !CV_AVX512_SKX |
Why is CV_AVX512_SKX excluded?
Do we want CV_SIMD_WIDTH <= 32 here instead?
There is already an AVX512 version for doubles; see the code above.
| v_int32 prev_1 = vx_setzero_s32(), prev_2 = vx_setzero_s32(), | ||
| prev_3 = vx_setzero_s32(), prev_4 = vx_setzero_s32(); | ||
| int j = 0; | ||
| for ( ; j + v_uint16::nlanes * cn <= width; j += v_uint16::nlanes * cn) |
The code looks over-complicated to me. IMO it would be better to process one vector at a time and reduce the number of shifts and additions, starting with the addition of element quads.
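One way to read this suggestion for 4-channel input is sketched below in plain C++ with no OpenCV dependency. It assumes a 128-bit register: 16 input bytes widen into four 4-lane int32 registers, one per pixel, so the per-channel running sum needs only a chain of vector additions and no lane shifts. The names `Quad`, `add` and `row_prefix_sum_4ch` are hypothetical, and the input length is assumed to be a multiple of 4.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Quad = std::array<std::int32_t, 4>;  // models one 4-lane int32 register (one pixel)

// Lane-wise addition, standing in for a single vector add.
Quad add(const Quad& a, const Quad& b) {
    Quad r;
    for (std::size_t c = 0; c < 4; ++c) r[c] = a[c] + b[c];
    return r;
}

// Per-channel inclusive running sum along a row of interleaved 4-channel bytes.
std::vector<std::int32_t> row_prefix_sum_4ch(const std::vector<std::uint8_t>& src) {
    const std::size_t cn = 4;     // channels per pixel
    const std::size_t step = 16;  // bytes per loaded vector (4 pixels)
    std::vector<std::int32_t> dst(src.size());
    Quad prev{};  // running per-channel totals
    std::size_t j = 0;
    for (; j + step <= src.size(); j += step) {
        for (std::size_t p = 0; p < 4; ++p) {
            Quad px;  // widened pixel p of the current vector
            for (std::size_t c = 0; c < cn; ++c) px[c] = src[j + p * cn + c];
            prev = add(prev, px);  // quad addition, no shifts needed
            for (std::size_t c = 0; c < cn; ++c) dst[j + p * cn + c] = prev[c];
        }
    }
    // Scalar tail: each element adds the previous total of the same channel.
    for (; j < src.size(); ++j)
        dst[j] = src[j] + (j >= cn ? dst[j - cn] : 0);
    return dst;
}
```

On wider registers, several pixels share one register, so some shifting by one pixel would still be required; this sketch does not model that case.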
|
I've collected performance numbers for the existing change on my setup:
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Performance for AVX512 baseline
|
|
I've tested single-vector processing for 4-channel input to 32S. Performance is a bit better on my setup:
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Performance for AVX512 baseline
|
|
It looks like the new way to vectorize 8UC1 to 64FC1 works better than the existing AVX512 implementation.
Performance for AVX512 baseline
|
Good find! I've implemented 8UC4->32SC4 and 8UC4->32FC4 so far and am seeing an additional 25-30% improvement. Let me know your ideas for 8UC1->64FC1 or if you'd just like to update with your ideas for the AVX512 version. I don't really have a way to test AVX512 currently. |
|
Regarding AVX512, I meant that I tested the generic version, which is currently disabled for AVX512, instead of the specialized one.
|
I committed the changes for single-vector processing of 4-channel input (8UC4->32SC4/32FC4/64FC4). I will look at similar changes for 2 channels when I have time (early testing shows the speed to be similar to my version). If the 64FC1 and/or 64FC4 changes are faster than the AVX512 version, I will try to activate this version instead. Please make sure the AVX512 code (CV_SIMD_WIDTH > 32) is correct. Also, if you can rerun the timings including AVX512, that would be useful. @terfendail, I think this smoke test is failing because of AVX512 (please suggest a fix since it is your code) -
|
Sorry, that was my fault: I missed the fact that v_zip interleaves channels. Performance for this version is almost the same.
Performance for AVX512 baseline
|
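For readers unfamiliar with the interleaving mentioned above, here is a minimal plain C++ model of a zip operation on two 4-lane registers. It follows the behaviour documented for 128-bit `v_zip` (the low halves of both inputs are interleaved into the first output, the high halves into the second). The names `Lanes`, `ZipResult` and `zip` are hypothetical and exist only for illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kN = 4;       // hypothetical lane count
using Lanes = std::array<int, kN>;  // stands in for one SIMD register

struct ZipResult {
    Lanes lo;  // interleaved low halves:  a0[0], a1[0], a0[1], a1[1]
    Lanes hi;  // interleaved high halves: a0[2], a1[2], a0[3], a1[3]
};

// Interleave two registers element by element.
ZipResult zip(const Lanes& a0, const Lanes& a1) {
    ZipResult r{};
    for (std::size_t i = 0; i < kN / 2; ++i) {
        r.lo[2 * i] = a0[i];
        r.lo[2 * i + 1] = a1[i];
        r.hi[2 * i] = a0[i + kN / 2];
        r.hi[2 * i + 1] = a1[i + kN / 2];
    }
    return r;
}
```

If `a0` and `a1` each hold interleaved channel data, the outputs mix elements from different pixels, which is easy to overlook when zip is used purely as a widening step.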
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Performance for AVX512 baseline
It looks like there is a small performance degradation for 8UC3->32S on SSE2 and SSE3.
I'll have to think a little more about whether there is a better way to do 8UC3->32S. For non-Intel platforms, this algorithm is much better than the scalar version. Could you measure the performance of my version of 8UC[1-4]->64F versus the current (old) version for AVX512? I want to know if it is worth calling the current old version at all.
|
Performance is better for 8UC1->64F and almost the same for 8UC[2-4]. (I manually disabled the existing AVX512 code dispatching and enabled the new code for the AVX512 platform as well.)
Performance for AVX512 baseline
|
|
What would be the best way to enable my 8UC1->64F for AVX512 but use the old code for 8UC[2-4]->64F? |
| double * sqsum, size_t, | ||
| double * tilted, size_t, | ||
| int width, int height, int cn) const | ||
| { |
I think the call to the specific AVX512 implementation could be moved to the beginning of this implementation, with a proper check for the requested mode:
#if CV_AVX512_SKX
    if (!tilted && cn <= 4 && (cn > 1 || sqsum))
    {
        calculate_integral_avx512(src, _srcstep, sum, _sumstep, sqsum, _sqsumstep, width, height, cn);
        return true;
    }
#endif
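To make the effect of that condition explicit, the predicate can be isolated as in the sketch below. The helper name `use_avx512_specialization` and its boolean parameters are hypothetical (the suggestion itself tests the `tilted` and `sqsum` pointers directly). The only case with `cn <= 4` and no tilted output that it rejects is single-channel input without a squared sum, which would then fall through to the generic vectorized code.

```cpp
// Hypothetical helper mirroring the suggested condition:
// has_tilted / has_sqsum say whether the tilted / sqsum outputs were requested.
bool use_avx512_specialization(bool has_tilted, bool has_sqsum, int cn) {
    return !has_tilted && cn <= 4 && (cn > 1 || has_sqsum);
}
```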
…sion of 8UC1 to 64F for AVX512.
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Performance for AVX512 baseline
|
|
OOB access issue: #16708 |
Vectorize calculating integral for line for single and multiple channels - up to 2.75x faster.