GaussianBlur() fails with SIMD and fill image with 0 #20666
Description
System information (version)
- OpenCV => 4.5.3
- Operating System / Platform => Ubuntu 18.04
- Compiler => gcc 7.5.0
Detailed description
Problem:
After I compiled OpenCV with CUDA support, I ran the unit tests to check my compiled binaries.
I found that 'opencv_test_cudafilters' had failed.
Running this test binary on its own reported more details: the failures are in 'CUDA_Filters/GaussianBlur.Accuracy'.
The 73 failed test cases share a pattern:
- only U16 test cases fail
- the output image has a valid border, but its inner part is filled with 0
- only kernels with very low sigma values fail; with such a sigma, the kernel matrix is 0 everywhere except at its center
Debugging the OpenCV code with gdb and cuda-gdb showed that:
- the CUDA GaussianBlur filter is validated by comparing its output with the output of the CPU GaussianBlur
- the allowed deviation is 4 (integer) for every pixel
- surprisingly, the problem is not in the CUDA output; it is the CPU GaussianBlur output that is sometimes empty (all 0) or has values only in some border pixels
- the error occurs when the blur sigma is low, so the Gaussian kernel has a single nonzero value at its center
- the kernel matrix calculation and its row/column vector calculations appear to be correct
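To see why a low sigma produces such a kernel, the following Python sketch builds a normalized 1-D Gaussian kernel for ksize=7 and sigma=0.1 and converts it to raw unsigned 16.16 fixed point, where 1.0 is stored as 65536. It assumes the U16 path keeps its weights as `ufixedpoint32` with 16 fractional bits; the round-to-nearest conversion is a simplification, not OpenCV's exact bit-exact kernel computation, so only the shape of the result should be taken from it.

```python
import math


def gaussian_kernel(ksize, sigma):
    """Normalized 1-D Gaussian kernel in floating point."""
    half = (ksize - 1) / 2
    weights = [math.exp(-((i - half) ** 2) / (2 * sigma * sigma)) for i in range(ksize)]
    total = sum(weights)
    return [w / total for w in weights]


def to_fixed_16_16(kernel):
    """Convert weights to raw unsigned 16.16 fixed point (1.0 -> 65536).

    Simplified round-to-nearest; an assumption, not OpenCV's algorithm.
    """
    return [round(w * (1 << 16)) for w in kernel]


raw = to_fixed_16_16(gaussian_kernel(7, 0.1))
print(raw)  # [0, 0, 0, 65536, 0, 0, 0]
```

The off-center weights (at most exp(-50), about 2e-22) vanish after rounding, so the whole weight of 1.0 lands on the center tap as the raw value 65536.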
Continuing the debugging, I found that the problem is in the function 'hlineSmoothONa_yzy_a()' at smooth.simd.hpp(1201).
The border calculation is correct, but the inner part of the image goes through a SIMD-accelerated path. When I commented out this SIMD block, the test passed. I then found that the problem is this line:
v_mul_expand(vx_load(src + pre_shift * cn), vx_setall_u16((uint16_t) *((uint32_t*)(m + pre_shift))), v_res0, v_res1);
(unfortunately, this line was modified in the pull request that narrowed the processing words from 32 bits to 16 bits: 6b75e4d by @terfendail)
For U16 images, these special kernels reach this function as vectors like this:
[0,0,0,65536,0,0,0] (e.g. for a 7x7 kernel)
Here 'm + pre_shift' points to the middle of the 'm' vector, whose value is 65536 (0x10000). The cast (uint16_t) *((uint32_t*)(m + pre_shift)) keeps only the low 16 bits of that value, so it returns 0, which sets all the inner cells of the output matrix to 0.
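The truncation can be reproduced in a few lines of Python. This is a scalar model, not OpenCV code: `cast_to_uint16` emulates the C `(uint16_t)` conversion, `mul_expand` stands in for the u16 x u16 -> u32 products of `v_mul_expand`, and `to_pixel` is a hypothetical round-and-shift back from 16.16 fixed point. The pixel values are taken from the `src` output in the reproducer.

```python
ONE_16_16 = 1 << 16  # raw value of 1.0 in unsigned 16.16 fixed point


def cast_to_uint16(raw):
    """Emulate C's (uint16_t) conversion: keep only the low 16 bits."""
    return raw & 0xFFFF


def mul_expand(pixels, weight):
    """Scalar stand-in for v_mul_expand: widen each pixel and multiply."""
    return [p * weight for p in pixels]


def to_pixel(acc):
    """Hypothetical rounding of a 16.16 accumulator back to an integer pixel."""
    return (acc + (1 << 15)) >> 16


pixels = [151, 165, 26, 143]
weight = cast_to_uint16(ONE_16_16)  # 0x10000 loses its only set bit -> 0
buggy = [to_pixel(v) for v in mul_expand(pixels, weight)]
# What the product should be with the full 32-bit weight of 1.0
correct = [to_pixel(v) for v in mul_expand(pixels, ONE_16_16)]
print(weight, buggy, correct)  # 0 [0, 0, 0, 0] [151, 165, 26, 143]
```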
Solution:
My solution keeps the SIMD throughput while handling the corner case that comes from narrowing the kernel value type.
Check for this special kernel case in the SIMD code (in the function 'hlineSmoothONa_yzy_a()' at smooth.simd.hpp(1236)):
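if (*(m + pre_shift) == ufixedpoint32::fromRaw(1 << 16))
{
    // Center weight is exactly 1.0 (raw 65536), which does not fit in uint16_t:
    // multiplying by 1.0 in 16.16 fixed point is a left shift by 16
    v_res0 = vx_load_expand(src + pre_shift * cn) << 16;
    v_res1 = vx_load_expand(src + pre_shift * cn + VECSZ) << 16;
}
else
{
    // Any other center weight is below 1.0, so the 16-bit cast is lossless
    v_mul_expand(vx_load(src + pre_shift * cn), vx_setall_u16((uint16_t) *((uint32_t*)(m + pre_shift))), v_res0, v_res1);
}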
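The proposed branch can be modeled in scalar Python to check that it preserves the pixels. `center_tap` and `to_pixel` are hypothetical helpers, not OpenCV functions. The model assumes the raw weights are non-negative and sum to exactly 65536; under that assumption the center weight can equal 65536 only when every other tap is 0, so testing for that one value is enough and every other weight fits in 16 bits.

```python
ONE_16_16 = 1 << 16  # raw value of 1.0 in unsigned 16.16 fixed point


def center_tap(pixels, weight_raw):
    """Hypothetical scalar model of the patched center-tap branch.

    Returns the 32-bit products that the SIMD code keeps in v_res0/v_res1.
    """
    if weight_raw == ONE_16_16:
        # Special case: multiplying by 1.0 is a left shift by 16,
        # like vx_load_expand(...) << 16 in the patch.
        return [p << 16 for p in pixels]
    if weight_raw > 0xFFFF:
        raise ValueError("weight does not fit in 16 bits")
    # General case: the weight fits in uint16_t, so u16 x u16 -> u32 is exact.
    return [p * weight_raw for p in pixels]


def to_pixel(acc):
    """Hypothetical rounding of a 16.16 accumulator back to an integer pixel."""
    return (acc + (1 << 15)) >> 16


pixels = [151, 165, 26, 143]
print([to_pixel(v) for v in center_tap(pixels, ONE_16_16)])  # [151, 165, 26, 143]
```

The shift gives the same 32-bit result as a full multiply by 65536, so the special case costs no precision and keeps the vectorized path.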
Steps to reproduce
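import cv2
import numpy as np

# Random 16-bit test image with values in [0, 255)
src = np.random.randint(low=255, size=(128, 128), dtype=np.uint16)
# With sigma=0.1 the 7x7 kernel collapses to a single center tap
dst = cv2.GaussianBlur(src, (7, 7), sigmaX=0.1, sigmaY=0.1, borderType=cv2.BORDER_DEFAULT)
print("src:", src[1:10, 1:10])
print("dst:", dst[1:10, 1:10])
Output: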
src: [[151 165 26 143 85 57 227 186 159]
[245 96 169 158 80 82 98 60 2]
[ 46 154 165 240 148 250 194 206 242]
[214 174 178 81 140 76 73 88 106]
[234 129 178 63 70 49 35 79 61]
[115 128 226 169 79 224 112 73 136]
[ 26 38 100 145 5 69 96 180 202]
[116 145 144 111 221 97 209 75 109]
[188 34 224 136 33 184 226 81 120]]
dst: [[151 165 0 0 0 0 0 0 0]
[245 96 0 0 0 0 0 0 0]
[ 46 154 0 0 0 0 0 0 0]
[214 174 0 0 0 0 0 0 0]
[234 129 0 0 0 0 0 0 0]
[115 128 0 0 0 0 0 0 0]
[ 26 38 0 0 0 0 0 0 0]
[116 145 0 0 0 0 0 0 0]
[188 34 0 0 0 0 0 0 0]]
Issue submission checklist
- I report the issue, it's not a question
- I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found solution
- I updated to latest OpenCV version and the issue is still there
- There is reproducer code and related data files: videos, images, onnx, etc