GaussianBlur() fails with SIMD and fills the image with 0

##### System information (version)

- OpenCV => 4.5.3
- Operating System / Platform => Ubuntu 18.04
- Compiler => gcc 7.5.0

##### Detailed description

Problem:
------------
After compiling OpenCV with CUDA support, I ran the unit tests to check the compiled binaries.
I found that 'opencv_test_cudafilters' failed.
Running this test binary on its own gave more detail: the failures are in 'CUDA_Filters/GaussianBlur.Accuracy'.
The 73 failed test cases share a pattern:
- only U16 test cases fail
- the output image has a border, but the interior is filled with 0
- it happens only for kernels with very low sigma values, where the kernel matrix is 0 everywhere except at its center

Debugging the OpenCV code with gdb and cuda-gdb showed the following:
   - the test evaluates the CUDA GaussianBlur filter by comparing its output with the output of the CPU GaussianBlur
   - the allowed deviation is (int) 4 for every image pixel
   - surprisingly, the problem is not in the CUDA output; it is the CPU GaussianBlur output that is sometimes empty or has values only in some border pixels
   - the error occurs when the blur sigma is low, so that the Gaussian kernel has a single nonzero value at its center
   - the calculation of the kernel matrix and of its row/column vectors seems to be OK
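
To illustrate why a very low sigma collapses the kernel to a single value, here is a minimal sketch in plain Python. It uses ordinary floating-point arithmetic and simple rounding to 16 fractional bits, which is an assumption for illustration and not OpenCV's bit-exact kernel routine; `gaussian_kernel_1d` and `to_fixed_raw` are hypothetical helper names.

```python
import math

def gaussian_kernel_1d(ksize, sigma):
    """Normalized 1-D Gaussian weights (plain floats, not OpenCV's bit-exact code)."""
    center = (ksize - 1) / 2
    weights = [math.exp(-((i - center) ** 2) / (2.0 * sigma * sigma))
               for i in range(ksize)]
    total = sum(weights)
    return [w / total for w in weights]

def to_fixed_raw(weights, frac_bits=16):
    """Round each weight to a fixed-point raw value with `frac_bits` fractional bits."""
    return [int(round(w * (1 << frac_bits))) for w in weights]

# Same kernel size and sigma as the reproducer in this report.
raw = to_fixed_raw(gaussian_kernel_1d(7, 0.1))
print(raw)  # [0, 0, 0, 65536, 0, 0, 0]
```

With sigma=0.1 the neighbouring weights are around exp(-50), so they round to 0 and the center weight becomes exactly 1.0, i.e. raw value 65536.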
 
Continuing to debug, I found that the problem is in the function 'hlineSmoothONa_yzy_a()' at smooth.simd.hpp(1201).
The border is calculated correctly, but the inner part of the image is processed by a SIMD-accelerated block. When I commented out this SIMD block, the output was correct and the test passed. I then traced the problem to this line:

```
v_mul_expand(vx_load(src + pre_shift * cn), vx_setall_u16((uint16_t) *((uint32_t*)(m + pre_shift))), v_res0, v_res1);
```

(Unfortunately, this line was modified in a pull request where narrowing the processed words from 32 bits to 16 bits was suggested: https://github.com/opencv/opencv/pull/18983/commits/6b75e4ddd697442b42cb6206bf8735e06336221b by @terfendail)

For U16 images, these special kernels produce coefficient vectors like this in this function:
[0,0,0,65536,0,0,0] (e.g. for a 7x7 kernel)

Here 'm + pre_shift' points to the middle element of the 'm' vector, which is 65536. The cast (uint16_t) *((uint32_t*)(m + pre_shift)) therefore returns 0, which sets all the interior cells of the output matrix to 0.
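
The truncation can be reproduced with plain integer arithmetic. The sketch below emulates the C cast by masking the low 16 bits; `narrow_to_u16` and `mul_u16` are hypothetical helpers that stand in for the cast and for one lane of the widening multiply, not OpenCV functions.

```python
FRAC_BITS = 16
ONE_RAW = 1 << FRAC_BITS  # raw 32-bit value of 1.0: 65536 (0x10000)

def narrow_to_u16(raw32):
    """Emulate the C cast (uint16_t)raw32: keep only the low 16 bits."""
    return raw32 & 0xFFFF

def mul_u16(pixel, coeff16):
    """Emulate one lane of a widening 16x16 -> 32-bit multiply."""
    return (pixel * coeff16) & 0xFFFFFFFF

center = narrow_to_u16(ONE_RAW)
print(center)                # 0: the coefficient 1.0 is lost
print(mul_u16(151, center))  # 0: every pixel is multiplied by 0

# A coefficient below 1.0 still fits in 16 bits and survives the cast.
print(narrow_to_u16(ONE_RAW // 2))  # 32768
```

Only the exact value 1.0 needs 17 bits, which is why the failure shows up only for kernels whose center coefficient is 65536.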

Solution:
-----------
My solution keeps the SIMD processing throughput while handling the corner case that comes from narrowing the kernel value type.
The SIMD code checks for this special kernel case (in the function 'hlineSmoothONa_yzy_a()' at smooth.simd.hpp(1236)):
```
        if (*(m + pre_shift) == ufixedpoint32::fromRaw(1 << 16))
        {
            v_res0 = vx_load_expand(src + pre_shift * cn) << 16;
            v_res1 = vx_load_expand(src + pre_shift * cn + VECSZ) << 16;
        }
        else
        {
            v_mul_expand(vx_load(src + pre_shift * cn), vx_setall_u16((uint16_t) *((uint32_t*)(m + pre_shift))), v_res0, v_res1);
        }
```
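
The branch structure of this patch can be checked with a scalar model. The following is only a sketch for a single pixel and a single coefficient, assuming 16 fractional bits; it does not reproduce the SIMD lanes, the accumulation over the other kernel taps, or OpenCV's exact rounding. `scale_pixel` and `to_pixel` are hypothetical names.

```python
FRAC_BITS = 16
ONE_RAW = 1 << FRAC_BITS  # raw fixed-point value of 1.0

def scale_pixel(pixel, coeff_raw32):
    """32-bit fixed-point product of a 16-bit pixel and one kernel coefficient."""
    if coeff_raw32 == ONE_RAW:
        # Unity coefficient does not fit in 16 bits: shift instead of multiplying.
        return (pixel << FRAC_BITS) & 0xFFFFFFFF
    # Any other coefficient is assumed to be below 1.0, so it fits in 16 bits.
    return (pixel * (coeff_raw32 & 0xFFFF)) & 0xFFFFFFFF

def to_pixel(acc):
    """Round a fixed-point accumulator back to an integer pixel value."""
    return (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS

print(to_pixel(scale_pixel(151, ONE_RAW)))       # 151: the pixel is preserved
print(to_pixel(scale_pixel(151, ONE_RAW // 2)))  # 76: 75.5 rounded up
```

The shift path gives the same result that a full 32-bit multiply by 65536 would give, so only the unity case changes behaviour.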

##### Steps to reproduce

```
import cv2
import numpy as np
src = np.random.randint(low=255, size=(128, 128), dtype=np.uint16)
dst = cv2.GaussianBlur(src, (7, 7), sigmaX=0.1, sigmaY=0.1, borderType=cv2.BORDER_DEFAULT)
print("src:", src[1:10, 1:10])
print("dst:", dst[1:10, 1:10])
```

```
src: [[151 165  26 143  85  57 227 186 159]
 [245  96 169 158  80  82  98  60   2]
 [ 46 154 165 240 148 250 194 206 242]
 [214 174 178  81 140  76  73  88 106]
 [234 129 178  63  70  49  35  79  61]
 [115 128 226 169  79 224 112  73 136]
 [ 26  38 100 145   5  69  96 180 202]
 [116 145 144 111 221  97 209  75 109]
 [188  34 224 136  33 184 226  81 120]]
dst: [[151 165   0   0   0   0   0   0   0]
 [245  96   0   0   0   0   0   0   0]
 [ 46 154   0   0   0   0   0   0   0]
 [214 174   0   0   0   0   0   0   0]
 [234 129   0   0   0   0   0   0   0]
 [115 128   0   0   0   0   0   0   0]
 [ 26  38   0   0   0   0   0   0   0]
 [116 145   0   0   0   0   0   0   0]
 [188  34   0   0   0   0   0   0   0]]
```
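
Since the kernel for sigma=0.1 has a single coefficient of 1.0, the blurred image should match the source. A small checker like the sketch below could quantify the damage; it is pure Python, `count_mismatches` is a hypothetical helper, and the data are the first two rows of the output printed above.

```python
def count_mismatches(src_rows, dst_rows, tol=0):
    """Count pixels whose absolute difference exceeds `tol`."""
    return sum(
        1
        for src_row, dst_row in zip(src_rows, dst_rows)
        for s, d in zip(src_row, dst_row)
        if abs(s - d) > tol
    )

# First two rows of the src/dst windows printed in this report.
src_rows = [[151, 165, 26, 143, 85, 57, 227, 186, 159],
            [245, 96, 169, 158, 80, 82, 98, 60, 2]]
dst_rows = [[151, 165, 0, 0, 0, 0, 0, 0, 0],
            [245, 96, 0, 0, 0, 0, 0, 0, 0]]

print(count_mismatches(src_rows, dst_rows))         # 14
print(count_mismatches(src_rows, dst_rows, tol=4))  # 13
```

With the tolerance of 4 used by the accuracy test, only the pixel with value 2 falls within the allowed deviation; the rest of the zeroed interior is still reported as wrong.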

##### Issue submission checklist

 - [X] I report the issue, it's not a question
 - [X] I checked the problem with documentation, FAQ, open issues,
       forum.opencv.org, Stack Overflow, etc and have not found solution
 - [X] I updated to latest OpenCV version and the issue is still there
 - [X] There is reproducer code and related data files: videos, images, onnx, etc

