DNN: fixed bug in depthwise conv of stride 2#23162
Conversation
| TEST_P(Test_ONNX_layers, DepthWiseConv) | ||
| { | ||
| testONNXModels("depthwiseconv_add"); | ||
| testONNXModels("depthwise_stride2"); | ||
| } |
There was a problem hiding this comment.
-
Please separate test. Merging them is a bad practice.
-
Looks like added test doesn't call the modified code.
-
Also need to add test which triggers
out_j <= pad_lcondition from above (both cases on lines 126 and 95).
There was a problem hiding this comment.
Thanks for the code reviewing.
I have updated the test to trigger out_j <= pad_l more easily.
The difference is: the previous code will enter SIMD optimization, which leads to memory errors. Now, we will break when out_j <= pad_l is true (the test, out_j and pad_l are equal to 1).
There was a problem hiding this comment.
Thank you for update!
Now dumped values are: out_j=1 pad_l=1 outW1=2 VEC_NLANES=4
So both parts of condition are still true:
if (out_j <= pad_l || outW1 - VEC_NLANES < 0)
break;
Problem is that canonical pattern for vectorization with SIMD tail processing is the following:
int j = 0;
for ( ; j < len; j += VECSZ)
{
if (j > len - VECSZ)
{
if (j == 0)
break;
j = len - VECSZ; // "shifted" tail
}
... processing ...
}
(e.g., core / convert.simd.hpp)
Also we could move out "j == 0" inner check and get a better code (in almost all cases):
int j = 0;
if (len < VECSZ) // data is too small
return j; // or skip the loop
for ( ; j < len; j += VECSZ)
{
if (j > len - VECSZ)
{
j = len - VECSZ; // "shifted" tail
}
... processing ...
}
This code is verified and approved for using. Need to completely understand this code pattern.
Existence of other extra checks is very suspicious in such code.
Now return to used tail handling part:
if (out_j + VEC_NLANES > outW1)
{
if (out_j <= pad_l || outW1 - VEC_NLANES < 0)
break;
out_j = outW1 - VEC_NLANES;
}
-
out_j + VEC_NLANES > outW1could be replaced toout_j > outW1 - VEC_NLANES, becauseout_j + VEC_NLANESis not a constant, butoutW1 - VEC_NLANESis a constant for the loop (and could be computed once). We could assume thatout_j + VEC_NLANESis used for the next loop increment but this may be not true if processing part is large (and there are not enough registers to store this temporary value) or compiler is not smart enough. -
outW1 - VEC_NLANES < 0is similar tolen < VEC_NLANES -
out_jis not always start from0. It could start from1(line 65). But not frompad_l. Soout_j <= pad_lcheck looks suspicious. -
I believe we don't need both checks. We should have some modified one according to offset.
depthwise_convolution.cpp
There is mess of SIMD and SIMD-dispatcher code.
- SIMD + reference/baseline code must go to
.simd.hpp - dispatcher code must be in a separate file like it is done in core/imgproc and even dnn(
layers_common.simd.hpp) modules - there is dispatching based on instruction set(!) on
fastDepthwiseConvoptimization of another algorithm (that should not happen, because this is a mess - @vpisarev)
There was a problem hiding this comment.
Hi @alalek, I have updated the code and replaced the check of out_j <= pad_l with in_j < 0.
In the following code:
if (out_j > outW1 - VEC_NLANES)
out_j = outW1 - VEC_NLANES;
int in_j = out_j * stride_w - pad_l;When out_j > outW1 - VEC_NLANES is true, out_j maybe 0. And if the pad_l is 1, the in_j may be < 0. And this will cause a memory error.
There was a problem hiding this comment.
dispatcher code must be in a separate file like it is done in core/imgproc and even dnn(layers_common.simd.hpp) modules
This is something I've been wanting to optimize for a while and hopes to finish it this week in another PR.
|
FYI, |
alalek
left a comment
There was a problem hiding this comment.
Please add assumptions checks for input parameters of this code in the beginning of function.
E.g.:
CV_DbgAssert(pad_l ...);
(just because lines 32 and 65 don't correlate - tail and head processing are different).
| v_float32x4 vrc = v_setall_f32(relu_coeff); | ||
|
|
||
| if (stride_w == 1 || (stride_w == 2 && dilation_w == 1)) | ||
| if ((stride_w == 1 || (stride_w == 2 && dilation_w == 1)) && outW1 >= VEC_NLANES) |
There was a problem hiding this comment.
outW1 >= VEC_NLANES
If this is a guard for SIMD code, then it should be applied early.
There are 10+ statements with unused/dropped result.
| v_float32x4 vrc = v_setall_f32(relu_coeff); | ||
|
|
||
| if (stride_w == 1 || (stride_w == 2 && dilation_w == 1)) | ||
| if ((stride_w == 1 || (stride_w == 2 && dilation_w == 1)) && outW1 >= VEC_NLANES) |
There was a problem hiding this comment.
Another problem of this optimization.
Line 72:
outW1 = max(out_j, outW1 - outW1%VEC_NLANES);
This code is under #if CV_SIMD but its result affects behavior of "scalar" code below (Line 155):
#endif
for (; out_j < outW1; out_j++)
(at the same time we could execute or not execute of any SIMD code)
|
|
||
| if (in_j < 0) | ||
| break; |
There was a problem hiding this comment.
That SIMD bailout code looks strange.
There is no guarantee that out_j value is valid and consistent after that. But that value is used in scalar code below (after #endif), so this is critical:
for (; out_j < outW1; out_j++)
Requirement: "Scalar" code should NOT recompute the same values again after SIMD code.
What values of input parameters trigger that case? (here and below with stride=2)
(currently this bailout is not triggered by tests)
Also there is no similar check in "scalar" part.
There was a problem hiding this comment.
Hi @alalek thanks for your suggestion, I tried simplifying the code further.
f6802f4 to
47de6ea
Compare
|
|
||
| const int VEC_NLANES = 4; | ||
| if (fusedAdd) | ||
| outW1 = min(outW, outW1 - outW1%VEC_NLANES); |
There was a problem hiding this comment.
Next questions (to show why this doesn't look like a valid change):
- why do we have that outsize of
CV_SIMD? - why
VEC_NLANES = 4. what if we optimize for SIMD256?
Original question was why do we change range of scalar for loops in the SIMD code (which could be enabled or disabled).
There was a problem hiding this comment.
I got your point. The code has been updated.
| v_float32x4 vrc = v_setall_f32(relu_coeff); | ||
|
|
||
| if (stride_w == 1 || (stride_w == 2 && dilation_w == 1)) | ||
| if ((stride_w == 1 || (stride_w == 2 && dilation_w == 1)) && outW1 > VEC_NLANES) |
There was a problem hiding this comment.
outW1 > VEC_NLANES
Normally we should use outW1 >= VEC_NLANES check here.
But there is "optional" range offset out_j = 1 (line 65), so condition should look like:
(outW1 - out_j) >= VEC_NLANES
@asmorkalov Any ideas where are GHA checks? Looks like workflows are broken: https://github.com/opencv/opencv/actions/runs/4030029158 Invalid workflow file: .github/workflows/PR-4.x.yaml#L26 |
Merge this PR with test data: opencv/opencv_extra#1041.
Related issue: #23151
Related PR: #22905
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.