Skip to content

DNN: fixed bug in depthwise conv of stride 2#23162

Merged
opencv-pushbot merged 1 commit intoopencv:4.xfrom
zihaomu:issue_23151
Jan 28, 2023
Merged

DNN: fixed bug in depthwise conv of stride 2#23162
opencv-pushbot merged 1 commit intoopencv:4.xfrom
zihaomu:issue_23151

Conversation

@zihaomu
Copy link
Copy Markdown
Member

@zihaomu zihaomu commented Jan 20, 2023

Merge this PR with test data: opencv/opencv_extra#1041.
Related issue: #23151
Related PR: #22905

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@zihaomu zihaomu changed the title fixed bug in depthwise conv of stide 2 DNN: fixed bug in depthwise conv of stride 2 Jan 20, 2023
@zihaomu zihaomu linked an issue Jan 20, 2023 that may be closed by this pull request
4 tasks
@zihaomu zihaomu requested review from alalek and dkurt January 20, 2023 16:19
Copy link
Copy Markdown
Member

@dkurt dkurt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines 1729 to 1737
TEST_P(Test_ONNX_layers, DepthWiseConv)
{
testONNXModels("depthwiseconv_add");
testONNXModels("depthwise_stride2");
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please separate test. Merging them is a bad practice.

  2. Looks like added test doesn't call the modified code.

  3. Also need to add test which triggers out_j <= pad_l condition from above (both cases on lines 126 and 95).

Copy link
Copy Markdown
Member Author

@zihaomu zihaomu Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the code reviewing.
I have updated the test to trigger out_j <= pad_l more easily.

The difference is: the previous code will enter SIMD optimization, which leads to memory errors. Now, we will break when out_j <= pad_l is true (the test, out_j and pad_l are equal to 1).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for update!

Now dumped values are: out_j=1 pad_l=1 outW1=2 VEC_NLANES=4
So both parts of condition are still true:

if (out_j <= pad_l || outW1 - VEC_NLANES < 0)
    break;

Problem is that canonical pattern for vectorization with SIMD tail processing is the following:

    int j = 0;
    for ( ; j < len; j += VECSZ)
    {
        if (j > len - VECSZ)
        {
            if (j == 0)
                break;
            j = len - VECSZ;  // "shifted" tail
        }
        ... processing ...
    }

(e.g., core / convert.simd.hpp)

Also we could move out "j == 0" inner check and get a better code (in almost all cases):

    int j = 0;
    if (len < VECSZ)  // data is too small
        return j;  // or skip the loop
    for ( ; j < len; j += VECSZ)
    {
        if (j > len - VECSZ)
        {
            j = len - VECSZ;  // "shifted" tail
        }
        ... processing ...
    }

This code is verified and approved for using. Need to completely understand this code pattern.

Existence of other extra checks is very suspicious in such code.


Now return to used tail handling part:

if (out_j + VEC_NLANES > outW1)
{
    if (out_j <= pad_l || outW1 - VEC_NLANES < 0)
        break;
    out_j = outW1 - VEC_NLANES;
}
  1. out_j + VEC_NLANES > outW1 could be replaced to out_j > outW1 - VEC_NLANES, because out_j + VEC_NLANES is not a constant, but outW1 - VEC_NLANES is a constant for the loop (and could be computed once). We could assume that out_j + VEC_NLANES is used for the next loop increment but this may be not true if processing part is large (and there are not enough registers to store this temporary value) or compiler is not smart enough.

  2. outW1 - VEC_NLANES < 0 is similar to len < VEC_NLANES

  3. out_j is not always start from 0. It could start from 1 (line 65). But not from pad_l. So out_j <= pad_l check looks suspicious.

  4. I believe we don't need both checks. We should have some modified one according to offset.


depthwise_convolution.cpp

There is mess of SIMD and SIMD-dispatcher code.

  • SIMD + reference/baseline code must go to .simd.hpp
  • dispatcher code must be in a separate file like it is done in core/imgproc and even dnn(layers_common.simd.hpp) modules
  • there is dispatching based on instruction set(!) on fastDepthwiseConv optimization of another algorithm (that should not happen, because this is a mess - @vpisarev)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alalek, I have updated the code and replaced the check of out_j <= pad_l with in_j < 0.

In the following code:

if (out_j > outW1 - VEC_NLANES)
    out_j = outW1 - VEC_NLANES;

int in_j = out_j * stride_w - pad_l;

When out_j > outW1 - VEC_NLANES is true, out_j maybe 0. And if the pad_l is 1, the in_j may be < 0. And this will cause a memory error.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dispatcher code must be in a separate file like it is done in core/imgproc and even dnn(layers_common.simd.hpp) modules

This is something I've been wanting to optimize for a while and hopes to finish it this week in another PR.

@alalek
Copy link
Copy Markdown
Member

alalek commented Jan 23, 2023

FYI, Test_Int8_nets.EfficientDet/0 test is broken on some configurations and it is not related to this patch - skipped in #23167

@dkurt dkurt mentioned this pull request Jan 24, 2023
6 tasks
Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add assumptions checks for input parameters of this code in the beginning of function.

E.g.:

CV_DbgAssert(pad_l ...);

(just because lines 32 and 65 don't correlate - tail and head processing are different).

v_float32x4 vrc = v_setall_f32(relu_coeff);

if (stride_w == 1 || (stride_w == 2 && dilation_w == 1))
if ((stride_w == 1 || (stride_w == 2 && dilation_w == 1)) && outW1 >= VEC_NLANES)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outW1 >= VEC_NLANES

If this is a guard for SIMD code, then it should be applied early.
There are 10+ statements with unused/dropped result.

v_float32x4 vrc = v_setall_f32(relu_coeff);

if (stride_w == 1 || (stride_w == 2 && dilation_w == 1))
if ((stride_w == 1 || (stride_w == 2 && dilation_w == 1)) && outW1 >= VEC_NLANES)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another problem of this optimization.

Line 72:

outW1 = max(out_j, outW1 - outW1%VEC_NLANES);

This code is under #if CV_SIMD but its result affects behavior of "scalar" code below (Line 155):

#endif

        for (; out_j < outW1; out_j++)

(at the same time we could execute or not execute of any SIMD code)

Comment on lines +97 to +99

if (in_j < 0)
break;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That SIMD bailout code looks strange.
There is no guarantee that out_j value is valid and consistent after that. But that value is used in scalar code below (after #endif), so this is critical:

for (; out_j < outW1; out_j++)

Requirement: "Scalar" code should NOT recompute the same values again after SIMD code.


What values of input parameters trigger that case? (here and below with stride=2)
(currently this bailout is not triggered by tests)

Also there is no similar check in "scalar" part.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alalek thanks for your suggestion, I tried simplifying the code further.

@zihaomu zihaomu force-pushed the issue_23151 branch 2 times, most recently from f6802f4 to 47de6ea Compare January 27, 2023 12:57

const int VEC_NLANES = 4;
if (fusedAdd)
outW1 = min(outW, outW1 - outW1%VEC_NLANES);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next questions (to show why this doesn't look like a valid change):

  • why do we have that outsize of CV_SIMD?
  • why VEC_NLANES = 4. what if we optimize for SIMD256?

Original question was why do we change range of scalar for loops in the SIMD code (which could be enabled or disabled).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got your point. The code has been updated.

v_float32x4 vrc = v_setall_f32(relu_coeff);

if (stride_w == 1 || (stride_w == 2 && dilation_w == 1))
if ((stride_w == 1 || (stride_w == 2 && dilation_w == 1)) && outW1 > VEC_NLANES)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outW1 > VEC_NLANES

Normally we should use outW1 >= VEC_NLANES check here.

But there is "optional" range offset out_j = 1 (line 65), so condition should look like:

(outW1 - out_j) >= VEC_NLANES

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done.

@alalek
Copy link
Copy Markdown
Member

alalek commented Jan 28, 2023

All checks have passed
1 successful check

@asmorkalov Any ideas where are GHA checks?


Looks like workflows are broken: https://github.com/opencv/opencv/actions/runs/4030029158
(but green "no status" looks weird)

Invalid workflow file: .github/workflows/PR-4.x.yaml#L26

error parsing called workflow
".github/workflows/PR-4.x.yaml"
-> "opencv/ci-gha-workflow/.github/workflows/OCV-PR-4.x-macOS-ARM64.yaml@main" (source branch with sha:473de38475011ef39c0e79d66ee39ee147e9f7a2)
: You have an error in your yaml syntax on line 43

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Thank you!

@opencv-pushbot opencv-pushbot merged commit cd44aa0 into opencv:4.x Jan 28, 2023
@alalek alalek mentioned this pull request Jan 28, 2023
@asmorkalov asmorkalov mentioned this pull request May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Invalid read at Depthwise convolution with stride 2 and Winograd

4 participants