Skip to content

DNN: ARMv7 compatible fastConv#22183

Merged
alalek merged 2 commits intoopencv:4.xfrom
zihaomu:fastConv_ARMv7_compatible
Jul 7, 2022
Merged

DNN: ARMv7 compatible fastConv#22183
alalek merged 2 commits intoopencv:4.xfrom
zihaomu:fastConv_ARMv7_compatible

Conversation

@zihaomu
Copy link
Copy Markdown
Member

@zihaomu zihaomu commented Jul 3, 2022

This PR is compatible fastConv and winogradConv with ARMv7.
The previous #21910 PR only supported AARCH64 or ARMv8. And it has bugs on ARMv7
as @asenyaev reported.

closes #22188

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@zihaomu zihaomu requested review from alalek and vpisarev July 3, 2022 04:34
double confThreshold = 0.24;
double nmsThreshold = (target == DNN_TARGET_MYRIAD) ? 0.397 : 0.4;
double scoreDiff = 8e-5, iouDiff = 1e-5;
double scoreDiff = 1e-4, iouDiff = 1e-5;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The arm instruction vfmaq_laneq_f32 is only supported at ARMv8. So I use the vmlaq_n_f32 instead. I adjusted the thresholds here and below because I found the vmlaq_n_f32 will generate very little different results than vmlaq_n_f32.

Copy link
Copy Markdown
Member Author

@zihaomu zihaomu Jul 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my M1 chip, the vmlaq_n_f32 will be parsed as follows:

fmul.4s             v24, v22, v20[0]
fadd.4s             v3, v3, v24

And the vfmaq_laneq_f32 will be parsed as follows:

fmla.4s             v16, v2, v14[3]

I'm not sure if the FMA instruction (vmlaq_n_f32) can be parsed into a single arm assembly(vmla.f32 q8, q5, d0[1]) on ARMv7. If that's true, I believe we can do it without adjusting thresholds.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vmlaq_lane_f32 is available on armv7, you can use it as the replacement of vfmaq_laneq_f32

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nihui, thanks for your suggestion. I have tested vmlaq_lane_f32 and vmlaq_n_f32 on my M1 mac, and I found that they were both parsed to follow assembly code:

fmul.4s             v24, v22, v20[0]
fadd.4s             v3, v3, v24

Do you mean the vmlaq_lane_f32 will be parsed to a single arm assembly(like vmla.f32 q8, q5, d0[1]) on ARMv7?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nihui, I have updated the code with vmlaq_lane_f32 and everything works fine. Big thanks!

@zihaomu
Copy link
Copy Markdown
Member Author

zihaomu commented Jul 4, 2022

Hi @alalek, I can't pass the ARMv7 CI. And from the CI's log, I have no idea how to fix it. Any advice?

double confThreshold = 0.24;
double nmsThreshold = (target == DNN_TARGET_MYRIAD) ? 0.397 : 0.4;
double scoreDiff = 8e-5, iouDiff = 1e-5;
double scoreDiff = 1e-4, iouDiff = 1e-5;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vmlaq_lane_f32 is available on armv7, you can use it as the replacement of vfmaq_laneq_f32

@alalek
Copy link
Copy Markdown
Member

alalek commented Jul 4, 2022

@zihaomu Please ignore (ARMv7 configuration is not working on BuildBot)

@zihaomu zihaomu force-pushed the fastConv_ARMv7_compatible branch from a542fa7 to 057a32e Compare July 4, 2022 11:55
@zihaomu zihaomu force-pushed the fastConv_ARMv7_compatible branch from 057a32e to 7bfc1fe Compare July 4, 2022 14:33
float32x4_t r04 = r00, r05 = r00, r06 = r00, r07 = r00;
float32x4_t r08 = r00, r09 = r00, r10 = r00, r11 = r00;
float32x4_t r12 = r00, r13 = r00, r14 = r00, r15 = r00;
float32x2_t q00 = vdup_n_f32(0.0f), q01 = q00, q02 = q00, q03 = q00,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please explain why you use halves of NEON registers on ARMv7? ARMv7 still has 128-bit NEON registers, I don't see why not use all of them

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for code reviewing. As @nihui's commented, vmlaq_lane_f32 is the best substitute for vfmaq_laneq_f32 under the ARMv7 platform.
Another option is vmlaq_n_f32, it will be parsed into two arm assembly code fmul.4s v24, v22, v20[0] and fadd.4s v3, v3, v24. And vmlaq_lane_f32 will be parsed into only one arm assembly code vmla.f32 q8, q5, d0[1] on ARMv7.
At the same time, two consecutive half-length register loads will be converted into a 128bit load during loading data to register, so the data load time is the same.

@vpisarev vpisarev self-assigned this Jul 7, 2022
@vpisarev vpisarev self-requested a review July 7, 2022 01:51
@vpisarev
Copy link
Copy Markdown
Contributor

vpisarev commented Jul 7, 2022

👍

@vpisarev
Copy link
Copy Markdown
Contributor

vpisarev commented Jul 7, 2022

@nihui, btw, let me use this opportunity to thank you for ncnn, which we took some code from. ncnn is a real masterpiece 👍

@alalek alalek merged commit 139c443 into opencv:4.x Jul 7, 2022
@alalek alalek mentioned this pull request Aug 21, 2022
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
DNN: ARMv7 compatible fastConv

* support armv7 on fastConv

* remove whitespace.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug category: dnn optimization platform: arm ARM boards related issues: RPi, NVIDIA TK/TX, etc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Compilation problems on Pi Gen 2 with Distribution Bullseye Full (32 bit)

4 participants