modules/dnn/test/test_model.cpp
Outdated
  double confThreshold = 0.24;
  double nmsThreshold = (target == DNN_TARGET_MYRIAD) ? 0.397 : 0.4;
- double scoreDiff = 8e-5, iouDiff = 1e-5;
+ double scoreDiff = 1e-4, iouDiff = 1e-5;
The ARM intrinsic vfmaq_laneq_f32 is only supported on ARMv8, so I use vmlaq_n_f32 instead. I adjusted the thresholds here and below because I found that vmlaq_n_f32 produces slightly different results than vfmaq_laneq_f32.
On my M1 chip, vmlaq_n_f32 compiles to the following:
fmul.4s v24, v22, v20[0]
fadd.4s v3, v3, v24
And vfmaq_laneq_f32 compiles to the following:
fmla.4s v16, v2, v14[3]
I'm not sure whether the multiply-accumulate intrinsic (vmlaq_n_f32) compiles to a single ARM instruction (vmla.f32 q8, q5, d0[1]) on ARMv7. If it does, I believe we can do this without adjusting the thresholds.
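For readers wondering why the two code paths give slightly different numbers: a separate multiply followed by an add rounds twice, while a fused multiply-add rounds once. The sketch below is a scalar model of that difference, not code from this PR. The function names mla_unfused and mla_fused are hypothetical, and std::fma stands in for the fmla instruction.

```cpp
#include <cassert>
#include <cmath>

// Scalar model of a non-fused multiply-accumulate (fmul followed by fadd):
// the product is rounded to float before the addition.
// Hypothetical helper for illustration only.
float mla_unfused(float acc, float a, float b)
{
    // volatile forces the product to be rounded and stored as a float,
    // so the compiler cannot contract the two operations into one FMA.
    volatile float prod = a * b;
    return acc + prod;
}

// Scalar model of a fused multiply-accumulate (fmla):
// a * b + acc is computed exactly and rounded only once.
// Hypothetical helper for illustration only.
float mla_fused(float acc, float a, float b)
{
    return std::fma(a, b, acc);
}
```

With a = b = 1 + 2^-12 and acc = -(1 + 2^-11), the unfused version returns 0 because the 2^-24 term of the product is lost when the product is rounded, while the fused version keeps it and returns 2^-24. Accumulated over a convolution, differences of this kind are what the loosened scoreDiff was meant to absorb.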
vmlaq_lane_f32 is available on ARMv7; you can use it as a replacement for vfmaq_laneq_f32.
Hi @nihui, thanks for your suggestion. I have tested vmlaq_lane_f32 and vmlaq_n_f32 on my M1 Mac, and I found that both compile to the following assembly code:
fmul.4s v24, v22, v20[0]
fadd.4s v3, v3, v24
Do you mean that vmlaq_lane_f32 compiles to a single ARM instruction (like vmla.f32 q8, q5, d0[1]) on ARMv7?
Hi @nihui, I have updated the code with vmlaq_lane_f32 and everything works fine. Big thanks!
@zihaomu Please ignore (ARMv7 configuration is not working on BuildBot)
a542fa7 to 057a32e
057a32e to 7bfc1fe
float32x4_t r04 = r00, r05 = r00, r06 = r00, r07 = r00;
float32x4_t r08 = r00, r09 = r00, r10 = r00, r11 = r00;
float32x4_t r12 = r00, r13 = r00, r14 = r00, r15 = r00;
float32x2_t q00 = vdup_n_f32(0.0f), q01 = q00, q02 = q00, q03 = q00,
Could you please explain why you use halves of NEON registers on ARMv7? ARMv7 still has 128-bit NEON registers, so I don't see why we shouldn't use all of them.
Thanks for the code review. As @nihui commented, vmlaq_lane_f32 is the best substitute for vfmaq_laneq_f32 on the ARMv7 platform.
Another option is vmlaq_n_f32, but it compiles to two instructions (fmul.4s v24, v22, v20[0] and fadd.4s v3, v3, v24), whereas vmlaq_lane_f32 compiles to a single instruction (vmla.f32 q8, q5, d0[1]) on ARMv7.
At the same time, two consecutive half-length (64-bit) register loads are merged into a single 128-bit load when the data is loaded into registers, so the data load time is the same.
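The half registers follow from the intrinsic signatures: vfmaq_laneq_f32 selects its lane from a 128-bit float32x4_t, while vmlaq_lane_f32 selects it from a 64-bit float32x2_t. A minimal sketch of the per-architecture selection discussed above follows. It is an illustration, not the code merged in this PR: the function name mla_lane1 and the fixed lane index 1 are assumptions, and the scalar fallback exists only so the snippet builds on non-ARM hosts.

```cpp
#include <cassert>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Computes out[i] = acc[i] + a[i] * b[1] for i = 0..3.
// Hypothetical helper; the lane index 1 is fixed purely for illustration.
void mla_lane1(const float acc[4], const float a[4], const float b[4], float out[4])
{
#if defined(__aarch64__)
    // AArch64: fused multiply-add, lane taken from a 128-bit register.
    float32x4_t w = vld1q_f32(b);
    vst1q_f32(out, vfmaq_laneq_f32(vld1q_f32(acc), vld1q_f32(a), w, 1));
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    // ARMv7: multiply-accumulate, lane taken from a 64-bit (half) register,
    // which is why the lane operand is declared as float32x2_t.
    float32x2_t w_lo = vld1_f32(b);  // holds b[0], b[1]
    vst1q_f32(out, vmlaq_lane_f32(vld1q_f32(acc), vld1q_f32(a), w_lo, 1));
#else
    // Scalar fallback so the sketch also compiles on non-ARM hosts.
    for (int i = 0; i < 4; i++)
        out[i] = acc[i] + a[i] * b[1];
#endif
}
```

The accumulator and the multiplied vector stay in full 128-bit registers on both paths; only the operand that supplies the lane is held in a half register on ARMv7.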
👍
@nihui, btw, let me use this opportunity to thank you for ncnn, which we took some code from. ncnn is a real masterpiece 👍
DNN: ARMv7 compatible fastConv
* support armv7 on fastConv
* remove whitespace.
This PR makes fastConv and winogradConv compatible with ARMv7. The previous PR #21910 only supported AArch64 or ARMv8, and it has bugs on ARMv7, as @asenyaev reported.
closes #22188
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.