Further optimize fastDepthwiseConv for RISC-V Vector. #25361

Merged
asmorkalov merged 1 commit into opencv:4.x from hanliutong:rvv-f32
Apr 9, 2024

@hanliutong
Contributor

This patch optimizes fastDepthwiseConv in the f32 layer by using RVV native intrinsics.

This patch was tested on QEMU with VLEN=128 and VLEN=256 (./bin/opencv_test_dnn); the tests pass with both GCC (trunk) and Clang (16.0.6).
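
For orientation, the number of f32 elements that one vector register group holds follows from the RVV definition VLMAX = VLEN / SEW * LMUL. The sketch below evaluates it for the two VLEN values tested. The helper name `vlmax` and the LMUL values 2 and 8 are chosen here for illustration and are not taken from the patch.

```cpp
#include <cstddef>

// VLMAX = (VLEN / SEW) * LMUL, for integer LMUL values.
// vlenBits: vector register width in bits; sewBits: element width in bits.
constexpr std::size_t vlmax(std::size_t vlenBits, std::size_t sewBits, std::size_t lmul)
{
    return vlenBits / sewBits * lmul;
}

// f32 elements (SEW = 32) per register group for the tested VLEN values.
static_assert(vlmax(128, 32, 2) == 8,  "VLEN=128, m2");
static_assert(vlmax(128, 32, 8) == 32, "VLEN=128, m8");
static_assert(vlmax(256, 32, 2) == 16, "VLEN=256, m2");
static_assert(vlmax(256, 32, 8) == 64, "VLEN=256, m8");
```

A loop that asks the hardware for its vector length on each iteration (rather than hard-coding one of these numbers) runs unchanged on both configurations.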

On a real device (k230, VLEN=128, Clang 16.0.6), for the valid test cases in conv::Conv_Depthwise, opencv_perf_dnn showed an average speedup of 5.74x for strided cases (which benefit from using strided loads instead of a buffer) and 1.55x for non-strided cases (which benefit from using more vector registers, m2 -> m8).
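
The strided-load idea can be illustrated with a minimal sketch. It is not the code from this patch: the function names (`rowConvBuffered`, `rowConvStrided`, `demoStridedVsBuffered`), the single-row 3-tap kernel, and the scalar fallback are assumptions made for illustration. The RVV path assumes the v1.0 intrinsics from `<riscv_vector.h>` and is compiled only when `__riscv_vector` is defined; elsewhere a scalar loop with the same access pattern is used.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>
#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

// Buffered idea: gather every `stride`-th input into contiguous scratch
// buffers first, then run a unit-stride multiply-accumulate over them.
// `in` must hold at least (outW - 1) * stride + 3 floats.
void rowConvBuffered(const float* in, float* out, std::size_t outW,
                     std::size_t stride, const float w[3], float bias)
{
    std::vector<float> b0(outW), b1(outW), b2(outW);
    for (std::size_t i = 0; i < outW; ++i) {
        b0[i] = in[i * stride];
        b1[i] = in[i * stride + 1];
        b2[i] = in[i * stride + 2];
    }
    for (std::size_t i = 0; i < outW; ++i)
        out[i] = bias + w[0] * b0[i] + w[1] * b1[i] + w[2] * b2[i];
}

// Strided idea: read the inputs in place with strided loads, no scratch copy.
void rowConvStrided(const float* in, float* out, std::size_t outW,
                    std::size_t stride, const float w[3], float bias)
{
#if defined(__riscv_vector)
    // Strided loads take the distance between elements in bytes.
    const std::ptrdiff_t byteStride =
        static_cast<std::ptrdiff_t>(stride * sizeof(float));
    for (std::size_t i = 0; i < outW; ) {
        // Ask for up to (outW - i) elements using an LMUL=8 register group.
        std::size_t vl = __riscv_vsetvl_e32m8(outW - i);
        const float* p = in + i * stride;
        vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(bias, vl);
        acc = __riscv_vfmacc_vf_f32m8(acc, w[0], __riscv_vlse32_v_f32m8(p,     byteStride, vl), vl);
        acc = __riscv_vfmacc_vf_f32m8(acc, w[1], __riscv_vlse32_v_f32m8(p + 1, byteStride, vl), vl);
        acc = __riscv_vfmacc_vf_f32m8(acc, w[2], __riscv_vlse32_v_f32m8(p + 2, byteStride, vl), vl);
        __riscv_vse32_v_f32m8(out + i, acc, vl);
        i += vl;
    }
#else
    // Scalar fallback with the same in-place strided access pattern.
    for (std::size_t i = 0; i < outW; ++i) {
        const float* p = in + i * stride;
        out[i] = bias + w[0] * p[0] + w[1] * p[1] + w[2] * p[2];
    }
#endif
}

// Runs both variants on a small stride-2 input and returns the largest
// absolute difference between their outputs.
float demoStridedVsBuffered()
{
    const std::size_t outW = 10, stride = 2;
    std::vector<float> in((outW - 1) * stride + 3);
    for (std::size_t i = 0; i < in.size(); ++i)
        in[i] = 0.5f * static_cast<float>(i);
    const float w[3] = {1.0f, 2.0f, -1.0f};
    std::vector<float> a(outW), b(outW);
    rowConvBuffered(in.data(), a.data(), outW, stride, w, 0.25f);
    rowConvStrided(in.data(), b.data(), outW, stride, w, 0.25f);
    float maxDiff = 0.0f;
    for (std::size_t i = 0; i < outW; ++i)
        maxDiff = std::max(maxDiff, std::fabs(a[i] - b[i]));
    return maxDiff;
}
```

Both variants compute the same outputs; the strided one simply skips the copy into the scratch buffers, which is the kind of saving the strided cases above are attributed to.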

Test results (in descending order of speedup)
Name of Test    origin    optimized    vs (origin / optimized)
conv::Conv_Depthwise::(GFLOPS=0.625, K=[3 x 3], IN={1, 32, 368, 368}, OCN=32, G=32, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 123.265 18.761 6.57
conv::Conv_Depthwise::(GFLOPS=0.076, K=[3 x 3], IN={1, 32, 128, 128}, OCN=32, G=32, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 14.899 2.321 6.42
conv::Conv_Depthwise::(GFLOPS=0.063, K=[3 x 3], IN={1, 144, 28, 28}, OCN=144, G=144, S=[2 x 2], BIAS, OCV/CPU) 3.416 0.538 6.35
conv::Conv_Depthwise::(GFLOPS=0.231, K=[3 x 3], IN={1, 64, 112, 112}, OCN=64, G=64, S=[2 x 2], P=[1 x 1], OCV/CPU) 23.177 3.712 6.24
conv::Conv_Depthwise::(GFLOPS=1.130, K=[3 x 3], IN={1, 144, 112, 112}, OCN=144, G=144, S=[2 x 2], BIAS, OCV/CPU) 50.557 8.358 6.05
conv::Conv_Depthwise::(GFLOPS=0.549, K=[3 x 3], IN={1, 120, 92, 92}, OCN=120, G=120, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 30.882 5.158 5.99
conv::Conv_Depthwise::(GFLOPS=0.415, K=[3 x 3], IN={1, 64, 150, 150}, OCN=64, G=64, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 45.272 7.571 5.98
conv::Conv_Depthwise::(GFLOPS=0.426, K=[3 x 3], IN={1, 128, 75, 75}, OCN=128, G=128, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 23.347 4.409 5.3
conv::Conv_Depthwise::(GFLOPS=0.426, K=[3 x 3], IN={1, 256, 38, 38}, OCN=256, G=256, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 13.986 2.724 5.13
conv::Conv_Depthwise::(GFLOPS=0.231, K=[3 x 3], IN={1, 128, 56, 56}, OCN=128, G=128, S=[2 x 2], P=[1 x 1], OCV/CPU) 13.84 2.721 5.09
conv::Conv_Depthwise::(GFLOPS=0.096, K=[3 x 3], IN={1, 144, 32, 32}, OCN=144, G=144, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 4.554 0.908 5.01
conv::Conv_Depthwise::(GFLOPS=0.231, K=[3 x 3], IN={1, 256, 28, 28}, OCN=256, G=256, S=[2 x 2], P=[1 x 1], OCV/CPU) 7.082 1.48 4.79
conv::Conv_Depthwise::(GFLOPS=1.889, K=[3 x 3], IN={1, 64, 160, 160}, OCN=64, G=64, P=[1 x 1], BIAS, OCV/CPU) 32.857 11.918 2.76
conv::Conv_Depthwise::(GFLOPS=4.301, K=[3 x 3], IN={1, 336, 46, 46}, OCN=336, G=336, P=[1 x 1], BIAS, OCV/CPU) 14.647 7.641 1.92
conv::Conv_Depthwise::(GFLOPS=0.076, K=[3 x 3], IN={1, 32, 64, 64}, OCN=32, G=32, P=[1 x 1], BIAS, OCV/CPU) 1.926 1.009 1.91
conv::Conv_Depthwise::(GFLOPS=0.680, K=[3 x 3], IN={1, 96, 64, 64}, OCN=96, G=96, P=[1 x 1], BIAS, OCV/CPU) 5.382 2.91 1.85
conv::Conv_Depthwise::(GFLOPS=0.019, K=[3 x 3], IN={1, 16, 64, 64}, OCN=16, G=16, P=[1 x 1], BIAS, OCV/CPU) 0.85 0.476 1.78
conv::Conv_Depthwise::(GFLOPS=0.019, K=[3 x 3], IN={1, 8, 128, 128}, OCN=8, G=8, P=[1 x 1], BIAS, OCV/CPU) 1.762 1.006 1.75
conv::Conv_Depthwise::(GFLOPS=0.157, K=[3 x 3], IN={1, 8, 368, 368}, OCN=8, G=8, P=[1 x 1], BIAS, OCV/CPU) 12.805 7.309 1.75
conv::Conv_Depthwise::(GFLOPS=0.012, K=[3 x 3], IN={1, 10, 80, 80}, OCN=10, G=10, P=[1 x 1], BIAS, OCV/CPU) 0.861 0.495 1.74
conv::Conv_Depthwise::(GFLOPS=0.473, K=[3 x 3], IN={1, 16, 320, 320}, OCN=16, G=16, P=[1 x 1], BIAS, OCV/CPU) 19.117 11.031 1.73
conv::Conv_Depthwise::(GFLOPS=0.118, K=[3 x 3], IN={1, 16, 160, 160}, OCN=16, G=16, P=[1 x 1], BIAS, OCV/CPU) 5.081 3.036 1.67
conv::Conv_Depthwise::(GFLOPS=0.472, K=[3 x 3], IN={1, 64, 80, 80}, OCN=64, G=64, P=[1 x 1], BIAS, OCV/CPU) 5.664 3.424 1.65
conv::Conv_Depthwise::(GFLOPS=0.170, K=[3 x 3], IN={1, 24, 128, 128}, OCN=24, G=24, P=[1 x 1], BIAS, OCV/CPU) 4.622 2.812 1.64
conv::Conv_Depthwise::(GFLOPS=0.473, K=[3 x 3], IN={1, 32, 160, 160}, OCN=32, G=32, P=[1 x 1], BIAS, OCV/CPU) 10.365 6.334 1.64
conv::Conv_Depthwise::(GFLOPS=0.925, K=[3 x 3], IN={1, 128, 56, 56}, OCN=128, G=128, P=[1 x 1], OCV/CPU) 5.251 3.216 1.63
conv::Conv_Depthwise::(GFLOPS=0.011, K=[3 x 3], IN={1, 24, 32, 32}, OCN=24, G=24, P=[1 x 1], BIAS, OCV/CPU) 0.36 0.222 1.62
conv::Conv_Depthwise::(GFLOPS=0.232, K=[3 x 3], IN={1, 32, 112, 112}, OCN=32, G=32, P=[1 x 1], BIAS, OCV/CPU) 5.026 3.224 1.56
conv::Conv_Depthwise::(GFLOPS=1.660, K=[3 x 3], IN={1, 128, 75, 75}, OCN=128, G=128, P=[1 x 1], BIAS, OCV/CPU) 9.81 6.322 1.55
conv::Conv_Depthwise::(GFLOPS=0.002, K=[3 x 3], IN={1, 4, 80, 80}, OCN=4, G=4, P=[1 x 1], BIAS, OCV/CPU) 0.302 0.196 1.54
conv::Conv_Depthwise::(GFLOPS=0.076, K=[3 x 3], IN={1, 8, 256, 256}, OCN=8, G=8, P=[1 x 1], BIAS, OCV/CPU) 5.59 3.626 1.54
conv::Conv_Depthwise::(GFLOPS=0.976, K=[3 x 3], IN={1, 40, 184, 184}, OCN=40, G=40, P=[1 x 1], BIAS, OCV/CPU) 14.918 9.709 1.54
conv::Conv_Depthwise::(GFLOPS=0.415, K=[3 x 3], IN={1, 32, 150, 150}, OCN=32, G=32, P=[1 x 1], BIAS, OCV/CPU) 8.533 5.587 1.53
conv::Conv_Depthwise::(GFLOPS=0.925, K=[3 x 3], IN={1, 256, 28, 28}, OCN=256, G=256, P=[1 x 1], OCV/CPU) 3.397 2.268 1.5
conv::Conv_Depthwise::(GFLOPS=0.130, K=[3 x 3], IN={1, 24, 112, 112}, OCN=24, G=24, P=[1 x 1], BIAS, OCV/CPU) 3.554 2.431 1.46
conv::Conv_Depthwise::(GFLOPS=2.194, K=[3 x 3], IN={1, 240, 46, 46}, OCN=240, G=240, P=[1 x 1], BIAS, OCV/CPU) 7.254 4.954 1.46
conv::Conv_Depthwise::(GFLOPS=0.030, K=[3 x 3], IN={1, 64, 20, 20}, OCN=64, G=64, P=[1 x 1], BIAS, OCV/CPU) 0.495 0.361 1.37
conv::Conv_Depthwise::(GFLOPS=0.351, K=[3 x 3], IN={1, 96, 46, 46}, OCN=96, G=96, P=[1 x 1], BIAS, OCV/CPU) 2.764 2.016 1.37
conv::Conv_Depthwise::(GFLOPS=0.001, K=[3 x 3], IN={1, 10, 20, 20}, OCN=10, G=10, P=[1 x 1], BIAS, OCV/CPU) 0.084 0.064 1.31
conv::Conv_Depthwise::(GFLOPS=0.000, K=[3 x 3], IN={1, 4, 20, 20}, OCN=4, G=4, P=[1 x 1], BIAS, OCV/CPU) 0.037 0.029 1.27
conv::Conv_Depthwise::(GFLOPS=0.003, K=[3 x 3], IN={1, 10, 40, 40}, OCN=10, G=10, P=[1 x 1], BIAS, OCV/CPU) 0.218 0.172 1.27
conv::Conv_Depthwise::(GFLOPS=1.704, K=[3 x 3], IN={1, 256, 38, 38}, OCN=256, G=256, P=[1 x 1], BIAS, OCV/CPU) 5.349 4.273 1.25
conv::Conv_Depthwise::(GFLOPS=0.000, K=[3 x 3], IN={1, 4, 40, 40}, OCN=4, G=4, P=[1 x 1], BIAS, OCV/CPU) 0.093 0.075 1.24
conv::Conv_Depthwise::(GFLOPS=0.118, K=[3 x 3], IN={1, 64, 40, 40}, OCN=64, G=64, P=[1 x 1], BIAS, OCV/CPU) 1.457 1.184 1.23
conv::Conv_Depthwise::(GFLOPS=0.003, K=[3 x 3], IN={1, 192, 2, 2}, OCN=192, G=192, P=[1 x 1], BIAS, OCV/CPU) 0.092 0.091 1.01
conv::Conv_Depthwise::(GFLOPS=0.000, K=[3 x 3], IN={1, 1, 20, 20}, OCN=1, P=[1 x 1], BIAS, OCV/CPU) 0.109 0.11 1
conv::Conv_Depthwise::(GFLOPS=0.000, K=[3 x 3], IN={1, 1, 40, 40}, OCN=1, P=[1 x 1], BIAS, OCV/CPU) 0.353 0.354 1
conv::Conv_Depthwise::(GFLOPS=0.000, K=[3 x 3], IN={1, 1, 80, 80}, OCN=1, P=[1 x 1], BIAS, OCV/CPU) 1.288 1.287 1
conv::Conv_Depthwise::(GFLOPS=0.001, K=[3 x 3], IN={1, 192, 4, 4}, OCN=192, G=192, S=[2 x 2], BIAS, OCV/CPU) 0.063 0.063 1
conv::Conv_Depthwise::(GFLOPS=0.004, K=[3 x 3], IN={1, 1, 32, 100}, OCN=64, P=[1 x 1], BIAS, OCV/CPU) 5.897 5.899 1
conv::Conv_Depthwise::(GFLOPS=0.014, K=[3 x 3], IN={1, 56, 16, 16}, OCN=56, G=56, P=[1 x 1], BIAS, OCV/CPU) 0.693 0.691 1
conv::Conv_Depthwise::(GFLOPS=0.082, K=[5 x 5], IN={1, 256, 12, 12}, OCN=256, G=256, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 2.182 2.184 1
conv::Conv_Depthwise::(GFLOPS=0.099, K=[5 x 5], IN={1, 128, 24, 24}, OCN=128, G=128, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 4.254 4.248 1
conv::Conv_Depthwise::(GFLOPS=0.108, K=[5 x 5], IN={1, 64, 48, 48}, OCN=64, G=64, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 8.505 8.48 1
conv::Conv_Depthwise::(GFLOPS=0.113, K=[5 x 5], IN={1, 32, 96, 96}, OCN=32, G=32, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 17.413 17.43 1
conv::Conv_Depthwise::(GFLOPS=0.118, K=[5 x 5], IN={1, 256, 6, 6}, OCN=256, G=256, P=[2 x 2], BIAS, OCV/CPU) 2.799 2.801 1
conv::Conv_Depthwise::(GFLOPS=0.231, K=[3 x 3], IN={1, 512, 14, 14}, OCN=512, G=512, S=[2 x 2], P=[1 x 1], OCV/CPU) 1.481 1.485 1
conv::Conv_Depthwise::(GFLOPS=0.265, K=[3 x 3], IN={1, 240, 16, 16}, OCN=240, G=240, P=[1 x 1], BIAS, OCV/CPU) 3.003 3.001 1
conv::Conv_Depthwise::(GFLOPS=0.265, K=[5 x 5], IN={1, 384, 14, 14}, OCN=384, G=384, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 4.132 4.124 1
conv::Conv_Depthwise::(GFLOPS=0.293, K=[3 x 3], IN={1, 288, 14, 14}, OCN=288, G=288, P=[1 x 1], BIAS, OCV/CPU) 2.791 2.785 1
conv::Conv_Depthwise::(GFLOPS=0.336, K=[5 x 5], IN={1, 96, 56, 56}, OCN=96, G=96, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 17.708 17.769 1
conv::Conv_Depthwise::(GFLOPS=0.361, K=[5 x 5], IN={1, 336, 16, 16}, OCN=336, G=336, S=[2 x 2], P=[2 x 2], BIAS, OCV/CPU) 5.474 5.475 1
conv::Conv_Depthwise::(GFLOPS=0.382, K=[3 x 3], IN={1, 576, 8, 8}, OCN=576, G=576, P=[1 x 1], BIAS, OCV/CPU) 2.061 2.058 1
conv::Conv_Depthwise::(GFLOPS=0.398, K=[3 x 3], IN={1, 672, 7, 7}, OCN=672, G=672, P=[1 x 1], BIAS, OCV/CPU) 1.919 1.913 1
conv::Conv_Depthwise::(GFLOPS=0.472, K=[5 x 5], IN={1, 32, 96, 96}, OCN=32, G=32, P=[2 x 2], BIAS, OCV/CPU) 66.218 66.2 1
conv::Conv_Depthwise::(GFLOPS=0.472, K=[5 x 5], IN={1, 64, 48, 48}, OCN=64, G=64, P=[2 x 2], BIAS, OCV/CPU) 33.183 33.168 1
conv::Conv_Depthwise::(GFLOPS=0.472, K=[5 x 5], IN={1, 96, 64, 64}, OCN=96, G=96, S=[2 x 2], P=[2 x 2], BIAS, OCV/CPU) 23.967 23.963 1
conv::Conv_Depthwise::(GFLOPS=0.472, K=[5 x 5], IN={1, 128, 24, 24}, OCN=128, G=128, P=[2 x 2], BIAS, OCV/CPU) 17.507 17.441 1
conv::Conv_Depthwise::(GFLOPS=0.677, K=[5 x 5], IN={1, 40, 184, 184}, OCN=40, G=40, S=[2 x 2], P=[2 x 2], BIAS, OCV/CPU) 82.895 82.762 1
conv::Conv_Depthwise::(GFLOPS=0.737, K=[5 x 5], IN={1, 240, 16, 16}, OCN=240, G=240, P=[2 x 2], BIAS, OCV/CPU) 15.023 15.024 1
conv::Conv_Depthwise::(GFLOPS=0.813, K=[5 x 5], IN={1, 144, 28, 28}, OCN=144, G=144, P=[2 x 2], BIAS, OCV/CPU) 26.411 26.436 1
conv::Conv_Depthwise::(GFLOPS=0.813, K=[5 x 5], IN={1, 288, 14, 14}, OCN=288, G=288, P=[2 x 2], BIAS, OCV/CPU) 14.478 14.505 1
conv::Conv_Depthwise::(GFLOPS=0.925, K=[3 x 3], IN={1, 512, 14, 14}, OCN=512, G=512, P=[1 x 1], OCV/CPU) 4.954 4.949 1
conv::Conv_Depthwise::(GFLOPS=0.925, K=[3 x 3], IN={1, 1024, 7, 7}, OCN=1024, G=1024, P=[1 x 1], OCV/CPU) 2.889 2.889 1
conv::Conv_Depthwise::(GFLOPS=1.062, K=[5 x 5], IN={1, 144, 32, 32}, OCN=144, G=144, P=[2 x 2], BIAS, OCV/CPU) 33.822 33.837 1
conv::Conv_Depthwise::(GFLOPS=1.062, K=[5 x 5], IN={1, 576, 8, 8}, OCN=576, G=576, P=[2 x 2], BIAS, OCV/CPU) 10.065 10.056 1
conv::Conv_Depthwise::(GFLOPS=1.106, K=[5 x 5], IN={1, 672, 7, 7}, OCN=672, G=672, P=[2 x 2], BIAS, OCV/CPU) 9.849 9.884 1
conv::Conv_Depthwise::(GFLOPS=1.344, K=[5 x 5], IN={1, 192, 56, 56}, OCN=192, G=192, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 35.056 35.066 1
conv::Conv_Depthwise::(GFLOPS=1.445, K=[5 x 5], IN={1, 336, 16, 16}, OCN=336, G=336, P=[2 x 2], BIAS, OCV/CPU) 21.019 21.035 1
conv::Conv_Depthwise::(GFLOPS=1.659, K=[5 x 5], IN={1, 960, 14, 14}, OCN=960, G=960, S=[2 x 2], P=[1 x 1], BIAS, OCV/CPU) 10.26 10.255 1
conv::Conv_Depthwise::(GFLOPS=1.734, K=[5 x 5], IN={1, 64, 92, 92}, OCN=64, G=64, P=[2 x 2], BIAS, OCV/CPU) 122.285 122.118 1
conv::Conv_Depthwise::(GFLOPS=2.986, K=[5 x 5], IN={1, 336, 46, 46}, OCN=336, G=336, S=[2 x 2], P=[2 x 2], BIAS, OCV/CPU) 44.679 44.659 1
conv::Conv_Depthwise::(GFLOPS=6.094, K=[5 x 5], IN={1, 480, 23, 23}, OCN=480, G=480, P=[2 x 2], BIAS, OCV/CPU) 62.342 62.74 0.99
conv::Conv_Depthwise::(GFLOPS=6.525, K=[5 x 5], IN={1, 1632, 7, 7}, OCN=1632, G=1632, P=[2 x 2], BIAS, OCV/CPU) 23.647 23.852 0.99
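
As a worked check of the strided figure, the sketch below recomputes the `vs` column as origin / optimized for the twelve rows above with K=[3 x 3] and S=[2 x 2] whose ratio exceeds 1, and averages them. Treating exactly those rows as the "valid" strided cases is an assumption, since the description does not define the selection. The names `Timing`, `meanSpeedup`, and `stridedRows` are illustrative.

```cpp
#include <cstddef>
#include <vector>

// One benchmark row: time before and after the patch, as listed in the table.
struct Timing {
    double origin;
    double optimized;
};

// Arithmetic mean of the per-row speedups (origin / optimized).
double meanSpeedup(const std::vector<Timing>& rows)
{
    double sum = 0.0;
    for (const Timing& t : rows)
        sum += t.origin / t.optimized;
    return rows.empty() ? 0.0 : sum / static_cast<double>(rows.size());
}

// The twelve 3x3, stride-2 rows from the table with a ratio above 1.
std::vector<Timing> stridedRows()
{
    return {
        {123.265, 18.761}, {14.899, 2.321}, {3.416, 0.538},  {23.177, 3.712},
        {50.557, 8.358},   {30.882, 5.158}, {45.272, 7.571}, {23.347, 4.409},
        {13.986, 2.724},   {13.84, 2.721},  {4.554, 0.908},  {7.082, 1.48},
    };
}
```

With these inputs, `meanSpeedup(stridedRows())` comes out at roughly 5.74, in line with the average quoted in the description.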

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@hanliutong changed the title from "Further optimize fastDepthwiseConv for RVV." to "Further optimize fastDepthwiseConv for RISC-V Vector." on Apr 7, 2024
@asmorkalov requested a review from mshabunin on April 8, 2024 09:57
@asmorkalov added this to the 4.10.0 milestone on Apr 9, 2024
@asmorkalov merged commit e4677fb into opencv:4.x on Apr 9, 2024
@hanliutong deleted the rvv-f32 branch on April 10, 2024 02:25
@asmorkalov mentioned this pull request on Apr 10, 2024