core: vectorize cv::normalize / cv::norm by fengyuentau · Pull Request #26885 · opencv/opencv

fengyuentau · 2025-02-07T12:41:12Z

Checklist:

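| | normInf | normL1 | normL2 |
| ---- | ------- | ------ | ------ |
| bool | - | - | - |
| 8u | √ | √ | √ |
| 8s | √ | √ | √ |
| 16u | √ | √ | √ |
| 16s | √ | √ | √ |
| 16f | - | - | - |
| 16bf | - | - | - |
| 32u | - | - | - |
| 32s | √ | √ | √ |
| 32f | √ | √ | √ |
| 64u | - | - | - |
| 64s | - | - | - |
| 64f | √ | √ | √ |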

*: Vectorization of data types bool, 16f, 16bf, 32u, 64u and 64s needs to be done on 5.x.
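
For readers less familiar with the three norm types in the checklist, here is a minimal scalar sketch of what each kernel computes over a flat array. It is not the code from this PR: the names `refNormInf`, `refNormL1` and `refNormL2` are hypothetical, and accumulating in `double` for every input type is an assumption made for simplicity.

```cpp
#include <cmath>
#include <vector>

// Hypothetical scalar references; not OpenCV's implementation.

// Infinity norm: the largest absolute value.
template <typename T>
double refNormInf(const std::vector<T>& v) {
    double m = 0.0;
    for (T x : v) {
        // Convert before taking the absolute value so that the most
        // negative signed value (e.g. -128 for 8s) cannot overflow.
        double a = std::fabs(static_cast<double>(x));
        if (a > m) m = a;
    }
    return m;
}

// L1 norm: the sum of absolute values.
template <typename T>
double refNormL1(const std::vector<T>& v) {
    double s = 0.0;
    for (T x : v) s += std::fabs(static_cast<double>(x));
    return s;
}

// L2 norm: the square root of the sum of squares.
template <typename T>
double refNormL2(const std::vector<T>& v) {
    double s = 0.0;
    for (T x : v) {
        double d = static_cast<double>(x);
        s += d * d;
    }
    return std::sqrt(s);
}
```

A scalar reference like this is mainly useful as a baseline to compare vectorized kernels against, both for accuracy and for timing.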

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake

asmorkalov · 2025-02-10T15:24:38Z

Hm... The patch introduces a significant regression for L1 and L2 on ARMv8 (tested on Jetson Orin):

Test 	Before 	After 	Speedup (before/after)
Norm1Arg::OCL_NormFixture::(640x480, 8UC1, NORM_INF) 	0.053 	0.010 	5.51
Norm1Arg::OCL_NormFixture::(640x480, 8UC1, NORM_L1) 	0.020 	0.067 	0.30
Norm1Arg::OCL_NormFixture::(640x480, 8UC1, NORM_L2) 	0.042 	0.071 	0.59
Norm1Arg::OCL_NormFixture::(640x480, 32FC1, NORM_INF) 	0.563 	0.042 	13.26
Norm1Arg::OCL_NormFixture::(640x480, 32FC1, NORM_L1) 	0.214 	0.142 	1.51
Norm1Arg::OCL_NormFixture::(640x480, 32FC1, NORM_L2) 	0.198 	0.121 	1.64
Norm1Arg::OCL_NormFixture::(640x480, 8UC3, NORM_INF) 	0.158 	0.031 	5.04
Norm1Arg::OCL_NormFixture::(640x480, 8UC3, NORM_L1) 	0.060 	0.200 	0.30
Norm1Arg::OCL_NormFixture::(640x480, 8UC3, NORM_L2) 	0.126 	0.213 	0.59
Norm1Arg::OCL_NormFixture::(640x480, 32FC3, NORM_INF) 	1.688 	0.138 	12.27
Norm1Arg::OCL_NormFixture::(640x480, 32FC3, NORM_L1) 	0.639 	0.428 	1.49
Norm1Arg::OCL_NormFixture::(640x480, 32FC3, NORM_L2) 	0.599 	0.361 	1.66
Norm1Arg::OCL_NormFixture::(640x480, 8UC4, NORM_INF) 	0.212 	0.043 	4.93
Norm1Arg::OCL_NormFixture::(640x480, 8UC4, NORM_L1) 	0.081 	0.267 	0.30
Norm1Arg::OCL_NormFixture::(640x480, 8UC4, NORM_L2) 	0.167 	0.284 	0.59
Norm1Arg::OCL_NormFixture::(640x480, 32FC4, NORM_INF) 	2.248 	0.193 	11.66
Norm1Arg::OCL_NormFixture::(640x480, 32FC4, NORM_L1) 	0.873 	0.589 	1.48
Norm1Arg::OCL_NormFixture::(640x480, 32FC4, NORM_L2) 	0.812 	0.506 	1.61
Norm1Arg::OCL_NormFixture::(1280x720, 8UC1, NORM_INF) 	0.159 	0.031 	5.13
Norm1Arg::OCL_NormFixture::(1280x720, 8UC1, NORM_L1) 	0.061 	0.199 	0.31
Norm1Arg::OCL_NormFixture::(1280x720, 8UC1, NORM_L2) 	0.126 	0.212 	0.59
Norm1Arg::OCL_NormFixture::(1280x720, 32FC1, NORM_INF) 	1.688 	0.138 	12.26
Norm1Arg::OCL_NormFixture::(1280x720, 32FC1, NORM_L1) 	0.650 	0.425 	1.53
Norm1Arg::OCL_NormFixture::(1280x720, 32FC1, NORM_L2) 	0.601 	0.361 	1.66
Norm1Arg::OCL_NormFixture::(1280x720, 8UC3, NORM_INF) 	0.475 	0.101 	4.71
Norm1Arg::OCL_NormFixture::(1280x720, 8UC3, NORM_L1) 	0.179 	0.593 	0.30
Norm1Arg::OCL_NormFixture::(1280x720, 8UC3, NORM_L2) 	0.377 	0.634 	0.60
Norm1Arg::OCL_NormFixture::(1280x720, 32FC3, NORM_INF) 	5.050 	0.818 	6.17
Norm1Arg::OCL_NormFixture::(1280x720, 32FC3, NORM_L1) 	1.914 	1.266 	1.51
Norm1Arg::OCL_NormFixture::(1280x720, 32FC3, NORM_L2) 	1.786 	1.075 	1.66
Norm1Arg::OCL_NormFixture::(1280x720, 8UC4, NORM_INF) 	0.655 	0.137 	4.79
Norm1Arg::OCL_NormFixture::(1280x720, 8UC4, NORM_L1) 	0.248 	0.795 	0.31
Norm1Arg::OCL_NormFixture::(1280x720, 8UC4, NORM_L2) 	0.505 	0.847 	0.60
Norm1Arg::OCL_NormFixture::(1280x720, 32FC4, NORM_INF) 	6.731 	1.164 	5.78
Norm1Arg::OCL_NormFixture::(1280x720, 32FC4, NORM_L1) 	2.539 	1.692 	1.50
Norm1Arg::OCL_NormFixture::(1280x720, 32FC4, NORM_L2) 	2.375 	1.440 	1.65
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC1, NORM_INF) 	0.360 	0.074 	4.84
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC1, NORM_L1) 	0.136 	0.447 	0.30
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC1, NORM_L2) 	0.283 	0.475 	0.59
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC1, NORM_INF) 	3.787 	0.564 	6.72
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC1, NORM_L1) 	1.424 	0.949 	1.50
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC1, NORM_L2) 	1.334 	0.806 	1.66
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC3, NORM_INF) 	1.079 	0.343 	3.15
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC3, NORM_L1) 	0.461 	1.364 	0.34
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC3, NORM_L2) 	0.888 	1.494 	0.59
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC3, NORM_INF) 	11.359 	2.103 	5.40
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC3, NORM_L1) 	4.318 	2.836 	1.52
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC3, NORM_L2) 	3.999 	2.412 	1.66
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC4, NORM_INF) 	1.427 	0.533 	2.68
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC4, NORM_L1) 	0.604 	1.776 	0.34
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC4, NORM_L2) 	1.134 	1.898 	0.60
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC4, NORM_INF) 	15.141 	2.876 	5.27
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC4, NORM_L1) 	5.868 	3.791 	1.55
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC4, NORM_L2) 	5.333 	3.230 	1.65

perf-norm.zip
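
The table above can be read with a small helper, assuming the three numeric columns are the baseline time, the patched time, and their ratio. That reading is consistent with the numbers but is not stated in the thread. The names `speedup` and `isRegression` and the 5% tolerance are hypothetical.

```cpp
// Hypothetical helper: ratio of baseline time to patched time.
// Values above 1 mean the patched build is faster; below 1, slower.
double speedup(double before, double after) {
    return before / after;
}

// Flags a row as a regression when the patched build is slower than the
// baseline by more than the given tolerance (an assumed noise threshold).
bool isRegression(double before, double after, double tolerance = 0.05) {
    return speedup(before, after) < 1.0 - tolerance;
}
```

For example, the 640x480 8UC1 NORM_L1 row (0.020 vs 0.067) is flagged, while the 32FC1 NORM_INF row (0.563 vs 0.042) is not. Ratios recomputed from the displayed timings only approximate the last column, presumably because the timings shown are rounded.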

fengyuentau · 2025-02-11T06:13:00Z

@asmorkalov I do not have access to an Orin right now, but can you apply the following changes and run this test instead of the OCL ones?

diff --git a/modules/core/perf/perf_norm.cpp b/modules/core/perf/perf_norm.cpp
index 07f989f21c..e364192bad 100644
--- a/modules/core/perf/perf_norm.cpp
+++ b/modules/core/perf/perf_norm.cpp
@@ -14,7 +14,7 @@ typedef perf::TestBaseWithParam<Size_MatType_NormType_t> Size_MatType_NormType;
 PERF_TEST_P(Size_MatType_NormType, norm,
             testing::Combine(
                 testing::Values(TYPICAL_MAT_SIZES),
-                testing::Values(TYPICAL_MAT_TYPES),
+                testing::Values(CV_8UC1, CV_8SC1, CV_16UC1, CV_16SC1, CV_32SC1, CV_32FC1, CV_64FC1),
                 testing::Values((int)NORM_INF, (int)NORM_L1, (int)NORM_L2)
                 )
             )

I also observed some performance regressions on my host with an i7-12700K CPU.

fengyuentau · 2025-02-11T09:49:57Z

> I also observed some performance regressions on my host with i7-12700k cpu.

I found that when the CPU baseline is SSE, there are performance regressions, mostly in the 8-bit and 16-bit kernels. When the CPU baseline is set to AVX/AVX2, this PR gives some performance improvement.
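
When comparing results across baselines like this, it can help to confirm which instruction set a given build actually targets. The sketch below relies on macros predefined by GCC and Clang (MSVC defines only some of them); the function name `compiledSimdBaseline` is hypothetical. In an OpenCV build the baseline is normally chosen at configure time through the `CPU_BASELINE` CMake option, which this snippet does not read directly.

```cpp
#include <string>

// Reports the widest SIMD extension the compiler was told to target,
// based on predefined compiler macros. This reflects compile flags only,
// not what the CPU running the binary supports.
std::string compiledSimdBaseline() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__ARM_NEON)
    return "NEON";
#elif defined(__riscv_vector)
    return "RVV";
#else
    return "scalar";
#endif
}
```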

asmorkalov · 2025-02-11T15:04:40Z

The fix does not resolve the ARM regression:
jetson-orin-3.zip

fengyuentau · 2025-02-13T08:57:07Z

Regression on RISC-V for some reason:

https://github.com/opencv/opencv/actions/runs/13302059428/job/37145128708?pr=26885#step:16:715

[ RUN      ] Core_Array.basic_operations
/home/ci/opencv/modules/ts/src/ts.cpp:612: Failure
Failed

	failure reason: Invalid function output
	test case #-1
	seed: 00000000000c5a61
-----------------------------------
	LOG:
4: The norms are different: 801.37103666866482854/7290.6957819068729805/1996.2299161095104409 vs 801.37103666866482854/7290.69580078125/1996.2299466744807432

-----------------------------------

[  FAILED  ] Core_Array.basic_operations (24 ms)
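
One observation about the numbers in this log, offered as a sketch rather than a diagnosis: the second value on the right-hand side (7290.69580078125) is exactly what the second value on the left-hand side becomes when rounded to single precision. The constant and function names below are hypothetical; the two values are copied from the log.

```cpp
#include <cmath>

// Second value of each triple in the CI log above.
const double kLeft  = 7290.6957819068729805;
const double kRight = 7290.69580078125;

// True if rounding the left-hand value to a 32-bit float and widening it
// back to double reproduces the right-hand value exactly.
bool rightIsFloatRoundingOfLeft() {
    return static_cast<double>(static_cast<float>(kLeft)) == kRight;
}

// Relative difference between the two logged values.
double relativeDifference() {
    return std::fabs(kLeft - kRight) / std::fabs(kLeft);
}
```

This does not establish where the difference comes from. It only shows that the mismatch for this value is consistent with the result passing through a 32-bit float at some point. The third value in the log does not follow the same pattern.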

I tested on my SpaceMIT MUSE Pi (K1) with GCC, and it was green.

fengyuentau · 2025-02-14T11:31:45Z

modules/core/src/norm.simd.hpp

+struct NormL2_SIMD<double, double> {
+    double operator() (const double* src, int n) const {
+        int j = 0;
+        double s = 0.f;
+#if CV_RVV // This is introduced to workaround the accuracy issue on ci
+        s = normL2_rvv(src, n, j);
+#else


This was introduced to work around failing tests on the RVV CI node.

The RVV CI node is getting some weird accuracy issues in normL2 with double.

But it works fine on real hardware. Could it be a QEMU bug?
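
The shape of the workaround in the hunk above, a platform-specific kernel selected by the preprocessor with a generic path otherwise, can be sketched in a self-contained way. Everything here is hypothetical: `DEMO_NATIVE_PATH` is a made-up macro, and `sumSquaresNative` is only a stand-in for a call like `normL2_rvv(src, n, j)`. The sketch assumes, as the call shape suggests, that the kernel advances `j` past the elements it consumed.

```cpp
// Hypothetical stand-in for a platform-specific kernel. It processes
// elements in groups of four, advances j past them, and returns the
// partial sum of squares.
double sumSquaresNative(const double* src, int n, int& j) {
    double s = 0.0;
    for (; j + 4 <= n; j += 4)
        s += src[j] * src[j] + src[j + 1] * src[j + 1]
           + src[j + 2] * src[j + 2] + src[j + 3] * src[j + 3];
    return s;
}

// Define DEMO_NATIVE_PATH (a made-up macro) to select the native kernel.
double sumSquares(const double* src, int n) {
    int j = 0;
    double s = 0.0;
#if defined(DEMO_NATIVE_PATH)
    s = sumSquaresNative(src, n, j);
#endif
    // Scalar tail: handles whatever the native path left over, or the
    // whole array when the native path is compiled out.
    for (; j < n; j++)
        s += src[j] * src[j];
    return s;
}
```

Keeping the scalar tail outside the conditional means both configurations share the same remainder handling, so only the bulk loop differs between platforms.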

fengyuentau · 2025-02-14T11:32:01Z

modules/core/src/norm.simd.hpp

+struct NormL1_SIMD<double, double> {
+    double operator() (const double* src, int n) const {
+        int j = 0;
+        double s = 0.f;
+#if CV_RVV // This is introduced to workaround the accuracy issue on ci
+        s = normL1_rvv(src, n, j);
+#else


This was introduced to work around failing tests on the RVV CI node.

The RVV CI node is getting some weird accuracy issues in normL1 with double.

But it works fine on real hardware. Could it be a QEMU bug?

fengyuentau · 2025-02-17T08:29:45Z

@asmorkalov Hello, I have fixed performance on Orin with gcc == 11.4.0. See the attached file for performance.
orin-perf.zip

fengyuentau · 2025-02-17T10:06:04Z

By the way, here is a performance comparison with #26887 on K1.

~~k1.zip~~

k1.zip (with 8UC4 test case included)

fengyuentau · 2025-02-17T10:27:21Z

@asmorkalov Let me know if there is any blocker to merging this PR.

asmorkalov · 2025-02-18T08:24:29Z

The latest version is much better for ARMv8. Jetson Orin does not show regressions.
jetson-orin-5.zip

fengyuentau · 2025-02-18T09:17:46Z

@asmorkalov By the way, CI node "PR:4.x / Android-Test / BuildAndTest (pull_request)" seems to be broken. Should we disable this for now since you do not seem to have time to fix it?

asmorkalov · 2025-02-18T10:27:26Z

Jetson TK1 (ARMv7) results. There are several regressions for FP32, but I do not think they are critical.
jetson-tk1-2.zip

fengyuentau added 3 commits February 7, 2025 20:23

vectorized normL1 32f

9f21d3e

vectorized normL2 32f

604e66d

vectorized normInf 32f

423bbf0

fengyuentau added optimization category: core labels Feb 7, 2025

fengyuentau added this to the 4.12.0 milestone Feb 7, 2025

fengyuentau requested review from asmorkalov and vpisarev February 7, 2025 12:41

fengyuentau added 6 commits February 10, 2025 16:18

vectorized normInf, normL1, normL2 16u

a89a5a2

vectorized normInf, normL1, normL2 8u

0ae47d6

vectorized normInf, normL1, normL2 32s

af259c3

vectorized normInf, normL1, normL2 8s

924e0a2

vectorized normInf, normL1, normL2 16s

15cd3c2

vectorized normInf, normL1, normL2 64f

dd3fbee

fengyuentau marked this pull request as ready for review February 10, 2025 10:08

asmorkalov approved these changes Feb 10, 2025

fengyuentau added 2 commits February 10, 2025 22:20

fix accuracy bug on 8s, 16s kernels

6e1c074

fix perf on 8s, 16s and 32s kernels

21f7b4f

asmorkalov self-requested a review February 10, 2025 15:25

fix normInf short* kernel

b9752b0

minor fixes for perf

dc729dd

fengyuentau added 2 commits February 12, 2025 17:09

decrease normInf 32s unrolling to 4

d049c74

further optimized normL2Sqr for all data types

0768a6c

split norm into dispatcher and simd header

2fa3d4d

workaround weird crash on ci of rvv node

0b14b2f

fengyuentau commented Feb 14, 2025

modules/core/src/norm.simd.hpp (outdated, resolved)

fengyuentau commented Feb 14, 2025

modules/core/src/norm.simd.hpp (outdated, resolved)

introduce implementations with native rvv intrinsics to bypass accuracy issue only on ci

e2bc549

fengyuentau commented Feb 14, 2025

fengyuentau added 6 commits February 15, 2025 02:40

fix a bug in non-simd normInf kernel; boost normL2 schar kernel for rvv

42257a5

drop the boost and use dotprod_fast and dotprod_expand_fast instead

b74fa3f

use v_dotprod_expand_fast for NormL1_SIMD of 8u, 8s, 16u and 16s

7c540d6

fixed build warnings; added normL1_rvv for ushort and short to fix performance

ce48e95

revert changes in normL1 of short and ushort to fix performance

057a10f

fixed and added several kernels for rvv to make it faster

3908104

asmorkalov reviewed Feb 18, 2025

modules/core/src/norm.dispatch.cpp (resolved)

asmorkalov self-assigned this Feb 18, 2025

fengyuentau added 2 commits February 18, 2025 20:06

simplify if calling norml2, norml2sqr, norml1 and norminf on 32f data

501ae29

fix normInf on 32f input

5854b65

asmorkalov added the platform: riscv label Feb 19, 2025

asmorkalov merged commit e2803be into opencv:4.x Feb 21, 2025
28 checks passed

asmorkalov mentioned this pull request Mar 4, 2025

5.x merge 4.x #27009

Merged

opencv-alalek mentioned this pull request Mar 14, 2025

core: improve norm of hal rvv #26991

Merged
