Skip to content

core: vectorize cv::normalize / cv::norm#26885

Merged
asmorkalov merged 26 commits intoopencv:4.xfrom
fengyuentau:4x/core/normalize_simd
Feb 21, 2025
Merged

core: vectorize cv::normalize / cv::norm#26885
asmorkalov merged 26 commits intoopencv:4.xfrom
fengyuentau:4x/core/normalize_simd

Conversation

@fengyuentau
Copy link
Copy Markdown
Member

@fengyuentau fengyuentau commented Feb 7, 2025

Checklist:

normInf normL1 normL2
bool - - -
8u
8s
16u
16s
16f - - -
16bf - - -
32u - - -
32s
32f
64u - - -
64s - - -
64f

*: Vectorization of data type bool, 16f, 16bf, 32u, 64u and 64s needs to be done on 5.x.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@fengyuentau fengyuentau marked this pull request as ready for review February 10, 2025 10:08
@asmorkalov
Copy link
Copy Markdown
Contributor

asmorkalov commented Feb 10, 2025

Hm.. The patch introduces significant regression for L1 and L2 on ARMv8 (Tested with Jetson Orin):

Norm1Arg::OCL_NormFixture::(640x480, 8UC1, NORM_INF) 	0.053 	0.010 	5.51
Norm1Arg::OCL_NormFixture::(640x480, 8UC1, NORM_L1) 	0.020 	0.067 	0.30
Norm1Arg::OCL_NormFixture::(640x480, 8UC1, NORM_L2) 	0.042 	0.071 	0.59
Norm1Arg::OCL_NormFixture::(640x480, 32FC1, NORM_INF) 	0.563 	0.042 	13.26
Norm1Arg::OCL_NormFixture::(640x480, 32FC1, NORM_L1) 	0.214 	0.142 	1.51
Norm1Arg::OCL_NormFixture::(640x480, 32FC1, NORM_L2) 	0.198 	0.121 	1.64
Norm1Arg::OCL_NormFixture::(640x480, 8UC3, NORM_INF) 	0.158 	0.031 	5.04
Norm1Arg::OCL_NormFixture::(640x480, 8UC3, NORM_L1) 	0.060 	0.200 	0.30
Norm1Arg::OCL_NormFixture::(640x480, 8UC3, NORM_L2) 	0.126 	0.213 	0.59
Norm1Arg::OCL_NormFixture::(640x480, 32FC3, NORM_INF) 	1.688 	0.138 	12.27
Norm1Arg::OCL_NormFixture::(640x480, 32FC3, NORM_L1) 	0.639 	0.428 	1.49
Norm1Arg::OCL_NormFixture::(640x480, 32FC3, NORM_L2) 	0.599 	0.361 	1.66
Norm1Arg::OCL_NormFixture::(640x480, 8UC4, NORM_INF) 	0.212 	0.043 	4.93
Norm1Arg::OCL_NormFixture::(640x480, 8UC4, NORM_L1) 	0.081 	0.267 	0.30
Norm1Arg::OCL_NormFixture::(640x480, 8UC4, NORM_L2) 	0.167 	0.284 	0.59
Norm1Arg::OCL_NormFixture::(640x480, 32FC4, NORM_INF) 	2.248 	0.193 	11.66
Norm1Arg::OCL_NormFixture::(640x480, 32FC4, NORM_L1) 	0.873 	0.589 	1.48
Norm1Arg::OCL_NormFixture::(640x480, 32FC4, NORM_L2) 	0.812 	0.506 	1.61
Norm1Arg::OCL_NormFixture::(1280x720, 8UC1, NORM_INF) 	0.159 	0.031 	5.13
Norm1Arg::OCL_NormFixture::(1280x720, 8UC1, NORM_L1) 	0.061 	0.199 	0.31
Norm1Arg::OCL_NormFixture::(1280x720, 8UC1, NORM_L2) 	0.126 	0.212 	0.59
Norm1Arg::OCL_NormFixture::(1280x720, 32FC1, NORM_INF) 	1.688 	0.138 	12.26
Norm1Arg::OCL_NormFixture::(1280x720, 32FC1, NORM_L1) 	0.650 	0.425 	1.53
Norm1Arg::OCL_NormFixture::(1280x720, 32FC1, NORM_L2) 	0.601 	0.361 	1.66
Norm1Arg::OCL_NormFixture::(1280x720, 8UC3, NORM_INF) 	0.475 	0.101 	4.71
Norm1Arg::OCL_NormFixture::(1280x720, 8UC3, NORM_L1) 	0.179 	0.593 	0.30
Norm1Arg::OCL_NormFixture::(1280x720, 8UC3, NORM_L2) 	0.377 	0.634 	0.60
Norm1Arg::OCL_NormFixture::(1280x720, 32FC3, NORM_INF) 	5.050 	0.818 	6.17
Norm1Arg::OCL_NormFixture::(1280x720, 32FC3, NORM_L1) 	1.914 	1.266 	1.51
Norm1Arg::OCL_NormFixture::(1280x720, 32FC3, NORM_L2) 	1.786 	1.075 	1.66
Norm1Arg::OCL_NormFixture::(1280x720, 8UC4, NORM_INF) 	0.655 	0.137 	4.79
Norm1Arg::OCL_NormFixture::(1280x720, 8UC4, NORM_L1) 	0.248 	0.795 	0.31
Norm1Arg::OCL_NormFixture::(1280x720, 8UC4, NORM_L2) 	0.505 	0.847 	0.60
Norm1Arg::OCL_NormFixture::(1280x720, 32FC4, NORM_INF) 	6.731 	1.164 	5.78
Norm1Arg::OCL_NormFixture::(1280x720, 32FC4, NORM_L1) 	2.539 	1.692 	1.50
Norm1Arg::OCL_NormFixture::(1280x720, 32FC4, NORM_L2) 	2.375 	1.440 	1.65
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC1, NORM_INF) 	0.360 	0.074 	4.84
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC1, NORM_L1) 	0.136 	0.447 	0.30
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC1, NORM_L2) 	0.283 	0.475 	0.59
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC1, NORM_INF) 	3.787 	0.564 	6.72
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC1, NORM_L1) 	1.424 	0.949 	1.50
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC1, NORM_L2) 	1.334 	0.806 	1.66
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC3, NORM_INF) 	1.079 	0.343 	3.15
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC3, NORM_L1) 	0.461 	1.364 	0.34
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC3, NORM_L2) 	0.888 	1.494 	0.59
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC3, NORM_INF) 	11.359 	2.103 	5.40
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC3, NORM_L1) 	4.318 	2.836 	1.52
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC3, NORM_L2) 	3.999 	2.412 	1.66
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC4, NORM_INF) 	1.427 	0.533 	2.68
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC4, NORM_L1) 	0.604 	1.776 	0.34
Norm1Arg::OCL_NormFixture::(1920x1080, 8UC4, NORM_L2) 	1.134 	1.898 	0.60
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC4, NORM_INF) 	15.141 	2.876 	5.27
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC4, NORM_L1) 	5.868 	3.791 	1.55
Norm1Arg::OCL_NormFixture::(1920x1080, 32FC4, NORM_L2) 	5.333 	3.230 	1.65 

perf-norm.zip

@asmorkalov asmorkalov self-requested a review February 10, 2025 15:25
@fengyuentau
Copy link
Copy Markdown
Member Author

@asmorkalov I do not have access to orin now, but can you apply the following changes and run the test instead of the OCL ones?

diff --git a/modules/core/perf/perf_norm.cpp b/modules/core/perf/perf_norm.cpp
index 07f989f21c..e364192bad 100644
--- a/modules/core/perf/perf_norm.cpp
+++ b/modules/core/perf/perf_norm.cpp
@@ -14,7 +14,7 @@ typedef perf::TestBaseWithParam<Size_MatType_NormType_t> Size_MatType_NormType;
 PERF_TEST_P(Size_MatType_NormType, norm,
             testing::Combine(
                 testing::Values(TYPICAL_MAT_SIZES),
-                testing::Values(TYPICAL_MAT_TYPES),
+                testing::Values(CV_8UC1, CV_8SC1, CV_16UC1, CV_16SC1, CV_32SC1, CV_32FC1, CV_64FC1),
                 testing::Values((int)NORM_INF, (int)NORM_L1, (int)NORM_L2)
                 )
             )

I also observed some performance regressions on my host with i7-12700k cpu.

@fengyuentau
Copy link
Copy Markdown
Member Author

I also observed some performance regressions on my host with i7-12700k cpu.

Found that if cpu baseline is sse, there are performance regressions, mostly on 8-bit, 16-bit kernels. If cpu baseline is set to AVX/AVX2, this PR makes sense with some performance improvement.

@asmorkalov
Copy link
Copy Markdown
Contributor

The fix does not resole ARM regression:
jetson-orin-3.zip

@fengyuentau
Copy link
Copy Markdown
Member Author

Regression on RISC-V for some reason:

https://github.com/opencv/opencv/actions/runs/13302059428/job/37145128708?pr=26885#step:16:715

[ RUN      ] Core_Array.basic_operations
/home/ci/opencv/modules/ts/src/ts.cpp:612: Failure
Failed

	failure reason: Invalid function output
	test case #-1
	seed: 00000000000c5a61
-----------------------------------
	LOG:
4: The norms are different: 801.37103666866482854/7290.6957819068729805/1996.2299161095104409 vs 801.37103666866482854/7290.69580078125/1996.2299466744807432

-----------------------------------

[  FAILED  ] Core_Array.basic_operations (24 ms)

Tested on my SpaceMIT MUSE Pi (K1) with GCC, it was green.

@fengyuentau

This comment was marked as outdated.

Comment on lines +581 to +587
struct NormL2_SIMD<double, double> {
double operator() (const double* src, int n) const {
int j = 0;
double s = 0.f;
#if CV_RVV // This is introduced to workaround the accuracy issue on ci
s = normL2_rvv(src, n, j);
#else
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is introduced to workaround failed tests on ci of rvv node.

The RVV CI node is getting weird some accuracy issues on the normL2 with double.

But it works fine on real hardware. Could it be some qemu bugs?

Comment on lines +465 to +471
struct NormL1_SIMD<double, double> {
double operator() (const double* src, int n) const {
int j = 0;
double s = 0.f;
#if CV_RVV // This is introduced to workaround the accuracy issue on ci
s = normL1_rvv(src, n, j);
#else
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is introduced to workaround failed tests on ci of rvv node.

The RVV CI node is getting weird some accuracy issues on the normL1 with double.

But it works fine on real hardware. Could it be some qemu bugs?

@fengyuentau
Copy link
Copy Markdown
Member Author

@asmorkalov Hello, I have fixed performance on Orin with gcc == 11.4.0. See the attached file for performance.
orin-perf.zip

@fengyuentau
Copy link
Copy Markdown
Member Author

fengyuentau commented Feb 17, 2025

By the way, performance comparison with #26887 on K1.

k1.zip

k1.zip (with 8UC4 test case included)

@fengyuentau
Copy link
Copy Markdown
Member Author

@asmorkalov Let me know if there is any blocker of merging this PR.

@asmorkalov
Copy link
Copy Markdown
Contributor

The last version is much better for ARMv8. Jetson Orin does not show regressions.
jetson-orin-5.zip

@fengyuentau
Copy link
Copy Markdown
Member Author

@asmorkalov By the way, CI node "PR:4.x / Android-Test / BuildAndTest (pull_request)" seems to be broken. Should we disable this for now since you do not seem to have time to fix it?

@asmorkalov
Copy link
Copy Markdown
Contributor

Jetson tk1 (armv7) result. There are several regressions for FP32. I do not think it's critical.
jetson-tk1-2.zip

@asmorkalov asmorkalov self-assigned this Feb 18, 2025
@asmorkalov asmorkalov merged commit e2803be into opencv:4.x Feb 21, 2025
28 checks passed
NanQin555 pushed a commit to NanQin555/opencv that referenced this pull request Feb 24, 2025
core: vectorize cv::normalize / cv::norm opencv#26885

Checklist:
|      | normInf | normL1 | normL2 |
| ---- | ------- | ------ | ------ |
| bool |    -    |   -    |   -    |
| 8u   |    √    |   √    |   √    |
| 8s   |    √    |   √    |   √    |
| 16u  |    √    |   √    |   √    |
| 16s  |    √    |   √    |   √    |
| 16f  |    -    |   -    |   -    |
| 16bf |    -    |   -    |   -    |
| 32u  |    -    |   -    |   -    |
| 32s  |    √    |   √    |   √    |
| 32f  |    √    |   √    |   √    |
| 64u  |    -    |   -    |   -    |
| 64s  |    -    |   -    |   -    |
| 64f  |    √    |   √    |   √    |

*: Vectorization of data type bool, 16f, 16bf, 32u, 64u and 64s needs to be done on 5.x.

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
@asmorkalov asmorkalov mentioned this pull request Mar 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants