Skip to content

Impl hal_rvv LUT | Add more LUT test#26941

Merged
asmorkalov merged 3 commits intoopencv:4.xfrom
GenshinImpactStarts:lut_hal_rvv
Mar 6, 2025
Merged

Impl hal_rvv LUT | Add more LUT test#26941
asmorkalov merged 3 commits intoopencv:4.xfrom
GenshinImpactStarts:lut_hal_rvv

Conversation

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor

Implement through the existing cv_hal_lut interfaces.

Add more LUT accuracy and performance tests:

  • Accuracy test: Multi-channel table tests are added, and the boundary of randu used for generating test data is broadened to make the test more robust.
  • Performance test: Multi-channel input and multi-channel table tests are added.

Perf test done on

  • MUSE-PI (vlen=256)
  • Compiler: gcc 14.2 (riscv-collab/riscv-gnu-toolchain Nightly: December 16, 2024)
$ opencv_test_core --gtest_filter="Core_LUT*"
$ opencv_perf_core --gtest_filter="SizePrm_LUT*" --perf_min_samples=300 --perf_force_samples=300
Geometric mean (ms)

         Name of Test          scalar   ui    rvv       ui        rvv    
                                                        vs         vs    
                                                      scalar     scalar  
                                                    (x-factor) (x-factor)
LUT::SizePrm::320x240          0.248  0.249  0.052     1.00       4.74   
LUT::SizePrm::640x480          0.277  0.275  0.085     1.01       3.28   
LUT::SizePrm::1920x1080        0.950  0.947  0.634     1.00       1.50   
LUT_multi2::SizePrm::320x240   2.051  2.045  2.049     1.00       1.00   
LUT_multi2::SizePrm::640x480   2.128  2.134  2.125     1.00       1.00   
LUT_multi2::SizePrm::1920x1080 7.397  7.380  7.390     1.00       1.00   
LUT_multi::SizePrm::320x240    0.715  0.747  0.154     0.96       4.64   
LUT_multi::SizePrm::640x480    0.741  0.766  0.257     0.97       2.88   
LUT_multi::SizePrm::1920x1080  2.766  2.765  1.925     1.00       1.44  

This optimization is achieved by loading the entire lookup table into vector registers. Due to register size limitations, the optimization is only effective under the following conditions:

  • For the U8C1 table type, the optimization works when vlen >= 256
  • For U16C1, it works when vlen >= 512
  • For U32C1, it works when vlen >= 1024

Since I don’t have real hardware with vlen > 256, the corresponding accuracy tests were conducted on QEMU built from the riscv-collab/riscv-gnu-toolchain.

This patch does not implement optimizations for multi-channel tables.

Previous attempts:

  1. For the U8C1 table type, when vlen = 128, it is possible to use four u8m4 vectors to load the entire table, perform gathering, and merge the results. However, the performance is almost the same as the scalar version.
  2. Loading part of the table and repeatedly loading the source data is faster for small sizes. But as the table size grows, the performance quickly degrades compared to the scalar version.
  3. Using vluxei8 as a general solution does not show any performance improvement.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor Author

The CI failed due to the SANITY_CHECK. I'm unsure whether I should use SANITY_CHECK_NOTHING, as the original LUT performance test uses SANITY_CHECK, and these newly added tests are the same set of tests.

Copy link
Copy Markdown
Contributor

@mshabunin mshabunin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to decide what to do with parallel HAL functions.

UPD: after discussion we decided to allow cv::parallel_for for now until more appropriate solution is available.

@asmorkalov
Copy link
Copy Markdown
Contributor

My performance results for Muse Pi v30 (gcc 14.2):

LUT::SizePrm::320x240 	0.251 	0.053 	4.71
LUT::SizePrm::640x480 	0.287 	0.088 	3.26
LUT::SizePrm::1920x1080 	0.946 	0.646 	1.46
LUT_multi2::SizePrm::320x240 	2.063 	2.056 	1.00
LUT_multi2::SizePrm::640x480 	2.141 	2.145 	1.00
LUT_multi2::SizePrm::1920x1080 	7.457 	7.397 	1.01
LUT_multi::SizePrm::320x240 	0.750 	0.154 	4.86
LUT_multi::SizePrm::640x480 	0.775 	0.255 	3.04
LUT_multi::SizePrm::1920x1080 	2.691 	1.959 	1.37 

@mshabunin mshabunin removed the RFC label Mar 2, 2025
@fengyuentau
Copy link
Copy Markdown
Member

fengyuentau commented Mar 3, 2025

K1 (clang spacemit-v1.0.4)

LUT::SizePrm::320x240            0.290   0.054    5.42
LUT::SizePrm::640x480            0.309   0.093    3.30
LUT::SizePrm::1920x1080          1.053   0.643    1.64
LUT_multi2::SizePrm::320x240     1.948   1.954    1.00
LUT_multi2::SizePrm::640x480     2.032   2.056    0.99
LUT_multi2::SizePrm::1920x1080   7.110   7.195    0.99
LUT_multi::SizePrm::320x240      0.860   0.156    5.50
LUT_multi::SizePrm::640x480      0.892   0.263    3.39
LUT_multi::SizePrm::1920x1080    3.075   1.909    1.61

K1 vs RK3568:

LUT::SizePrm::320x240           0.249   0.054    4.66
LUT::SizePrm::640x480           0.266   0.093    2.85
LUT::SizePrm::1920x1080         1.814   0.643    2.82
LUT_multi2::SizePrm::320x240    1.258   1.954    0.64
LUT_multi2::SizePrm::640x480    1.309   2.056    0.64
LUT_multi2::SizePrm::1920x1080  8.991   7.195    1.25
LUT_multi::SizePrm::320x240     0.754   0.156    4.82
LUT_multi::SizePrm::640x480     0.794   0.263    3.01
LUT_multi::SizePrm::1920x1080   5.397   1.909    2.83

LUT_multi2 does not have speedup. Is this expected?

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor Author

Yes, LUT_multi2 is a new test where both the input and the table are multi-channel, and in this case, I don't apply any optimization.

@fengyuentau
Copy link
Copy Markdown
Member

Yes, LUT_multi2 is a new test where both the input and the table are multi-channel, and in this case, I don't apply any optimization.

Since this patch does not bring speedup for multi-channel lut case, I wonder whether the test is needed.

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor Author

I'm not sure, because it's actually also part of the LUT test.

GenshinImpactStarts and others added 3 commits March 6, 2025 09:00
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
@asmorkalov asmorkalov merged commit 57a78cb into opencv:4.x Mar 6, 2025
24 of 28 checks passed
@asmorkalov asmorkalov mentioned this pull request Mar 11, 2025
@GenshinImpactStarts GenshinImpactStarts deleted the lut_hal_rvv branch March 12, 2025 07:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants