Impl hal_rvv LUT | Add more LUT test #26941
The CI failed because of `SANITY_CHECK`. I'm unsure whether I should switch to `SANITY_CHECK_NOTHING`: the original LUT performance test uses `SANITY_CHECK`, and the newly added tests belong to the same set of tests.
My performance results for Muse Pi v30 (gcc 14.2):
K1 (clang spacemit-v1.0.4) K1 vs RK3568:
Yes,

Since this patch does not bring a speedup for the multi-channel LUT case, I wonder whether the test is needed.

I'm not sure, because it's actually also part of the LUT test.
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Implement through the existing `cv_hal_lut` interfaces.

Add more LUT accuracy and performance tests:

- `randu` used for generating test data is broadened to make the test more robust.

Perf test done on:

Geometric mean (ms)

| Name of Test | scalar | ui | rvv | ui vs scalar (x-factor) | rvv vs scalar (x-factor) |
|---|---|---|---|---|---|
| LUT::SizePrm::320x240 | 0.248 | 0.249 | 0.052 | 1.00 | 4.74 |
| LUT::SizePrm::640x480 | 0.277 | 0.275 | 0.085 | 1.01 | 3.28 |
| LUT::SizePrm::1920x1080 | 0.950 | 0.947 | 0.634 | 1.00 | 1.50 |
| LUT_multi2::SizePrm::320x240 | 2.051 | 2.045 | 2.049 | 1.00 | 1.00 |
| LUT_multi2::SizePrm::640x480 | 2.128 | 2.134 | 2.125 | 1.00 | 1.00 |
| LUT_multi2::SizePrm::1920x1080 | 7.397 | 7.380 | 7.390 | 1.00 | 1.00 |
| LUT_multi::SizePrm::320x240 | 0.715 | 0.747 | 0.154 | 0.96 | 4.64 |
| LUT_multi::SizePrm::640x480 | 0.741 | 0.766 | 0.257 | 0.97 | 2.88 |
| LUT_multi::SizePrm::1920x1080 | 2.766 | 2.765 | 1.925 | 1.00 | 1.44 |

This optimization is achieved by loading the entire lookup table into vector registers. Due to register size limitations, the optimization is only effective under the following conditions:
- `vlen >= 256`
- `vlen >= 512`
- `vlen >= 1024`

Since I don't have real hardware with `vlen > 256`, the corresponding accuracy tests were conducted on QEMU built from the `riscv-collab/riscv-gnu-toolchain`.

This patch does not implement optimizations for multi-channel tables.
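The three `vlen` thresholds are consistent with a simple capacity calculation. The sketch below is only an illustration and not code from this patch. It assumes a 256-entry table (one entry per 8-bit index), element sizes of 1, 2 and 4 bytes, and that the whole table must fit in a single LMUL=8 register group. The helper names `registerGroupBytes` and `minVlenForTable` are hypothetical.

```cpp
#include <cstddef>

// Bytes held by one vector register group: each register is vlenBits wide,
// and a group at a given LMUL spans `lmul` registers.
constexpr std::size_t registerGroupBytes(std::size_t vlenBits, std::size_t lmul) {
    return vlenBits / 8 * lmul;
}

// Smallest power-of-two VLEN (in bits) at which a table of `entries`
// elements of `elemBytes` bytes fits in one LMUL=8 register group.
// Assumption: the search starts at 128 bits, used here as the lower bound.
constexpr std::size_t minVlenForTable(std::size_t entries, std::size_t elemBytes) {
    std::size_t vlen = 128;
    while (registerGroupBytes(vlen, 8) < entries * elemBytes) {
        vlen *= 2;
    }
    return vlen;
}

// Assumed table shape: 256 entries of 1, 2 or 4 bytes each.
static_assert(minVlenForTable(256, 1) == 256, "8-bit elements");
static_assert(minVlenForTable(256, 2) == 512, "16-bit elements");
static_assert(minVlenForTable(256, 4) == 1024, "32-bit elements");
```

Under these assumptions the required `vlen` doubles with the element size, because an LMUL=8 group holds exactly `vlen` bytes.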
Previous attempts:

- With `vlen = 128`, it is possible to use four `u8m4` vectors to load the entire table, perform gathering, and merge the results. However, the performance is almost the same as the scalar version.
- Using `vluxei8` as a general solution does not show any performance improvement.

Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.