Add simd256 intrinsics to demosaicing#26012
Conversation
|
|
Do you have numbers from performance tests? (https://github.com/opencv/opencv/wiki/HowToUsePerfTests) |
Sure, sir. Here is it. The conversion to BGA is unsurprisingly slower. This is because most of simd's instructions are processed in 64 bits. Processing a three-channel image requires extra overhead to get rid of those extra zeros. I use the blend operation when processing storage with simd256, which is a big reason for the slowness of this operation, but I haven't come up with a better solution at the moment due to the peculiarities of three-channel storage. |
|
To visualize the difference in timing due to the different transformation types, I've shifted their positions in the table and split them up. I didn't touch any of the values except to change the position. |
|
Could you double check this case (re-measure)? |
Okay. I'll retest it later. |
|
@opencv-alalek @mshabunin I did a separate test for that target(BGR) and it doesn't seem to be performing very well. I guess I should figure out how to optimize it a bit. The extra processing done by the storage does seem a bit too much. |
|
I've tried two programs in the recent past, but neither has worked well. Eventually found this rather strange and rather violent method to be the best option. The test is partially unstable, so it occasionally goes high and low. p.s. The 1080p test was added locally by myself and was not submitted, and this code passed the accuracy tests. |
|
I made experiments with my AMD Ryzen 7 PRO 3700 and see speedup for 2GRAY conversions only. Other cases provide the same performance or worse. |
|
m.b. resolution is too small and numbers too volatile to make a decision. |
|
Just a notice, since #25968 has been merged, bayer2gray needs to adapt the new test case. |
@asmorkalov Thank you for taking the time to test, but there's something strange. In theory, 2GRAY should indeed be the fastest, as it uses the fewest instructions and has the simplest implementation. But as for the 2BGRA and 2BGR_VNG, I didn't do anything extra other than modify the boundary conditions and replace the 128-bit instructions with 256-bit instructions. They should give a slight speedup, but cannot reach 2x. (Tests show it should be around 1.5x) And as for BGR, it's the hardest to deal with, I'll need to (inevitably) use more instructions to store it, and it will probably stay the same or even a little worse in terms of efficiency. I'm at a bit of a loss as to what to do now, I think the rest of the conversions seem to have reached their theoretical limit of efficiency except for 2BGR; and 2BGR I haven't found a faster solution for. |
@FantasqueX Thank you, I will take note and modify my 2GRAY appropriately! |
|
For better performance tests stability you can increase number of test iterations ( |
Thanks, I'll try it later. |
|
Due to my busy work schedule recently, I will be handling this unfinished pr on September 7th and 8th. I apologize for the long wait. |
|
I made experiment with forced samples to 500 on my Ryzen 7 3700: Force serialized (1 thread): |
|
I'll try Intel platform, but I do not expect significant difference. |
|
@asmorkalov Thank you for testing. Would it be a little better with a higher resolution? Like 1920x1080. |
|
I do not know. It may depend on cache size. |
|
I used this command to test: and I added 1080p resolution in test cases. My Intel Core i7-11800H got the following results: Except for the Bayer2BGR problem that is indeed difficult to solve, there are some performance improvements in the rest |
|
Ok, found out the issue reason. OpenCV does not enable AVX2 (256-bit) flags by default to be cross-platform. The new branch is not used in the default build. |
|
More accurate benchmark, 4.x AVX2 vs patch AVX2: |
Is this building a 4.x patch using I'm not quite sure what happens with this, does the 4.x patch execute simd128 or non-simd function? |
|
As stated in #25019, Universal intrinsics is preferred. I just look through |
@FantasqueX This is great. I've been unable to find a performance bottleneck in I'll be working on this on October 5th & October 6th. |
|
Currently I'm observing accuracy test issue with this PR (we don't have AVX2 builders in precommit): |
Okay. I've been a little busy lately, but I'll keep this in mind. Thanks. |
|
Is the PR relevant after #26868? |
Add simd256 intrinsics to demosaicing, boasts 1.5x to 2x efficiency and passes all accuracy tests.
Related issue: #25724
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.