[GSOC] Speeding-up AKAZE, part #2#8951
Conversation
|
OK, we merged part 1 (#8869). Let's continue here. |
|
Thanks. Sorry for the confusion, I should have added more description to clarify the situation. I will rebase for sure. |
|
How much does it speed up the algorithm on your machine? |
|
This is work in progress. I will update the description with my measurements. The current improvement is minor. edit: description updated, please see the stats above. The current speedup is ~1.7x. |
|
Rebased due to a merge conflict. The commit hashes are now different from what is reported above in the perf stats (I will fix that in the future).
* now tests use images of 600x768, 900x600 and 1385x700 to cover different resolutions
* this takes 84% of the time of Feature_Detection
* run everything in parallel
* compute Scharr kernels just once
* compute sigma more efficiently
* allocate all matrices in the evolution without zeroing
* add Lflow and Lstep to evolution as in original AKAZE code
Integrated a faster function from https://github.com/h2suzuki/fast_akaze
* improved readability for people familiar with OpenCV
* do not process the same image twice in the base level
* use one-pass stencil for diffusivity from https://github.com/h2suzuki/fast_akaze
* improve locality in Create_Scale_Space
* this needs to be computed always, as we need derivatives while computing descriptors
* fixed tests of AKAZE with KAZE descriptors which had been affected by this

Currently it computes all first and second order derivatives together with the determinant of the Hessian. For descriptors it would be enough to compute just the first order derivatives, but it is probably not worth optimizing for the scenario where descriptors and keypoints are computed separately, since that is already very inefficient. When computing keypoints and descriptors together it is faster to do it the current way (it preserves locality).
* get rid of sharing buffers when creating the scale space pyramid, the performance impact is negligible
* ensures more stable output
* more reasonable profiles, since the first call of parallel_for_ is not taking a big performance hit
* no need to go through the data twice
* fixed bug that prevented computing the determinant for a scale pyramid of size 1 (just the base image)
* all descriptors now support writing to uninitialized memory
* use InputArray and OutputArray for the input image and descriptors, which allows us to make use of a UMat that the user passes in
* all parts that use ocl-enabled functions should use OCL by now
* when OCL is disabled, the IPP version should always be preferred (even when the dst is a UMat)
* this slows the CPU version considerably
* do not run in parallel when running with OCL
|
I have evaluated the option of using CV_8U for images and derivatives in AKAZE. It does not seem to be a viable path; the precision is affected badly. In our tests only 40 keypoints out of 507 were found. A viable option might be to use half-precision floats when they become widely available. |
* diffusivity itself is not a blocker, but this saves us downloading and uploading derivatives
|
It was worth a shot! |
|
I have finally got the perf measurements for the OCL version on a GRID K520 NVIDIA card with OpenCL 1.2. The performance as of now is pretty bad, much slower than the CPU version. Nevertheless, I wasn't able to reproduce the test failure that occurs on the Linux OCL buildbot. There is a bug in computing keypoint orientations and computing descriptors, which causes matrices to be downloaded again and again for each keypoint. I need to fix this and then the times will be back to reasonable. Apart from this bug, there are a lot of transfers between CPU and GPU while building the scale pyramid. I'm working on porting fast explicit diffusion to the GPU, so that almost the whole pyramid can be computed on the GPU. Some OpenCL functions (GaussianBlur, Scharr) execute non-optimal OCL paths; this will be subject to fine-tuning later. In the current state they are slower than their IPP equivalents (which is bad). |
we don't want to download matrices ad hoc from the GPU when a function in AKAZE needs them. There is a HUGE mapping overhead, and without shared memory support a LOT of unnecessary transfers. This maps/downloads matrices just once.
* this was causing spurious segfaults in stitching tests due to propagation of NaNs
* added new test which checks for NaNs (added new debug asserts for NaNs)
* valgrind now says everything is ok
|
The builders are green again. I have spent this day debugging with valgrind and gdb to hunt down the bug that was failing the builder. Initially there was uninitialized memory in just four pixels in the corners; it spread through the pyramid and messed up the results. It also caused crashes via segfaults if the uninitialized memory could be interpreted as float NaNs. This was quite a hard bug to track down, because it was just 4 pixels that were uninitialized, so it did not cause too many problems. The bug is also highly dependent on the selected allocator, which is why it was causing problems only with OpenCL. I'm not sure why it did not cause any problem for Windows OpenCL. I have also fixed the other bug with OpenCL, which caused matrices to be downloaded from the GPU multiple times. OpenCL times are now back to reasonable, although not really fast. After fixing those 2 bugs, CPU times are a bit worse, but nothing horrible. I'll look into that and see if I can make them better without breaking OpenCL again. I have an OpenCL kernel for non-linear diffusion prepared; after it is deployed, the whole pyramid construction can be done on the GPU. |
* Lt in the pyramid changed to UMat, it will be downloaded from the GPU along with Lx, Ly
* fix bug in the pm_g2 kernel. OpenCV mangles the dimensions passed to OpenCL, so we need to check for boundaries in each OCL kernel.
* computing of the determinant is not a blocker, but with this change we don't need to download all spatial derivatives to the CPU, we only download the determinant
* make Ldet in the pyramid a UMat, download it from the GPU together with the other parts of the pyramid
* add profiling macros
|
I'm finished with basic OpenCL support in AKAZE. Creation of the scale space pyramid runs almost fully on the GPU (except computing the k factor, which runs just once before constructing the pyramid). For computing keypoints and descriptors, OCL is not supported. Supporting OCL for the remaining parts might be interesting only after the creation of the pyramid becomes faster, so that the remaining parts become the bottleneck. The current OCL performance is not very good. GaussianBlur, Scharr and sepFilter2D all execute non-optimal OCL paths; especially GaussianBlur and Scharr are slower compared to the IPP versions. This will need to be optimized. Performance results with NVIDIA GRID K520: The same machine without OpenCL (8 cores): |
|
I have also tried the current OCL version on Intel hardware (Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz). The OCL implementation is slower than the CPU one, partly because of the unoptimized paths for Gaussian and Scharr and partly just because the Intel GPU is slow. The Intel GPU seems to execute a better path for sepFilter2D; later I will try to get the same or better on NVIDIA too. The first test is influenced by kernel compilation time. The same machine with OCL disabled: CPU speedups up to 2.8x look nice for the current code. |
* TEvolution is used only in KAZE now
|
This PR is now concluded as everything in the work package has been completed. |
|
👍 |
    Mat mask;
    vector<KeyPoint> points;
    // initialize task scheduler for TBB
    cv::setNumThreads(cv::getNumberOfCPUs());
this is useful to get consistent results with instrumentation (the first parallel function does not take the initialization penalty). But if you don't like this hack, it can be removed; just the timing will be less stable.
I believe this should be done in the ts module: #9278
You are right, I have reverted this change. With #9278 it'll be fine.
This reverts commit ba81e2a.
|
Anything more I should fix? |
|
👍 |
This part focuses on performance improvements for AKAZE on CPU and implementing basic OCL support.
cc: @bmagyar
OpenCL:
some parts (mainly construction of the pyramid) execute OpenCL paths.
* `Create_Nonlinear_Scale_Space`: almost all operations are OpenCL enabled
* `Compute_Determinant_Hessian_Response`: executes OpenCL generic sepFilter2D; computing the determinant would need a custom kernel, but is not currently a blocker
* `Feature_Detection`: no OCL
* `Compute_Keypoints_Orientation`: no OCL
* `Compute_Descriptors`: no OCL

CPU status:
* `Create_Nonlinear_Scale_Space`: reworked, some intrinsics might help with diffusion
* `Compute_Determinant_Hessian_Response`: reworked, needs specialized fine tuning for 5x5, 7x7 and 9x9 kernels
* `Feature_Detection`: will be reworked together with the GPU part, we might want the same format of keypoints on GPU and CPU
* `Compute_Keypoints_Orientation`: reworked
* `Compute_Descriptors`: not a blocker

The main parts (and largest bottlenecks) are:
* `Create_Nonlinear_Scale_Space`: ~19%, ~21 ms
* `Feature_Detection`: ~52%, ~60 ms, of which:
  * `Compute_Determinant_Hessian_Response`: 83% (over 43% of the whole algorithm)
  * `Find_Scale_Space_Extrema`: 16% (globally just 8%)
  * `Do_Subpixel_Refinement`: 1%
* `Compute_Keypoints_Orientation`: ~21%, ~24 ms
* `Compute_Descriptors`: ~4%, ~4.8 ms

Improvement in `Compute_Determinant_Hessian_Response` is of course not very satisfying. The main problem there is that AKAZE uses Scharr with a non-standard kernel size (different from 3x3). Implementing Scharr with sepFilter2D is much slower than the specialized `cv::Scharr` we have.

I have tried to replace the `sepFilter2D` implementation with `cv::Scharr`, just to get an idea of the possible speedup. It is in a separate branch. With `cv::Scharr`, `Compute_Determinant_Hessian_Response` went from ~47 ms to ~11 ms, which is quite significant. However, it is not possible to replace Scharr just like that, so the tests are failing. I have opened pablofdezalc/akaze#32 to get some additional info on this.
Performance stats per commit:
I'm testing all CPU performance on Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.

If you want to reproduce these results on your own machine, you might find this performance evaluation script helpful. It automates running perf tests for this pull request across many revisions; it generates this summary along with XML test outputs and the instrumentation output shown below.
Instrumentation output:
INITIAL:
CURRENT (ba071d1):
Failed branches:
These are branches that contain code that is faster, but not suitable for inclusion into the main branch (there might be failing tests etc.):

* test_scharr: I have tried to replace the Scharr operator in `Compute_Determinant_Hessian_Response` with Scharr with a fixed 3x3 kernel.
* akaze_octaves: Reworked the non-linear scale space pyramid so that diffusivity is propagated only inside octaves. Probably not worth it, since it damages accuracy.