Find contours speedup #26834

Merged
asmorkalov merged 19 commits into opencv:4.x from chacha21:findContours_speedup
Mar 12, 2025
Conversation

@chacha21
Contributor

@chacha21 chacha21 commented Jan 23, 2025

This pull request is still incomplete.
It is an attempt, as suggested in #26775, to restore the speed that was lost when the findContours() implementation was migrated from C to C++.

The patch adds an "Arena" (a pool) of pre-allocated memory so that contour points (and TreeNodes) can be picked from the arena.
The code of findContours() is mostly unchanged; the arena usage is implicit through a utility class Arena::Item that provides C++ overloaded operators and construct/destruct logic.

As mentioned in #26775, the contour points are allocated and released in order, and can be represented by ranges of indices in their arena. No range subset will ever be released in a way that drills a hole, which is why the internal representation as a range of indices makes sense.

The TreeNodes use another Arena class that does not follow that range logic.
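As a rough illustration of this range-based scheme (the names below are hypothetical, not the PR's actual API), an arena that hands out ranges of indices and only reclaims memory from its tail could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: contour points are allocated strictly in order,
// so each contour maps to a [first, last) range of indices into one
// flat arena, and only the tail range can actually be freed.
template <typename T>
struct RangeArena {
    std::vector<T> storage;

    // Append 'count' default-constructed items; return their index range.
    std::pair<std::size_t, std::size_t> allocate(std::size_t count) {
        const std::size_t first = storage.size();
        storage.resize(first + count);
        return {first, first + count};
    }

    // A release only reclaims memory when the range is the current tail,
    // so no hole is ever drilled in the middle of the arena.
    bool release(const std::pair<std::size_t, std::size_t>& range) {
        if (range.second != storage.size())
            return false; // not the tail: no-op
        storage.resize(range.first);
        return true;
    }
};
```

This is only a sketch of the allocation discipline described above, not the PR's implementation.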

Currently, there is a significant run-time improvement on the test mentioned in #26775, but it is still far from the findContours_legacy() performance.

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@chacha21 chacha21 force-pushed the findContours_speedup branch from bb242cc to 085e8e3 Compare January 23, 2025 20:55
@asmorkalov asmorkalov changed the title Find contours speedup WIP: Find contours speedup Jan 24, 2025
@asmorkalov asmorkalov requested a review from mshabunin January 24, 2025 06:09
@mshabunin
Contributor

mshabunin commented Jan 25, 2025

I've tested the "updated code" (after) vs "current 4.x" (new) vs "4.9" (before). I also profiled with callgrind to check the number of memory allocations. Each measurement took 50 iterations using the reproducer from the original issue.

Name of Test                        before   new      after    new vs before (x-factor)   after vs before (x-factor)
findContours_big::TestFindContours  41.837   75.655   61.264   0.55                       0.68
malloc calls                        200K     23M      13M

Profiling with callgrind shows that the updated code still has numerous malloc calls. I was able to reduce this number even more (by ~3M or by ~6M, I'm not sure) by disabling the index_mapping access in the cv::contourTreeToResults function, and got performance of ~45 ms - comparable with the old variant.

Perhaps we should use storage with block allocation in your arena, something similar to the old C-API MemStorage. That way we would avoid the vector reallocations which currently happen.
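A minimal sketch of what such block-based storage could look like (illustrative only; the names are hypothetical, not the PR's eventual BlockStorage): growing appends a new fixed-size block instead of reallocating and copying the whole buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical block storage: O(1) push_back and random access,
// and growth never moves previously stored elements.
template <typename T, std::size_t BLOCK_SIZE = 1024>
class BlockStorage {
public:
    void push_back(const T& value) {
        if (count_ == blocks_.size() * BLOCK_SIZE)
            blocks_.emplace_back(new T[BLOCK_SIZE]); // add a block, no copy
        blocks_[count_ / BLOCK_SIZE][count_ % BLOCK_SIZE] = value;
        ++count_;
    }
    const T& operator[](std::size_t i) const {
        return blocks_[i / BLOCK_SIZE][i % BLOCK_SIZE];
    }
    std::size_t size() const { return count_; }
private:
    std::vector<std::unique_ptr<T[]>> blocks_;
    std::size_t count_ = 0;
};
```

Only the vector of block pointers ever reallocates, which is cheap compared to moving the stored elements themselves.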

@chacha21
Contributor Author

Perhaps we should use storage with block allocation in your arena, something similar to the old C-API MemStorage. That way we would avoid the vector reallocations which currently happen.

contourTreeToResults() is called once, so there is certainly no improvement from pre-allocating the memory needed for index_mapping. A stack allocation would help, but the current test case has a tree.size() of 127384 elements, which is too large for that.

the stack does not use a growing capacity; a vector should perform fewer allocations
Avoid allocation in TreeIterator by using specific storage backed by an arena
No visible improvement on the test case
The values stored in the `levels` member of TreeIterator seem to be mostly contiguous indices. Instead of allocating each index, we can just use ranges. It saves a lot of allocations.
Still no visible performance improvement on the test case.
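The ranges idea from the commit above can be sketched like this (illustrative code, not the PR's vectorOfRanges): consecutive pushed values collapse into (start, length) runs, so a mostly-contiguous sequence costs far fewer stored entries and allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical run-length-compressed stack of ints.
class RangeStack {
public:
    void push(int value) {
        if (!runs_.empty() && value == runs_.back().first + runs_.back().second)
            ++runs_.back().second;       // extends the current contiguous run
        else
            runs_.push_back({value, 1}); // starts a new run
        ++size_;
    }
    int top() const {
        const std::pair<int, int>& r = runs_.back();
        return r.first + r.second - 1;
    }
    void pop() {
        if (--runs_.back().second == 0)
            runs_.pop_back();
        --size_;
    }
    std::size_t size() const { return size_; }
private:
    std::vector<std::pair<int, int>> runs_; // (start value, run length)
    std::size_t size_ = 0;
};
```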
@mshabunin
Contributor

contourTreeToResults() is called once, so there is certainly no improvement from pre-allocating the memory needed for index_mapping

Did you measure the performance difference with the index_mapping update on your platform? For me it gave a 60 ms -> 45 ms improvement on Linux/GCC.

Perhaps we should use storage with block allocation in your arena, something similar to the old C-API MemStorage. That way we would avoid the vector reallocations which currently happen.

This comment was related to the points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have a big impact on performance.

@chacha21
Contributor Author

On my machine (Win7, VS2019)

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms

Perhaps we should use storage with block allocation in your arena, something similar to the old C-API MemStorage. That way we would avoid the vector reallocations which currently happen.
This comment was related to the points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have a big impact on performance.

OK, I misunderstood it as "preallocate index_mapping memory".

On my host machine, it is still not that close to legacy.

@mshabunin
Contributor

Please take a look at this commit: 5f35ea3
This is the block storage I was writing about, though I'm not sure how to make it more concise.
On my machine it runs in 42 ms vs 40 ms for the legacy function. The number of malloc calls is 275K vs 185K. There is still some room for improvement (for example, perhaps we don't need to store a reference to the storage in each contour, but could pass it as an argument to the corresponding methods), but overall it looks much better.

@chacha21
Contributor Author

chacha21 commented Jan 26, 2025

Please take a look at this commit: 5f35ea3

[EDIT]: test_imgproc fails with this commit
OK, it fails because the modifications are not backported to approximateChainTC89

What I understand:

  • this commit is about replacing the Arena (which has a policy of "fixed size, then dynamic allocation when the static capacity is reached") with a structure performing incremental dynamic allocations by blocks.
  • I can see that you tested a static size of 600000 bytes for the arena, because the current test case generates a high number of points. Of course, it would not be a good idea to allocate that much memory for the standard case, so that approach seems doomed, and the block storage might be better in general.
  • the block storage is not conceptually very different from a vector: it performs dynamic allocations as it grows.
  • the difference is that a vector does not assume that realloc() might work, and will always copy all of its previous content when resized, while the block storage just adds new blocks
  • your BlockStorage is similar to the vectorWithArena that I tried previously for TreeIterator::levels (but vectorWithArena allocated blocks from an arena instead of the standard new)
  • for the current test case, which requires a high amount of memory, a small arena with a fixed stack-allocated buffer won't help, but it could still be useful for the general case.
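The copy-on-growth behavior of std::vector mentioned above can be observed directly with a small relocation counter (a self-contained illustration, not PR code):

```cpp
#include <cassert>
#include <vector>

// Counts how often elements are copied or moved: std::vector relocates
// its whole content every time its capacity grows.
struct MoveCounter {
    static int relocations;
    int v = 0;
    MoveCounter() = default;
    explicit MoveCounter(int x) : v(x) {}
    MoveCounter(const MoveCounter& o) : v(o.v) { ++relocations; }
    MoveCounter(MoveCounter&& o) noexcept : v(o.v) { ++relocations; }
    MoveCounter& operator=(const MoveCounter&) = default;
    MoveCounter& operator=(MoveCounter&&) = default;
};
int MoveCounter::relocations = 0;
```

A block storage holding the same elements would keep the relocation count at zero, since existing blocks are never moved.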

I will test the performance with your commit on my host machine ASAP.

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms

I think that I will try to evolve towards a BlockStorageWithArena for the general case.
Another idea: actually, I am not sure that contour points should be stored as int2(x,y); int(offset) might be better. The width information that allows the offset -> (x,y) conversion can be stored once in the context. I am curious whether the saved memory will result in a speed-up. There will be an overhead when converting to contour results, since a mere memcpy becomes impossible after that.

@mshabunin
Contributor

mshabunin commented Jan 27, 2025

As I understand it, this algorithm requires memory storage with:

  1. O(1) push_back
  2. O(1) random access
  3. elements for each contour are added sequentially
  4. minimum number of allocations and deallocations
  • std::vector satisfies 1-3, but not 4, due to reallocations.
  • std::deque should work like this (1-3), but it seems that it uses quite small blocks (especially in MSVC: 16 bytes or 1 element), so 4 is not satisfied either

I thought about using std::deque with a custom allocator and larger blocks, but it seems it would make things too complicated.
Do we really need the arena in this case? Wouldn't it add extra overhead for keeping items?

I am not sure that contour points should be stored as int2(x,y), but rather int(offset)

Not sure if I understand. Wouldn't offset need to be 2D?

@chacha21
Contributor Author

chacha21 commented Jan 27, 2025

As I understand this algorithm requires memory storage with : [...]
* std::vector satisfies 1-3, but not 4, due to reallocations.
* std::deque should work like this (1-3), but it seems that it uses quite small blocks (especially in MSVC: 16 bytes or 1 element), so 4 is not satisfied either
I thought about using std::deque with a custom allocator and larger blocks, but it seems it would make things too complicated.

I think you are right. We should not rely on a structure whose inner tuning depends heavily on the toolchain. Your BlockStorage lets us control the amount of memory reliably.

Do we really need arena in this case? Wouldn't it add extra overhead for keeping items?

The idea of the arena is just to provide fast memory on demand; I don't think that my ArenaStackBuffer has much overhead.
However, even better, I think that you could just tune your BlockStorage with one static buffer holding the first blocks (which will be allocated on the stack if the BlockStorage itself is stack-allocated before creating the ContourScanner). When more blocks are needed, dynamic allocation will occur.
No need to delegate that to an arena.
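A sketch of that suggestion (hypothetical names, not the PR's code): the first elements live in a member array, so they sit on the stack whenever the storage object itself does, and heap blocks are only allocated on overflow.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical hybrid storage: a static first buffer, then heap blocks.
template <typename T, std::size_t STATIC_CAPACITY = 256, std::size_t BLOCK_SIZE = 1024>
class HybridStorage {
public:
    void push_back(const T& value) {
        if (count_ < STATIC_CAPACITY) {
            staticBlock_[count_++] = value; // no dynamic allocation at all
            return;
        }
        const std::size_t i = count_ - STATIC_CAPACITY;
        if (i == heapBlocks_.size() * BLOCK_SIZE)
            heapBlocks_.emplace_back(new T[BLOCK_SIZE]); // overflow: heap block
        heapBlocks_[i / BLOCK_SIZE][i % BLOCK_SIZE] = value;
        ++count_;
    }
    const T& operator[](std::size_t i) const {
        if (i < STATIC_CAPACITY)
            return staticBlock_[i];
        const std::size_t j = i - STATIC_CAPACITY;
        return heapBlocks_[j / BLOCK_SIZE][j % BLOCK_SIZE];
    }
    std::size_t size() const { return count_; }
private:
    std::array<T, STATIC_CAPACITY> staticBlock_{};
    std::vector<std::unique_ptr<T[]>> heapBlocks_;
    std::size_t count_ = 0;
};
```

Small inputs then cost zero heap allocations, while large inputs degrade gracefully to block-wise growth.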

I am not sure that contour points should be stored as int2(x,y), but rather int(offset)
Not sure if I understand. Wouldn't offset need to be 2D?

Yes, but as long as we keep track of the width, offset = y*width + x, so (x, y) = (offset % width, offset / width). This is just a trick to save half the memory used to store one point (at the cost of the offset <-> (x, y) conversion overhead here and there).
(And we ignore int overflow for very large images.)
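A minimal sketch of this encoding (the helper names are illustrative):

```cpp
#include <cassert>

// Pack a 2D point into one int offset, given the image width,
// and unpack it again (int overflow for huge images is ignored).
struct Pt { int x, y; };

inline int toOffset(const Pt& p, int width) {
    return p.y * width + p.x;
}
inline Pt fromOffset(int offset, int width) {
    return { offset % width, offset / width };
}
```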

removed all arenas in favor of BlockStorage
added static blocks to BlockStorage
added range iteration for block storage to optimize copying data out of it
made BlockStorage allocated on the stack before ContourScanner instantiation
tested storage of points as int(offset), but it was slower on the test cases
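The "range iteration" commit above can be pictured as follows (illustrative, not the PR's actual API): copying out of block storage proceeds one contiguous block chunk at a time, so a memcpy per chunk replaces a loop per element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy 'count' elements starting at logical index 'first' out of
// storage made of fixed-size blocks, one contiguous chunk at a time.
template <typename T>
void copyOut(const std::vector<std::vector<T>>& blocks, std::size_t blockSize,
             std::size_t first, std::size_t count, T* dst) {
    while (count > 0) {
        const std::size_t b = first / blockSize;
        const std::size_t off = first % blockSize;
        const std::size_t n = std::min(blockSize - off, count);
        std::memcpy(dst, blocks[b].data() + off, n * sizeof(T)); // whole chunk
        dst += n; first += n; count -= n;
    }
}
```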
@chacha21
Contributor Author

chacha21 commented Jan 27, 2025

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges : ~139ms

a std::vector<char> might not be optimal to store the codes (for instance, it always allocates 16 bytes, even when empty), so I tried different things.
Nothing was better so far.
even unused and empty, std::vector<schar> codes allocates memory
using a specific ContourCodesStorage similar to ContourPointsStorage helps save that allocation
@chacha21
Contributor Author

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges : ~139ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges + codes also stored in a BlockStorage: ~133ms

removed harmless debugging code, raising warning with some compilers
@chacha21
Contributor Author

chacha21 commented Jan 28, 2025

I ran the standard perf tests to see if there are any regressions.
It's hard to compare old vs new perf stats for perf_imgproc (--gtest_filter=*indContours*), since the number of samples per test varies from run to run.
What's the good practice? (I can't run Python.)

@mshabunin
Contributor

The number of samples is chosen from a predefined range (10-100 by default), taking into account the relative performance measurement deviation (median vs mean) and the overall test execution time (which should not exceed 3 sec by default), and maybe something else.

Use the following options to tune the parameters:

  • --perf_min_samples / --perf_force_samples - to modify the range
  • --perf_max_deviation - to change the deviation threshold
  • --perf_time_limit - to change the execution time limit

The simplest variant would be to set a fixed number of samples using the same value for min_samples and max_samples.


BTW, I used this basic test to reproduce the issue from the original ticket and compare the legacy and new implementations:

#include "perf_precomp.hpp"
#include "opencv2/imgproc/detail/legacy.hpp"

namespace opencv_test { namespace {

PERF_TEST(TestFindContours, findContours_big)
{
    Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
    cv::Mat mask;
    inRange(img, 110, 120, mask);
    std::vector<std::vector<cv::Point>> v_cont;
    TEST_CYCLE() cv::findContours(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    SANITY_CHECK_NOTHING();
}

PERF_TEST(TestFindContours, findContours_big_old)
{
    Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
    cv::Mat mask;
    inRange(img, 110, 120, mask);
    std::vector<std::vector<cv::Point>> v_cont;
    TEST_CYCLE() cv::findContours_legacy(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    SANITY_CHECK_NOTHING();
}


} } // namespace

@chacha21
Contributor Author

OK, I ran the perf tests for *indContours* with current 4.x vs the current PR.
There is probably no significant difference, but the PR timings are almost always slightly better.
Sometimes (it differs from run to run) some tests show a <= 2 ms regression, but it's probably noise.
In my opinion, this is an overall improvement.

@chacha21
Contributor Author

chacha21 commented Jan 31, 2025

There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case, so I wonder if I should commit it or if it is just noise at this point.
The idea is that a referenced part of the block storage (a pair <first,last>) could be asked to be "released", giving it a chance to be really freed if it happens to be the tail of the block storage.

  • in that case, the block storage can just do size -= (last-first)
  • but it is unclear whether the block storage must free such released memory (if a whole block can be freed) or keep it as remaining capacity (certainly the best choice)
  • I assume that adding size_t capacity() const, void resize() {dangerous if there are referenced areas}, and void shrink_to_fit() are standard operations one could expect
  • the "release part" operation would be if (last == storage.size()) storage.resize(storage.size()-(last-first))

There is also the temptation (bait?) to mimic an allocator by

  • adding an std::pair<size_t, size_t> alloc(size) that would usually append memory like the current push_back()
  • adding a bool release(const std::pair<size_t, size_t>&) that would use the "release tail if possible" logic, and be a no-op otherwise
  • adding the ability to also mark released areas at the head (like a double-ended buffer), so that future allocations that fit could reuse that memory: (storage start)...(head)(data)...(data)(tail)...(storage end)

And also

  • adding a writeTo(size_t first, size_t size, const T* data) that would encapsulate the RangeIterator logic instead of making it public.
  • adding a readFrom(size_t first, size_t size, T* data)

That's not a lot of work, but does it belong in this PR?

@mshabunin
Contributor

I'm observing a slight performance degradation in the original scenario with the current version: PR ~48 ms, legacy ~42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?

There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case.

I believe we don't need to add functionality we don't use right now. It could be useful to move the block memory storage to separate file(s). Eventually we might need to move it to the core module (as a private interface, like BufferArea), extend its functionality, and cover it with tests.

@chacha21
Contributor Author

chacha21 commented Feb 4, 2025

I'm observing a slight performance degradation in the original scenario with the current version: PR ~48 ms, legacy ~42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?

I just performed the test:

On the machine where I developed the PR, Windows 7, vs2019

  • cv::findContours_legacy() : ~133ms
  • current PR cv::findContours() : ~137ms

On another Windows 10, vs2022

  • cv::findContours_legacy() : ~66ms
  • current PR cv::findContours() : ~68ms

private:
std::stack<int> levels;
Tree<T>& tree;
vectorOfRanges<int> levels;
Contributor


I'm observing better performance with std::stack (though it is less stable):

On x86_64 (Core i5-11600, Linux):

  • stack: samples=100 mean=44.01 median=42.88 min=40.59 stddev=2.44 (5.6%)
  • vectorOfRanges: samples=100 mean=46.97 median=46.97 min=46.65 stddev=0.16 (0.3%)

On AArch64 (Rockchip RK3588):

  • stack: samples=100 mean=79.23 median=79.02 min=78.35 stddev=1.27 (1.6%)
  • vectorOfRanges: samples=100 mean=89.51 median=89.52 min=88.90 stddev=0.21 (0.2%)

I propose to keep the std::stack version; it is also simpler.

- class BlockStorage has its own file
- TreeIterator::levels reverted to std::stack (got rid of vectorOfRanges)
- ContourPointsStorage and ContourCodesStorage are template specializations of the same class ContourDataStorage
- ContourPointsStorage and ContourCodesStorage use a static size of 0 for their inner BlockStorage

I have run the standard findContours perf tests to check that all those changes do not slow down the average case (because the current use case of the PR must not take precedence). It's OK.
Contributor

@mshabunin mshabunin left a comment


Looks good to me. With minor cosmetic comments.

@asmorkalov asmorkalov added this to the 4.12.0 milestone Mar 12, 2025
@asmorkalov asmorkalov changed the title WIP: Find contours speedup Find contours speedup Mar 12, 2025
@asmorkalov asmorkalov merged commit d83df66 into opencv:4.x Mar 12, 2025
27 of 28 checks passed
@asmorkalov asmorkalov mentioned this pull request Apr 29, 2025
