proposal to add native Windows aligned memory apis --> little benefit #19147

@diablodale

Below is an evaluation of adding native Windows aligned memory allocation to OpenCV.

Tldr: I found no conclusive evidence of an improvement or benefit from using the native
aligned memory Windows APIs. Alternate memory allocators from http://hoard.org/ or TBB
https://www.threadingbuildingblocks.org/docs/help/tbb_userguide/Memory_Allocation.html
might yield better results across platforms.

I will leave this issue open for a week to capture comments. After that, I intend
to close it since I do not intend to submit a PR. This issue will serve as
information for any future questions/investigations into this topic.

Proposal

Windows has a set of dedicated APIs for aligned memory. I wanted to explore
whether these APIs offer any benefit compared to OpenCV's existing alignment method.
To use the native Windows APIs, the following is needed:

  • Extend OpenCV cmake and core allocation cpp for Windows native aligned allocation
  • In the Windows codepath, use _aligned_malloc() and _aligned_free()
  • Add test cases to validate accuracy
  • Collect accuracy and performance test results
  • Ensure no fragmentation side effects
  • Evaluate the benefit of switching to native Windows APIs
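
To make the two approaches concrete, here is a minimal sketch of what each code path could look like. It is not the code from the branch or from OpenCV itself: the function names and the 64-byte alignment constant are assumptions made for illustration. The custom path over-allocates with malloc(), rounds the pointer up, and stores the original pointer just before the aligned block; the native path calls _aligned_malloc()/_aligned_free() on Windows and falls back to the custom path elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#if defined(_WIN32)
#include <malloc.h>  // _aligned_malloc, _aligned_free
#endif

// Assumed alignment for this sketch; must be a power of two.
constexpr std::size_t kAlign = 64;

// Custom path (hypothetical name): over-allocate, round up to kAlign,
// and stash the pointer returned by malloc() right before the aligned block.
inline void* customAlignedMalloc(std::size_t size)
{
    unsigned char* raw =
        static_cast<unsigned char*>(std::malloc(size + sizeof(void*) + kAlign));
    if (!raw)
        return nullptr;
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t aligned =
        (start + kAlign - 1) & ~static_cast<std::uintptr_t>(kAlign - 1);
    void** slot = reinterpret_cast<void**>(aligned);
    slot[-1] = raw;  // remembered so the free function can release it
    return slot;
}

inline void customAlignedFree(void* p)
{
    if (p)
        std::free(static_cast<void**>(p)[-1]);
}

// Native path (hypothetical name): Windows CRT aligned allocation,
// with the custom path as a fallback on other platforms.
inline void* nativeAlignedMalloc(std::size_t size)
{
#if defined(_WIN32)
    return _aligned_malloc(size, kAlign);
#else
    return customAlignedMalloc(size);
#endif
}

inline void nativeAlignedFree(void* p)
{
#if defined(_WIN32)
    _aligned_free(p);  // memory from _aligned_malloc must not go to free()
#else
    customAlignedFree(p);
#endif
}
```

Each allocation function must be paired with its own free function. Mixing them (for example, passing a pointer from customAlignedMalloc() to _aligned_free()) is undefined behavior in this sketch.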

Code

The code changes are straightforward. I based the work on
an OpenCV master commit between 4.5.0 and 4.5.1 and stored it in the following branch:
https://github.com/diablodale/opencv/tree/win32AlignAlloc
I have not submitted it as a PR since I have no evidence it is a benefit.

Unit Test for Accuracy

A new test_allocation.cpp was created with a test case that allocates
200 buffers ranging from 1 byte to approximately the size of an 8K 32bpp image.
The proposed code in the branch accurately aligns all buffers.

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Core_Allocation
[ RUN      ] Core_Allocation.alignedAllocation
[       OK ] Core_Allocation.alignedAllocation (90 ms)
[----------] 1 test from Core_Allocation (90 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (93 ms total)
[  PASSED  ] 1 test.
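
The following is a rough sketch of the kind of check such a test can perform. It is not the test from the branch: the helper name, the linear size progression, and the allocator callbacks are assumptions for illustration. It allocates a series of buffers of growing size through a caller-supplied allocator and counts how many come back null or misaligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate `count` buffers whose sizes grow linearly
// from 1 byte to `maxBytes`, one at a time, and return how many were
// null or not aligned to `alignment`. `maxBytes` must be at least 1.
template <typename AllocFn, typename FreeFn>
std::size_t countBadBuffers(AllocFn allocate, FreeFn release,
                            std::size_t alignment, std::size_t count,
                            std::uint64_t maxBytes)
{
    std::size_t bad = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // Linear interpolation between 1 and maxBytes (64-bit to avoid overflow).
        std::uint64_t step = (count > 1) ? (maxBytes - 1) * i / (count - 1) : 0;
        std::size_t size = static_cast<std::size_t>(1 + step);

        void* p = allocate(size);
        if (!p || reinterpret_cast<std::uintptr_t>(p) % alignment != 0)
            ++bad;
        if (p)
            release(p);
    }
    return bad;
}
```

Assuming "8K" means 7680 x 4320 pixels, a 32bpp image occupies 7680 * 4320 * 4 = 132,710,400 bytes, so a call resembling the test described above would pass 200 as `count` and 132710400 as `maxBytes`, together with the allocator pair under test and its expected alignment. A result of 0 means every buffer was aligned.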

Performance Test

There is no clear evidence that using the native Windows aligned memory APIs
increases performance. In contrast, there is some evidence that using the standard
malloc/free with OpenCV's custom alignment code is ~2% faster.

Below is a performance test summary with a minimum of 50 samples for each test.
Values are in milliseconds. I do not have performance test results for the original
code, which has no Windows-specific support. I do not believe they are necessary:
the original code lacks the added blocks of code, branching, etc., so its
performance should be the same as or faster than the disable results below.

disable1/2 = native aligned memory disabled with envvar OPENCV_ENABLE_MEMALIGN=0
native1/2 = native aligned memory enabled with envvar OPENCV_ENABLE_MEMALIGN=1

| Name of Test | disable1 | disable2 | native1 | native2 | disable2 vs disable1 (xfactor) | native1 vs disable1 (xfactor) | native2 vs disable1 (xfactor) |
|---|---|---|---|---|---|---|---|
| Allocation_Aligned::MatDepth_tb::8UC1 | 360.699 | 391.031 | 384.763 | 369.164 | 0.92 | 0.94 | 0.98 |
| Allocation_Aligned::MatDepth_tb::16SC1 | 482.317 | 480.611 | 484.58 | 477.638 | 1 | 1 | 1.01 |
| Allocation_Aligned::MatDepth_tb::8UC3 | 538.072 | 557.373 | 562.83 | 551.79 | 0.97 | 0.96 | 0.98 |
| Allocation_Aligned::MatDepth_tb::8UC4 | 587.058 | 600.413 | 611.409 | 596.915 | 0.98 | 0.96 | 0.98 |
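
The xfactor columns in the table above are not defined in the text, but the values appear consistent with the disable1 time divided by the compared run's time, rounded to two decimals. Under that assumption, a value below 1 means the compared run was slower than disable1. A small sketch of the arithmetic, with hypothetical function names:

```cpp
#include <cassert>
#include <cmath>

// Assumed definition: baseline time divided by the compared run's time.
// Values below 1 mean the compared run took longer than the baseline.
inline double xfactor(double baselineMs, double comparedMs)
{
    return baselineMs / comparedMs;
}

// Round to two decimals, matching the precision shown in the table.
inline double round2(double value)
{
    return std::round(value * 100.0) / 100.0;
}
```

For the 8UC1 row, round2(xfactor(360.699, 384.763)) gives 0.94, which matches the native1 vs disable1 column.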

Fragmentation Test

Tldr: No significant fragmentation problems seen. Changes in working set and peak commit
vary less than 1% up and down.

I adapted the Linux test written by
@mshabunin at https://gist.github.com/mshabunin/8f6d0d4d1ad26b8fdec878ab650a0df2 to
evaluate fragmentation side effects like those seen on Linux in
#15526. My Windows version is at
https://gist.github.com/diablodale/189082bac1e244bba8906ba175a1f3e7

The tests were run in both modes (raw aligned allocation, cv::Mat allocation). The cv::Mat
test was run using both alignment methods (custom OpenCV, native Windows). Finally, each
was run in Debug and Release builds. I do not see the dramatic memory
increases seen on Linux (e.g. 1.2 GB at iteration 400). With 8000 iterations,
the peak committed memory never exceeded 10 MB in any test.

While each test was running, the process was examined using Windows Resource Monitor.
During all 4 tests, there were no dramatic increases in process-specific memory
(committed, working set, sharable, private) or dramatic increases in overall
Windows memory usage. In Release builds, there were brief moments
when the working set would be approximately 9 MB but would immediately drop to
the 7 MB range, suggesting reuse/trimming/cleanup occurring. I did not
observe this in Debug builds.
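
For readers who want to capture similar numbers programmatically, here is a minimal sketch. It is not the code from either gist, and the issue does not state which API produced the figures below. The struct, the function names, and the allocation pattern are assumptions for illustration. On Windows it reads the working set, commit charge, and peak commit charge of the current process with GetProcessMemoryInfo(); on other platforms the snapshot is simply marked invalid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>
#if defined(_WIN32)
#include <windows.h>
#include <psapi.h>  // may require linking Psapi.lib, depending on the SDK target
#endif

// Hypothetical container for the three counters discussed in this issue.
struct MemorySnapshot {
    std::size_t workingSet = 0;
    std::size_t commitCharge = 0;
    std::size_t peakCommitCharge = 0;
    bool valid = false;  // false when the counters could not be read
};

inline MemorySnapshot takeSnapshot()
{
    MemorySnapshot s;
#if defined(_WIN32)
    PROCESS_MEMORY_COUNTERS pmc;
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
        s.workingSet = pmc.WorkingSetSize;
        s.commitCharge = pmc.PagefileUsage;          // commit charge in bytes
        s.peakCommitCharge = pmc.PeakPagefileUsage;  // peak commit charge in bytes
        s.valid = true;
    }
#endif
    return s;
}

// Hypothetical churn loop: allocate buffers of varying size, keep a few
// alive, and free them in batches. Returns the total bytes requested.
inline std::size_t churn(int iterations)
{
    std::vector<void*> live;
    std::size_t total = 0;
    for (int i = 0; i < iterations; ++i) {
        std::size_t size = 1024u * static_cast<std::size_t>(1 + i % 64);
        void* p = std::malloc(size);
        if (!p)
            break;
        static_cast<unsigned char*>(p)[0] = 1;  // touch the first page
        total += size;
        live.push_back(p);
        if (live.size() >= 16) {                // release the batch
            for (void* q : live)
                std::free(q);
            live.clear();
        }
    }
    for (void* q : live)
        std::free(q);
    return total;
}
```

Taking one snapshot before churn() and one after, then subtracting the fields, gives Start, End, and Difference values in the same shape as the tables below.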

All numbers are bytes
raw unaligned = allocations using plain malloc() and free()
raw aligned = aligned allocations using native aligned Windows apis
Mat custom = aligned allocations using OpenCV's custom alignment code
Mat native = aligned allocations using Windows native alignment apis

Release build

| Metric | raw unaligned | raw aligned | diff | change | Mat custom | Mat native | diff | change |
|---|---|---|---|---|---|---|---|---|
| Start - Working set | 6,098,944 | 6,094,848 | (4,096) | -0.07% | 6,348,800 | 6,348,800 | - | 0.00% |
| End - Working set | 7,016,448 | 7,016,448 | - | 0.00% | 7,163,904 | 7,159,808 | (4,096) | -0.06% |
| Difference | 917,504 | 921,600 | | | 815,104 | 811,008 | | |
| Start - Commit charge | 3,629,056 | 3,637,248 | 8,192 | 0.23% | 3,686,400 | 3,682,304 | (4,096) | -0.11% |
| End - Commit charge | 4,571,136 | 4,583,424 | 12,288 | 0.27% | 4,538,368 | 4,534,272 | (4,096) | -0.09% |
| Difference | 942,080 | 946,176 | | | 851,968 | 851,968 | | |
| Start - Peak commit charge | 6,701,056 | 6,692,864 | (8,192) | -0.12% | 6,696,960 | 6,733,824 | 36,864 | 0.55% |
| End - Peak commit charge | 7,651,328 | 7,655,424 | 4,096 | 0.05% | 7,610,368 | 7,614,464 | 4,096 | 0.05% |
| Difference | 950,272 | 962,560 | | | 913,408 | 880,640 | | |

Debug build

| Metric | raw unaligned | raw aligned | diff | change | Mat custom | Mat native | diff | change |
|---|---|---|---|---|---|---|---|---|
| Start - Working set | 6,541,312 | 6,557,696 | 16,384 | 0.25% | 7,434,240 | 7,434,240 | - | 0.00% |
| End - Working set | 8,581,120 | 8,601,600 | 20,480 | 0.24% | 9,515,008 | 9,510,912 | (4,096) | -0.04% |
| Difference | 2,039,808 | 2,043,904 | | | 2,080,768 | 2,076,672 | | |
| Start - Commit charge | 3,657,728 | 3,661,824 | 4,096 | 0.11% | 3,792,896 | 3,772,416 | (20,480) | -0.54% |
| End - Commit charge | 5,918,720 | 5,967,872 | 49,152 | 0.83% | 6,098,944 | 6,107,136 | 8,192 | 0.13% |
| Difference | 2,260,992 | 2,306,048 | | | 2,306,048 | 2,334,720 | | |
| Start - Peak commit charge | 6,692,864 | 6,672,384 | (20,480) | -0.31% | 6,803,456 | 6,787,072 | (16,384) | -0.24% |
| End - Peak commit charge | 8,994,816 | 9,043,968 | 49,152 | 0.55% | 9,170,944 | 9,187,328 | 16,384 | 0.18% |
| Difference | 2,301,952 | 2,371,584 | | | 2,367,488 | 2,400,256 | | |

Each test (each cell in the tables) was run only once, so every value is a single sample.
Minor variations occur when a test is run repeatedly. On
my computer, they are all multiples of 4 KB, as seen in the tables above and
below. For example, when I run the following Release build test cases a second
time...

Release build

| Metric | run 1 | run 2 | difference | change |
|---|---|---|---|---|
| raw unaligned: Start - Working set | 6,098,944 | 6,094,848 | (4,096) | -0.067% |
| raw unaligned: End - Peak commit charge | 7,651,328 | 7,655,424 | 4,096 | 0.054% |
| Mat native: Start - Working set | 6,348,800 | 6,348,800 | - | 0.000% |
| Mat native: End - Peak commit charge | 7,614,464 | 7,606,272 | (8,192) | -0.108% |
