proposal to add native Windows aligned memory apis --> little benefit

Below is an evaluation of adding native Windows aligned memory allocation to OpenCV.

TL;DR: I found no conclusive evidence of an improvement or benefit from using the native
Windows aligned memory APIs. Alternative memory allocators such as Hoard (http://hoard.org/) or the TBB allocators
(https://www.threadingbuildingblocks.org/docs/help/tbb_userguide/Memory_Allocation.html)
might yield better results across platforms.

I will leave this issue open for a week to capture comments. After that, I intend
to close it since I do not intend to submit a PR. This issue will serve as
information for any future questions/investigations into this topic.

### Proposal

Windows has a set of APIs specifically for aligned memory. I wanted to explore
whether these APIs offer any benefit over OpenCV's existing alignment method.
Using the native Windows APIs requires the following:

* Extend OpenCV's CMake files and core allocation code (cpp) to support native Windows aligned allocation
* In the Windows codepath, use `_aligned_malloc()` and `_aligned_free()`
* Add test cases to validate accuracy
* Collect accuracy and performance test results
* Ensure there are no fragmentation side effects
* Evaluate the benefit of switching to the native Windows APIs
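
To make the two approaches concrete, here is a minimal sketch of the comparison. The function names and the 64-byte alignment are hypothetical and are not taken from the branch or from OpenCV. The custom path shows a generic over-allocate-and-round-up technique, and it is an assumption that this resembles OpenCV's own code. The native path uses `_aligned_malloc()` and `_aligned_free()` on Windows and falls back to the custom path elsewhere so the sketch compiles on any platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#if defined(_WIN32)
#include <malloc.h>  // _aligned_malloc, _aligned_free
#endif

// Assumed alignment for illustration; must be a power of two.
constexpr std::size_t kAlign = 64;
static_assert((kAlign & (kAlign - 1)) == 0, "kAlign must be a power of two");

// Custom method: over-allocate with malloc, round the address up to kAlign,
// and stash the original pointer just before the aligned block.
void* customAlignedMalloc(std::size_t size) {
    unsigned char* raw =
        static_cast<unsigned char*>(std::malloc(size + sizeof(void*) + kAlign));
    if (!raw) return nullptr;
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t aligned =
        (start + kAlign - 1) & ~static_cast<std::uintptr_t>(kAlign - 1);
    assert(aligned % kAlign == 0);
    reinterpret_cast<void**>(aligned)[-1] = raw;  // remember what to free
    return reinterpret_cast<void*>(aligned);
}

void customAlignedFree(void* p) {
    if (p) std::free(static_cast<void**>(p)[-1]);
}

// Native method: let the Windows CRT handle alignment (custom fallback elsewhere).
void* nativeAlignedMalloc(std::size_t size) {
#if defined(_WIN32)
    return _aligned_malloc(size, kAlign);
#else
    return customAlignedMalloc(size);
#endif
}

void nativeAlignedFree(void* p) {
#if defined(_WIN32)
    _aligned_free(p);
#else
    customAlignedFree(p);
#endif
}
```

Memory from one pair must always be released with the matching free function; mixing `_aligned_malloc()` with plain `free()` is not valid.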

### Code

The code changes are straightforward. I did the work on top of
an OpenCV master commit between 4.5.0 and 4.5.1 and stored it in the following branch:
https://github.com/diablodale/opencv/tree/win32AlignAlloc
I have not submitted it as a PR since I have no evidence it is a benefit.

### Unit Test for Accuracy

A new `test_allocation.cpp` was created with a test case that allocates
200 buffers ranging in size from 1 byte to approximately an 8K 32bpp image. The proposed
code in the branch correctly aligns all buffers.

```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Core_Allocation
[ RUN      ] Core_Allocation.alignedAllocation
[       OK ] Core_Allocation.alignedAllocation (90 ms)
[----------] 1 test from Core_Allocation (90 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (93 ms total)
[  PASSED  ] 1 test.
```
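
The following is a rough sketch of what such an alignment check could look like. It is not the code in `test_allocation.cpp`. The helper names, the way sizes are spread between 1 byte and the maximum, and the choice to free each buffer immediately are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when the address of p is a multiple of `alignment`.
bool isAligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

using AllocFn = void* (*)(std::size_t);
using FreeFn = void (*)(void*);

// Allocates `count` buffers with sizes spread linearly from 1 byte to maxSize
// and returns how many allocations failed or came back misaligned.
// Each buffer is freed right away to keep the memory footprint small.
std::size_t countMisaligned(AllocFn alloc, FreeFn release, std::size_t count,
                            std::size_t maxSize, std::size_t alignment) {
    assert(maxSize >= 1);
    std::size_t bad = 0;
    const std::size_t steps = count > 1 ? count - 1 : 1;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t size = 1 + (maxSize - 1) * i / steps;
        void* p = alloc(size);
        if (!p || !isAligned(p, alignment)) ++bad;
        if (p) release(p);
    }
    return bad;
}
```

A test case would pass the allocator under test (for example, wrappers around `_aligned_malloc()` and `_aligned_free()`) and expect the returned count to be zero.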

### Performance Test

There is no clear evidence that using the native Windows aligned memory APIs
increases performance. On the contrary, there is some evidence that using the standard
malloc/free with OpenCV's custom alignment code is ~2% faster.

Below is a performance test summary with a minimum of 50 samples for each test.
Values are in milliseconds. I do not have performance test results for the original
code that has *no* specific Windows support. I do not believe they are necessary since
the original code lacks the added blocks of code, branching, etc. Therefore, the
performance of that code should be the same as or faster than the `disable` results below.

* disable1/2 = native aligned memory disabled with envvar `OPENCV_ENABLE_MEMALIGN=0`
* native1/2 = native aligned memory enabled with envvar `OPENCV_ENABLE_MEMALIGN=1`


Name of Test | disable1 | disable2 | native1 | native2 | disable2 vs disable1 (xfactor) | native1 vs disable1 (xfactor) | native2 vs disable1 (xfactor)
-- | -- | -- | -- | -- | -- | -- | --
Allocation_Aligned::MatDepth_tb::8UC1 | 360.699 | 391.031 | 384.763 | 369.164 | 0.92 | 0.94 | 0.98
Allocation_Aligned::MatDepth_tb::16SC1 | 482.317 | 480.611 | 484.58 | 477.638 | 1 | 1 | 1.01
Allocation_Aligned::MatDepth_tb::8UC3 | 538.072 | 557.373 | 562.83 | 551.79 | 0.97 | 0.96 | 0.98
Allocation_Aligned::MatDepth_tb::8UC4 | 587.058 | 600.413 | 611.409 | 596.915 | 0.98 | 0.96 | 0.98
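
The xfactor columns appear to be the baseline time divided by the compared time, rounded to two decimals; this reading is inferred from the numbers in the table rather than stated by the tooling. Under that assumption, a value below 1 means the compared run was slower than `disable1`. A small sketch with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Assumed definition: baseline time divided by compared time.
// Values below 1 mean the compared run took longer than the baseline.
double xfactor(double baselineMs, double comparedMs) {
    assert(comparedMs > 0.0);
    return baselineMs / comparedMs;
}

// Round to two decimals, as shown in the table.
double round2(double v) {
    return std::round(v * 100.0) / 100.0;
}
```

For the 8UC1 row, `round2(xfactor(360.699, 384.763))` gives 0.94, matching the "native1 vs disable1" column. Note that `disable2` vs `disable1` for the same row is 0.92, so run-to-run noise is as large as the difference between the two methods.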

### Fragmentation Test

TL;DR: No significant fragmentation problems were seen. Working set and peak commit
vary by less than 1% up and down.

I adapted the Linux test written by
@mshabunin at https://gist.github.com/mshabunin/8f6d0d4d1ad26b8fdec878ab650a0df2 to
evaluate fragmentation side effects like those seen on Linux in
https://github.com/opencv/opencv/issues/15526. My Windows version is at
https://gist.github.com/diablodale/189082bac1e244bba8906ba175a1f3e7

The tests were run in both modes (raw aligned allocation, cv::Mat allocation). The cv::Mat
test was run using both alignment methods (custom OpenCV, native Windows). Finally, everything
was run in both Debug and Release builds. I do not see the dramatic memory
increases seen on Linux (e.g. 1.2 GB at iteration 400). With 8000 iterations,
the peak committed memory never exceeded 10 MB in any test.

While each test was running, the process was examined using Windows Resource Monitor.
During all 4 tests, there were no dramatic increases in process-specific memory
(committed, working set, sharable, private) or in overall
Windows memory usage. In Release builds, there were brief moments
when the working set would reach approximately 9 MB but would immediately drop back to
the 7 MB range, suggesting that reuse/trimming/cleanup was occurring. I did not
observe this in Debug builds.

All numbers are in bytes.

* raw unaligned = allocations using plain malloc() and free()
* raw aligned = aligned allocations using native aligned Windows APIs
* Mat custom = aligned allocations using OpenCV's custom alignment code
* Mat native = aligned allocations using Windows native alignment APIs


Release build |   |   |   |   |   |   |   |   |  
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
  | raw unaligned | raw aligned | diff | change | ... | Mat custom | Mat native | diff | change
Start - Working set | 6,098,944 | 6,094,848 | (4,096) | -0.07% |   | 6,348,800 | 6,348,800 | - | 0.00%
End - Working set | 7,016,448 | 7,016,448 | - | 0.00% |   | 7,163,904 | 7,159,808 | (4,096) | -0.06%
Difference | 917,504 | 921,600 |   |   |   | 815,104 | 811,008 |   |  
  |   |   |   |   |   |   |   |   |  
Start - Commit charge | 3,629,056 | 3,637,248 | 8,192 | 0.23% |   | 3,686,400 | 3,682,304 | (4,096) | -0.11%
End - Commit charge | 4,571,136 | 4,583,424 | 12,288 | 0.27% |   | 4,538,368 | 4,534,272 | (4,096) | -0.09%
Difference | 942,080 | 946,176 |   |   |   | 851,968 | 851,968 |   |  
  |   |   |   |   |   |   |   |   |  
Start - Peak commit | 6,701,056 | 6,692,864 | (8,192) | -0.12% |   | 6,696,960 | 6,733,824 | 36,864 | 0.55%
End - Peak commit charge | 7,651,328 | 7,655,424 | 4,096 | 0.05% |   | 7,610,368 | 7,614,464 | 4,096 | 0.05%
Difference | 950,272 | 962,560 |   |   |   | 913,408 | 880,640 |   |  

Debug build |   |   |   |   |   |   |   |   |  
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
  | raw unaligned | raw aligned | diff | change | ... | Mat custom | Mat native | diff | change
Start - Working set | 6,541,312 | 6,557,696 | 16,384 | 0.25% |   | 7,434,240 | 7,434,240 | - | 0.00%
End - Working set | 8,581,120 | 8,601,600 | 20,480 | 0.24% |   | 9,515,008 | 9,510,912 | (4,096) | -0.04%
Difference | 2,039,808 | 2,043,904 |   |   |   | 2,080,768 | 2,076,672 |   |  
  |   |   |   |   |   |   |   |   |  
Start - Commit charge | 3,657,728 | 3,661,824 | 4,096 | 0.11% |   | 3,792,896 | 3,772,416 | (20,480) | -0.54%
End - Commit charge | 5,918,720 | 5,967,872 | 49,152 | 0.83% |   | 6,098,944 | 6,107,136 | 8,192 | 0.13%
Difference | 2,260,992 | 2,306,048 |   |   |   | 2,306,048 | 2,334,720 |   |  
  |   |   |   |   |   |   |   |   |  
Start - Peak commit | 6,692,864 | 6,672,384 | (20,480) | -0.31% |   | 6,803,456 | 6,787,072 | (16,384) | -0.24%
End - Peak commit charge | 8,994,816 | 9,043,968 | 49,152 | 0.55% |   | 9,170,944 | 9,187,328 | 16,384 | 0.18%
Difference | 2,301,952 | 2,371,584 |   |   |   | 2,367,488 | 2,400,256 |   |  

Each of those tests (each cell in the tables) was run only once, so every value is a single sample.
Minor variations occur when a test is run repeatedly. On
my computer, they are all multiples of 4 KB, as seen in the tables above and
below. For example, here is what happens when I run the following Release build test cases a second
time:

Release build |   |   |   |  
-- | -- | -- | -- | --
  | run 1 | run 2 | difference | change
raw unaligned |   |   |   |  
Start - Working set | 6,098,944 | 6,094,848 | (4,096) | -0.067%
End - Peak commit charge | 7,651,328 | 7,655,424 | 4,096 | 0.054%
  |   |   |   |  
Mat native |   |   |   |  
Start - Working set | 6,348,800 | 6,348,800 | - | 0.000%
End - Peak commit charge | 7,614,464 | 7,606,272 | (8,192) | -0.108%
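
For anyone who wants to reproduce these measurements in-process instead of reading them from a monitoring tool, here is a minimal sketch. It assumes that working set, commit charge, and peak commit charge correspond to the `WorkingSetSize`, `PagefileUsage`, and `PeakPagefileUsage` fields returned by `GetProcessMemoryInfo()`. The struct and function names are hypothetical, and this is not necessarily how the numbers above were collected. On non-Windows platforms the snapshot is simply left at zero.

```cpp
#include <cassert>
#include <cstddef>
#if defined(_WIN32)
#include <windows.h>
#include <psapi.h>  // GetProcessMemoryInfo; older SDKs need psapi.lib at link time
#endif

// Hypothetical container for the three counters used in the tables.
struct MemorySnapshot {
    std::size_t workingSet = 0;
    std::size_t commitCharge = 0;
    std::size_t peakCommitCharge = 0;
};

// Samples the current process; returns zeros on failure or off Windows.
MemorySnapshot sampleMemory() {
    MemorySnapshot s;
#if defined(_WIN32)
    PROCESS_MEMORY_COUNTERS pmc = {};
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
        s.workingSet = pmc.WorkingSetSize;
        s.commitCharge = pmc.PagefileUsage;
        s.peakCommitCharge = pmc.PeakPagefileUsage;
    }
#endif
    return s;
}

// Relative change in percent, as in the "change" columns:
// (after - before) / before * 100.
double percentChange(std::size_t before, std::size_t after) {
    assert(before != 0);
    return (static_cast<double>(after) - static_cast<double>(before)) /
           static_cast<double>(before) * 100.0;
}
```

Calling `sampleMemory()` before and after the allocation loop yields the "Start" and "End" rows, and `percentChange(6098944, 6094848)` reproduces the roughly -0.067% shown in the run 1 vs run 2 table.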



