Skip to content

Optimize cv::applyColorMap() for simple case#21645

Merged
alalek merged 8 commits intoopencv:4.xfrom
chacha21:applyColorMap_8UC1_optimized
Mar 1, 2022
Merged

Optimize cv::applyColorMap() for simple case#21645
alalek merged 8 commits intoopencv:4.xfrom
chacha21:applyColorMap_8UC1_optimized

Conversation

@chacha21
Copy link
Copy Markdown
Contributor

@chacha21 chacha21 commented Feb 20, 2022

proposal for #21640
For regular cv::Mat CV_8UC1 src, applying the colormap is simpler than calling the cv::LUT() mechanism.

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

PR for opencv#21640
For regular cv::Mat CV_8UC1 src, applying the colormap is simpler than calling the cv::LUT() mechanism.
src as CV_8UC3 is handled with a BGR2GRAY conversion, the same optimized code being used afterwards
srcGray.forEach<unsigned char>([&](unsigned char& pixel, const int* position) -> void {
const int row = position[0];
const int col = position[1];
memcpy(dstMat.data+row*dstMat.step+col*elemSize, userColorMat.data+pixel*userColorMat.step, elemSize);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memcpy(..., elemSize)

Provide performance numbers without and with your patch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Windows 7 64b

//testing cv::applyColorMap
{
	cv::setRNGSeed(1234);
	cv::Mat colorMap = cv::Mat(256, 1, CV_8UC3);
	cv::randu(colorMap, 0, 255);
	std::vector<cv::Size> sizes = {{32, 32}, {64, 64}, {128, 128},  {256, 256}, {512, 512}, {1024, 1024}};
	for(const auto& size : sizes)
	{
		cv::Mat src(size, CV_8UC1);//CV_8UC1//CV_8UC3
		cv::Mat dst(size, CV_8UC3);

		__int64 totalTicks = 0;
		const size_t nbIters = 20000;
		for(size_t i = 0 ; i <nbIters ; ++i)
		{
			cv::randu(src, 0, 255);
			const auto ref = cv::getTickCount();
			cv::applyColorMap(src, dst, colorMap);
			const auto now = cv::getTickCount();
			totalTicks += (now-ref);
		}
		printf("%llu x cv::applyColorMap() for size(%d,%d) and type %s: %f ms\r\n",
			(unsigned long long)nbIters,
			size.width, size.height,
			(src.type() == CV_8UC1) ? "CV_8UC1" :
			(src.type() == CV_8UC3) ? "CV_8UC3" :
			"?",
			1000.*totalTicks/cv::getTickFrequency());
	}
	cv::waitKey();
}

results :

CV_8UC1
original code :
20000 x cv::applyColorMap() for size(32,32) and type CV_8UC1: 88.819806 ms
20000 x cv::applyColorMap() for size(64,64) and type CV_8UC1: 269.382595 ms
20000 x cv::applyColorMap() for size(128,128) and type CV_8UC1: 1035.405428 ms
20000 x cv::applyColorMap() for size(256,256) and type CV_8UC1: 3899.305621 ms
20000 x cv::applyColorMap() for size(512,512) and type CV_8UC1: 11115.692492 ms
20000 x cv::applyColorMap() for size(1024,1024) and type CV_8UC1: 39989.798904 ms

with patch :
20000 x cv::applyColorMap() for size(32,32) and type CV_8UC1: 105.990024 ms
20000 x cv::applyColorMap() for size(64,64) and type CV_8UC1: 178.348563 ms
20000 x cv::applyColorMap() for size(128,128) and type CV_8UC1: 390.924573 ms
20000 x cv::applyColorMap() for size(256,256) and type CV_8UC1: 1185.652095 ms
20000 x cv::applyColorMap() for size(512,512) and type CV_8UC1: 3715.295696 ms
20000 x cv::applyColorMap() for size(1024,1024) and type CV_8UC1: 15669.887541 ms

CV_8UC3
original code :
20000 x cv::applyColorMap() for size(32,32) and type CV_8UC3: 159.482952 ms
20000 x cv::applyColorMap() for size(64,64) and type CV_8UC3: 351.207949 ms
20000 x cv::applyColorMap() for size(128,128) and type CV_8UC3: 1232.039576 ms
20000 x cv::applyColorMap() for size(256,256) and type CV_8UC3: 7554.557066 ms
20000 x cv::applyColorMap() for size(512,512) and type CV_8UC3: 17448.084637 ms
20000 x cv::applyColorMap() for size(1024,1024) and type CV_8UC3: 69502.533269 ms

with patch :
20000 x cv::applyColorMap() for size(32,32) and type CV_8UC3: 145.020181 ms
20000 x cv::applyColorMap() for size(64,64) and type CV_8UC3: 263.634428 ms
20000 x cv::applyColorMap() for size(128,128) and type CV_8UC3: 690.046396 ms
20000 x cv::applyColorMap() for size(256,256) and type CV_8UC3: 1889.553916 ms
20000 x cv::applyColorMap() for size(512,512) and type CV_8UC3: 5617.506280 ms
20000 x cv::applyColorMap() for size(1024,1024) and type CV_8UC3: 22068.121624 ms

srcGray.forEach<unsigned char>([&](unsigned char& pixel, const int* position) -> void {
const int row = position[0];
const int col = position[1];
memcpy(dstMat.data+row*dstMat.step+col*elemSize, userColorMat.data+pixel*userColorMat.step, elemSize);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+row*dstMat.step+col*elemSize
+pixel*userColorMat.step

Manual pointer calculations should be avoided.

Copy link
Copy Markdown
Contributor Author

@chacha21 chacha21 Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be a more canonical way to write that without branching one if() per possible userColorMat.type() ?
[edit]
Ok, figured it out

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have re-implemented LUT() functionality in this forEach block.

In general this not a good approach:

  • we have separate code for duplicated functionality.
  • code optimization level is not good (memcpy call with non-fixed dynamic elemSize value)

You should use keep using optimized LUT and tune colormap::UserColorMap implementation instead.


Unfortunately this function is not tested at all:

  • there are no accuracy tests
  • there are no performance tests.

Copy link
Copy Markdown
Contributor Author

@chacha21 chacha21 Feb 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have re-implemented LUT() functionality in this forEach block.

Not really. cv::LUT() requires the same number of channels for src and dst, while in the context of applyColorMap(), src can be CV_8UC1, and dst CV_8UC3, so this is is more a "fast path" to avoid converting input on the sole purpose of making it "cv::LUT compliant"

there are no accuracy tests

I do not know how to show that, but I have run tests to check numerical identical results. (basically, not applying the patch, and comparing "gold" results to custom implementation through countNonZerot(absdiff(..))

there are no performance tests.

I do no know how to add such tests to the OpenCV build/test system, but I have already edited the current PR with a copy/paste of my timings

code optimization level is not good (memcpy call with non-fixed dynamic elemSize value)

branching fixed-size memcpy() for (elemSize == 1) and (elemSize == 3) improves the timings, but I thought it was not a good cv style

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really.

Right. Got it. cv::LUT() doesn't support required mode.


To get PR merged please move optimization code to the real processing implementation:

void ColorMap::operator()(InputArray _src, OutputArray _dst) const

Next, There are several assumptions about src/dst/userColor mats. Need to add the right checks about these assumptions (code wont process 3D tensors properly). Reuse applicable checks from cv::LUT() implementation.
BTW, prefer to use CV_Check*() macros instead of CV_Assert().


BTW, probably it makes sense to deprecate and remove later support case with src.channels = 3.

rely on cv::Mat.ptr() to index data
Changes as suggested by reviewer
Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for update! Overall looks good.

srcGray.forEach<unsigned char>([&](unsigned char& pixel, const int* position) -> void {
const int row = position[0];
const int col = position[1];
dstMat.at<unsigned char>(row, col) = lut.at<unsigned char>(pixel, 0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lut.at<unsigned char>(pixel, 0);

Please use lut.at<unsigned char>(pixel); as there is no difference in processing of 1D vectors as 2D matrices (both row/col-based are handled by .at<>() method)

.reshape() call above could be removed after that change.


const Mat lut = this->_lut.reshape(0, 256);//just in case LUT was in row format rather than col
const int lut_type = lut.type(), lut_depth = CV_MAT_DEPTH(lut_type), lut_cn = CV_MAT_CN(lut_type);
CV_CheckType(lut_type, (lut_depth == CV_8U) && (lut_cn == 1 || lut_cn == 3),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(lut_depth == CV_8U) && (lut_cn == 1 || lut_cn == 3)

this should be equal and simplified condition:

lut_type == CV_8UC1 || lut_type == CV_8UC3

if (lut_type == CV_8UC1)
srcGray.forEach<unsigned char>([&](unsigned char& pixel, const int* position) -> void {
const int row = position[0];
const int col = position[1];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_src.dims() == 2 "assumption" check should be added.

chacha21 added 2 commits March 1, 2022 08:29
improvements suggsted by reviewer
@chacha21
Copy link
Copy Markdown
Contributor Author

chacha21 commented Mar 1, 2022

original 4.x code
100000 x cv::applyColorMap(userColor) for size(32,32) and type CV_8UC1: 413.340877 ms
100000 x cv::applyColorMap(userColor) for size(64,64) and type CV_8UC1: 1322.934941 ms
100000 x cv::applyColorMap(userColor) for size(128,128) and type CV_8UC1: 5259.299441 ms
100000 x cv::applyColorMap(userColor) for size(256,256) and type CV_8UC1: 19481.239747 ms
100000 x cv::applyColorMap(userColor) for size(512,512) and type CV_8UC1: 53744.406390 ms
100000 x cv::applyColorMap(userColor) for size(1024,1024) and type CV_8UC1: 204328.241670 ms
100000 x cv::applyColorMap(JET) for size(32,32) and type CV_8UC1: 31242.315972 ms
100000 x cv::applyColorMap(JET) for size(64,64) and type CV_8UC1: 32304.962187 ms
100000 x cv::applyColorMap(JET) for size(128,128) and type CV_8UC1: 36312.397471 ms
100000 x cv::applyColorMap(JET) for size(256,256) and type CV_8UC1: 51336.647169 ms
100000 x cv::applyColorMap(JET) for size(512,512) and type CV_8UC1: 73202.010330 ms
100000 x cv::applyColorMap(JET) for size(1024,1024) and type CV_8UC1: 227124.843895 ms

with patch
100000 x cv::applyColorMap(userColor) for size(32,32) and type CV_8UC1: 491.504566 ms	=> +19%
100000 x cv::applyColorMap(userColor) for size(64,64) and type CV_8UC1: 748.580964 ms => -43%
100000 x cv::applyColorMap(userColor) for size(128,128) and type CV_8UC1: 1429.701348 ms	=> -73%
100000 x cv::applyColorMap(userColor) for size(256,256) and type CV_8UC1: 4100.422296 ms	=> -79%
100000 x cv::applyColorMap(userColor) for size(512,512) and type CV_8UC1: 12436.729648 ms	=> -77%
100000 x cv::applyColorMap(userColor) for size(1024,1024) and type CV_8UC1: 46005.752704 ms	=> -77%
100000 x cv::applyColorMap(JET) for size(32,32) and type CV_8UC1: 44762.778158 ms	=> +43%
100000 x cv::applyColorMap(JET) for size(64,64) and type CV_8UC1: 44510.404844 ms	=> +37%
100000 x cv::applyColorMap(JET) for size(128,128) and type CV_8UC1: 43447.038011 ms	=> +19%
100000 x cv::applyColorMap(JET) for size(256,256) and type CV_8UC1: 36111.826954 ms	=>-30%
100000 x cv::applyColorMap(JET) for size(512,512) and type CV_8UC1: 44628.305535 ms	=> -40%
100000 x cv::applyColorMap(JET) for size(1024,1024) and type CV_8UC1: 79051.147638 ms	=> -65%

It can be noted that small sizes seem not perform as well, especially for built-in colormaps. I am not sure how to interpret that. I have checked that it is related to the per-row granularity of for_each. Tuning the number of rows handled by each body of parallel_for_, so that each packet have minimum reasonable pixel count, improves performance (I am currently exploring to find a good compromise, taking inspiration from cv::LUT())

@chacha21
Copy link
Copy Markdown
Contributor Author

chacha21 commented Mar 1, 2022

After tuning parallel work, timings are improved. But it is certainly heavily dependant on the numbers of CPU core/threads.


original 4.x code
100000 x cv::applyColorMap(userColor) for size(32,32) and type CV_8UC1: 413.340877 ms
100000 x cv::applyColorMap(userColor) for size(64,64) and type CV_8UC1: 1322.934941 ms
100000 x cv::applyColorMap(userColor) for size(128,128) and type CV_8UC1: 5259.299441 ms
100000 x cv::applyColorMap(userColor) for size(256,256) and type CV_8UC1: 19481.239747 ms
100000 x cv::applyColorMap(userColor) for size(512,512) and type CV_8UC1: 53744.406390 ms
100000 x cv::applyColorMap(userColor) for size(1024,1024) and type CV_8UC1: 204328.241670 ms
100000 x cv::applyColorMap(JET) for size(32,32) and type CV_8UC1: 31242.315972 ms
100000 x cv::applyColorMap(JET) for size(64,64) and type CV_8UC1: 32304.962187 ms
100000 x cv::applyColorMap(JET) for size(128,128) and type CV_8UC1: 36312.397471 ms
100000 x cv::applyColorMap(JET) for size(256,256) and type CV_8UC1: 51336.647169 ms
100000 x cv::applyColorMap(JET) for size(512,512) and type CV_8UC1: 73202.010330 ms
100000 x cv::applyColorMap(JET) for size(1024,1024) and type CV_8UC1: 227124.843895 ms

with patch
100000 x cv::applyColorMap(userColor) for size(32,32) and type CV_8UC1: 304.287106 ms	=> -26%
100000 x cv::applyColorMap(userColor) for size(64,64) and type CV_8UC1: 1093.173997 ms	=> -17%
100000 x cv::applyColorMap(userColor) for size(128,128) and type CV_8UC1: 2211.233782 ms	=> -58%
100000 x cv::applyColorMap(userColor) for size(256,256) and type CV_8UC1: 5893.139673 ms	=> -70%
100000 x cv::applyColorMap(userColor) for size(512,512) and type CV_8UC1: 17804.391272 ms	=> -67%
100000 x cv::applyColorMap(userColor) for size(1024,1024) and type CV_8UC1: 68944.028322 ms	=> -67%
100000 x cv::applyColorMap(JET) for size(32,32) and type CV_8UC1: 31083.159630 ms	=> -0.5%
100000 x cv::applyColorMap(JET) for size(64,64) and type CV_8UC1: 31872.934770 ms	=> -1.33%
100000 x cv::applyColorMap(JET) for size(128,128) and type CV_8UC1: 33071.244961 ms	=> -9%
100000 x cv::applyColorMap(JET) for size(256,256) and type CV_8UC1: 37666.202003 ms	=> -27%
100000 x cv::applyColorMap(JET) for size(512,512) and type CV_8UC1: 51594.348558 ms	=> -30%
100000 x cv::applyColorMap(JET) for size(1024,1024) and type CV_8UC1: 104773.154826 ms	=> -54%

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments about possible optimization improvements.

Please let me know when it is ready for merging.

Comment on lines +776 to +781
const int minimalPixelsPerPacket = 1<<12;
const int rowsPerPacket = std::max(1, minimalPixelsPerPacket/srcGray.cols);
const int rowsPacketCount = (srcGray.rows+rowsPerPacket-1)/rowsPerPacket;
const int rows = srcGray.rows;
const int cols = srcGray.cols;
Range all(0, rowsPacketCount);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this prolog code is common between branches from lut_type == CV_8UC1 check

Comment on lines +768 to +770
if (rowsPacketCount <= 1)
body(all);
else
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really need rowsPacketCount <= 1 code path - parallel_for() does the same thing internally.

BTW, there is third parameter of parallel_for_() - using of that parameter would allow to eliminate rowsPerPacket variable.

auto body = [&, rows, cols](const Range& range) -> void {
for(int row = range.start*rowsPerPacket ; row<std::min(rows, range.end*rowsPerPacket) ; ++row)
for(int col = 0 ; col<cols ; ++col)
dstMat.at<unsigned char>(row, col) = localLUT[srcGray.at<unsigned char>(row, col)];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rows are stored as a contiguous arrays.
So we could replace .at() full computation of pointers with precomputed row pointer (col = 0, and .ptr() method) before inner loop:

uchar* dstRow = dstMat.ptr<uchar>(row, 0);
const uchar* srcRow = srcMat.ptr<uchar>(row, 0);
for(int col = 0 ; col<cols ; ++col)
{
    dstMat[col] = localLUT[srcRow[col]];
}

Comment on lines +755 to +756
unsigned char localLUT[256];//small performance improvement by bringing the full LUT locally first
_lut.copyTo(cv::Mat(256, 1, lut_type, localLUT));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we don't need to copy data in case of 1D arrays. They are contiguous by default (except exotic col-based ROI cases).

CV_Assert(lut.isContinuous());
const uchar* lutPtr = lut.ptr<uchar>(0);

use nstripes parameter of parallel_for_
assume _lut is continuous to bring faster pixel indexing
optimize src/dst access by contiguous rows of pixels
do not locally copy the LUT any more, it is no more relevant with the new optimizations
@chacha21
Copy link
Copy Markdown
Contributor Author

chacha21 commented Mar 1, 2022

original 4.x
100000 x cv::applyColorMap(userColor) for size(32,32) and type CV_8UC1: 413.340877 ms
100000 x cv::applyColorMap(userColor) for size(64,64) and type CV_8UC1: 1322.934941 ms
100000 x cv::applyColorMap(userColor) for size(128,128) and type CV_8UC1: 5259.299441 ms
100000 x cv::applyColorMap(userColor) for size(256,256) and type CV_8UC1: 19481.239747 ms
100000 x cv::applyColorMap(userColor) for size(512,512) and type CV_8UC1: 53744.406390 ms
100000 x cv::applyColorMap(userColor) for size(1024,1024) and type CV_8UC1: 204328.241670 ms
100000 x cv::applyColorMap(JET) for size(32,32) and type CV_8UC1: 31242.315972 ms
100000 x cv::applyColorMap(JET) for size(64,64) and type CV_8UC1: 32304.962187 ms
100000 x cv::applyColorMap(JET) for size(128,128) and type CV_8UC1: 36312.397471 ms
100000 x cv::applyColorMap(JET) for size(256,256) and type CV_8UC1: 51336.647169 ms
100000 x cv::applyColorMap(JET) for size(512,512) and type CV_8UC1: 73202.010330 ms
100000 x cv::applyColorMap(JET) for size(1024,1024) and type CV_8UC1: 227124.843895 ms

with patch
100000 x cv::applyColorMap(userColor) for size(32,32) and type CV_8UC1: 137.762362 ms => -67%
100000 x cv::applyColorMap(userColor) for size(64,64) and type CV_8UC1: 474.281744 ms => -64%
100000 x cv::applyColorMap(userColor) for size(128,128) and type CV_8UC1: 1154.152432 ms => -78%
100000 x cv::applyColorMap(userColor) for size(256,256) and type CV_8UC1: 3046.539612 ms => -84%
100000 x cv::applyColorMap(userColor) for size(512,512) and type CV_8UC1: 8813.734257 ms => -84%
100000 x cv::applyColorMap(userColor) for size(1024,1024) and type CV_8UC1: 32218.623244 ms => -84%
100000 x cv::applyColorMap(JET) for size(32,32) and type CV_8UC1: 30845.112720 ms => -1.3%
100000 x cv::applyColorMap(JET) for size(64,64) and type CV_8UC1: 31141.354067 ms => -3.6%
100000 x cv::applyColorMap(JET) for size(128,128) and type CV_8UC1: 31967.281620 ms => -12%
100000 x cv::applyColorMap(JET) for size(256,256) and type CV_8UC1: 35685.712188 ms => -30%
100000 x cv::applyColorMap(JET) for size(512,512) and type CV_8UC1: 40515.865259 ms => -45%
100000 x cv::applyColorMap(JET) for size(1024,1024) and type CV_8UC1: 62885.093060 ms => -72%

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done! Thank you for contribution 👍

@alalek alalek merged commit ebb6915 into opencv:4.x Mar 1, 2022
@opencv-pushbot opencv-pushbot mentioned this pull request Apr 23, 2022
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
…mized

Optimize cv::applyColorMap() for simple case

* Optimize cv::applyColorMap() for simple case

PR for 21640
For regular cv::Mat CV_8UC1 src, applying the colormap is simpler than calling the cv::LUT() mechanism.

* add support for src as CV_8UC3

src as CV_8UC3 is handled with a BGR2GRAY conversion, the same optimized code being used afterwards

* code style

rely on cv::Mat.ptr() to index data

* Move new implementation to ColorMap::operator()

Changes as suggested by reviewer

* style

improvements suggsted by reviewer

* typo

* tune parallel work

* better usage of parallel_for_

use nstripes parameter of parallel_for_
assume _lut is continuous to bring faster pixel indexing
optimize src/dst access by contiguous rows of pixels
do not locally copy the LUT any more, it is no more relevant with the new optimizations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

suboptimal cv::applyColorMap(...) with cv::ColorMap::operator()(...)

2 participants