Optimize cv::applyColorMap() for simple case#21645
Conversation
PR for opencv#21640 For regular cv::Mat CV_8UC1 src, applying the colormap is simpler than calling the cv::LUT() mechanism.
src as CV_8UC3 is handled with a BGR2GRAY conversion, the same optimized code being used afterwards
modules/imgproc/src/colormap.cpp
Outdated
| srcGray.forEach<unsigned char>([&](unsigned char& pixel, const int* position) -> void { | ||
| const int row = position[0]; | ||
| const int col = position[1]; | ||
| memcpy(dstMat.data+row*dstMat.step+col*elemSize, userColorMat.data+pixel*userColorMat.step, elemSize); |
There was a problem hiding this comment.
memcpy(..., elemSize)
Provide performance numbers without and with your patch.
There was a problem hiding this comment.
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Windows 7 64b
//testing cv::applyColorMap
{
cv::setRNGSeed(1234);
cv::Mat colorMap = cv::Mat(256, 1, CV_8UC3);
cv::randu(colorMap, 0, 255);
std::vector<cv::Size> sizes = {{32, 32}, {64, 64}, {128, 128}, {256, 256}, {512, 512}, {1024, 1024}};
for(const auto& size : sizes)
{
cv::Mat src(size, CV_8UC1);//CV_8UC1//CV_8UC3
cv::Mat dst(size, CV_8UC3);
__int64 totalTicks = 0;
const size_t nbIters = 20000;
for(size_t i = 0 ; i <nbIters ; ++i)
{
cv::randu(src, 0, 255);
const auto ref = cv::getTickCount();
cv::applyColorMap(src, dst, colorMap);
const auto now = cv::getTickCount();
totalTicks += (now-ref);
}
printf("%llu x cv::applyColorMap() for size(%d,%d) and type %s: %f ms\r\n",
(unsigned long long)nbIters,
size.width, size.height,
(src.type() == CV_8UC1) ? "CV_8UC1" :
(src.type() == CV_8UC3) ? "CV_8UC3" :
"?",
1000.*totalTicks/cv::getTickFrequency());
}
cv::waitKey();
}
results :
CV_8UC1
original code :
20000 x cv::applyColorMap() for size(32,32) and type CV_8UC1: 88.819806 ms
20000 x cv::applyColorMap() for size(64,64) and type CV_8UC1: 269.382595 ms
20000 x cv::applyColorMap() for size(128,128) and type CV_8UC1: 1035.405428 ms
20000 x cv::applyColorMap() for size(256,256) and type CV_8UC1: 3899.305621 ms
20000 x cv::applyColorMap() for size(512,512) and type CV_8UC1: 11115.692492 ms
20000 x cv::applyColorMap() for size(1024,1024) and type CV_8UC1: 39989.798904 ms
with patch :
20000 x cv::applyColorMap() for size(32,32) and type CV_8UC1: 105.990024 ms
20000 x cv::applyColorMap() for size(64,64) and type CV_8UC1: 178.348563 ms
20000 x cv::applyColorMap() for size(128,128) and type CV_8UC1: 390.924573 ms
20000 x cv::applyColorMap() for size(256,256) and type CV_8UC1: 1185.652095 ms
20000 x cv::applyColorMap() for size(512,512) and type CV_8UC1: 3715.295696 ms
20000 x cv::applyColorMap() for size(1024,1024) and type CV_8UC1: 15669.887541 ms
CV_8UC3
original code :
20000 x cv::applyColorMap() for size(32,32) and type CV_8UC3: 159.482952 ms
20000 x cv::applyColorMap() for size(64,64) and type CV_8UC3: 351.207949 ms
20000 x cv::applyColorMap() for size(128,128) and type CV_8UC3: 1232.039576 ms
20000 x cv::applyColorMap() for size(256,256) and type CV_8UC3: 7554.557066 ms
20000 x cv::applyColorMap() for size(512,512) and type CV_8UC3: 17448.084637 ms
20000 x cv::applyColorMap() for size(1024,1024) and type CV_8UC3: 69502.533269 ms
with patch :
20000 x cv::applyColorMap() for size(32,32) and type CV_8UC3: 145.020181 ms
20000 x cv::applyColorMap() for size(64,64) and type CV_8UC3: 263.634428 ms
20000 x cv::applyColorMap() for size(128,128) and type CV_8UC3: 690.046396 ms
20000 x cv::applyColorMap() for size(256,256) and type CV_8UC3: 1889.553916 ms
20000 x cv::applyColorMap() for size(512,512) and type CV_8UC3: 5617.506280 ms
20000 x cv::applyColorMap() for size(1024,1024) and type CV_8UC3: 22068.121624 ms
modules/imgproc/src/colormap.cpp
Outdated
| srcGray.forEach<unsigned char>([&](unsigned char& pixel, const int* position) -> void { | ||
| const int row = position[0]; | ||
| const int col = position[1]; | ||
| memcpy(dstMat.data+row*dstMat.step+col*elemSize, userColorMat.data+pixel*userColorMat.step, elemSize); |
There was a problem hiding this comment.
+row*dstMat.step+col*elemSize
+pixel*userColorMat.step
Manual pointer calculations should be avoided.
There was a problem hiding this comment.
What would be a more canonical way to write that without branching one if() per possible userColorMat.type() ?
[edit]
Ok, figured it out
There was a problem hiding this comment.
You have re-implemented LUT() functionality in this forEach block.
In general this not a good approach:
- we have separate code for duplicated functionality.
- code optimization level is not good (memcpy call with non-fixed dynamic elemSize value)
You should use keep using optimized LUT and tune colormap::UserColorMap implementation instead.
Unfortunately this function is not tested at all:
- there are no accuracy tests
- there are no performance tests.
There was a problem hiding this comment.
You have re-implemented LUT() functionality in this forEach block.
Not really. cv::LUT() requires the same number of channels for src and dst, while in the context of applyColorMap(), src can be CV_8UC1, and dst CV_8UC3, so this is is more a "fast path" to avoid converting input on the sole purpose of making it "cv::LUT compliant"
there are no accuracy tests
I do not know how to show that, but I have run tests to check numerical identical results. (basically, not applying the patch, and comparing "gold" results to custom implementation through countNonZerot(absdiff(..))
there are no performance tests.
I do no know how to add such tests to the OpenCV build/test system, but I have already edited the current PR with a copy/paste of my timings
code optimization level is not good (memcpy call with non-fixed dynamic elemSize value)
branching fixed-size memcpy() for (elemSize == 1) and (elemSize == 3) improves the timings, but I thought it was not a good cv style
There was a problem hiding this comment.
Not really.
Right. Got it. cv::LUT() doesn't support required mode.
To get PR merged please move optimization code to the real processing implementation:
void ColorMap::operator()(InputArray _src, OutputArray _dst) const
Next, There are several assumptions about src/dst/userColor mats. Need to add the right checks about these assumptions (code wont process 3D tensors properly). Reuse applicable checks from cv::LUT() implementation.
BTW, prefer to use CV_Check*() macros instead of CV_Assert().
BTW, probably it makes sense to deprecate and remove later support case with src.channels = 3.
rely on cv::Mat.ptr() to index data
Changes as suggested by reviewer
alalek
left a comment
There was a problem hiding this comment.
Thank you for update! Overall looks good.
modules/imgproc/src/colormap.cpp
Outdated
| srcGray.forEach<unsigned char>([&](unsigned char& pixel, const int* position) -> void { | ||
| const int row = position[0]; | ||
| const int col = position[1]; | ||
| dstMat.at<unsigned char>(row, col) = lut.at<unsigned char>(pixel, 0); |
There was a problem hiding this comment.
lut.at<unsigned char>(pixel, 0);
Please use lut.at<unsigned char>(pixel); as there is no difference in processing of 1D vectors as 2D matrices (both row/col-based are handled by .at<>() method)
.reshape() call above could be removed after that change.
modules/imgproc/src/colormap.cpp
Outdated
|
|
||
| const Mat lut = this->_lut.reshape(0, 256);//just in case LUT was in row format rather than col | ||
| const int lut_type = lut.type(), lut_depth = CV_MAT_DEPTH(lut_type), lut_cn = CV_MAT_CN(lut_type); | ||
| CV_CheckType(lut_type, (lut_depth == CV_8U) && (lut_cn == 1 || lut_cn == 3), |
There was a problem hiding this comment.
(lut_depth == CV_8U) && (lut_cn == 1 || lut_cn == 3)
this should be equal and simplified condition:
lut_type == CV_8UC1 || lut_type == CV_8UC3
modules/imgproc/src/colormap.cpp
Outdated
| if (lut_type == CV_8UC1) | ||
| srcGray.forEach<unsigned char>([&](unsigned char& pixel, const int* position) -> void { | ||
| const int row = position[0]; | ||
| const int col = position[1]; |
There was a problem hiding this comment.
_src.dims() == 2 "assumption" check should be added.
It can be noted that small sizes seem not perform as well, especially for built-in colormaps. |
|
After tuning parallel work, timings are improved. But it is certainly heavily dependant on the numbers of CPU core/threads. |
alalek
left a comment
There was a problem hiding this comment.
Some comments about possible optimization improvements.
Please let me know when it is ready for merging.
modules/imgproc/src/colormap.cpp
Outdated
| const int minimalPixelsPerPacket = 1<<12; | ||
| const int rowsPerPacket = std::max(1, minimalPixelsPerPacket/srcGray.cols); | ||
| const int rowsPacketCount = (srcGray.rows+rowsPerPacket-1)/rowsPerPacket; | ||
| const int rows = srcGray.rows; | ||
| const int cols = srcGray.cols; | ||
| Range all(0, rowsPacketCount); |
There was a problem hiding this comment.
Looks like this prolog code is common between branches from lut_type == CV_8UC1 check
modules/imgproc/src/colormap.cpp
Outdated
| if (rowsPacketCount <= 1) | ||
| body(all); | ||
| else |
There was a problem hiding this comment.
We don't really need rowsPacketCount <= 1 code path - parallel_for() does the same thing internally.
BTW, there is third parameter of parallel_for_() - using of that parameter would allow to eliminate rowsPerPacket variable.
modules/imgproc/src/colormap.cpp
Outdated
| auto body = [&, rows, cols](const Range& range) -> void { | ||
| for(int row = range.start*rowsPerPacket ; row<std::min(rows, range.end*rowsPerPacket) ; ++row) | ||
| for(int col = 0 ; col<cols ; ++col) | ||
| dstMat.at<unsigned char>(row, col) = localLUT[srcGray.at<unsigned char>(row, col)]; |
There was a problem hiding this comment.
Rows are stored as a contiguous arrays.
So we could replace .at() full computation of pointers with precomputed row pointer (col = 0, and .ptr() method) before inner loop:
uchar* dstRow = dstMat.ptr<uchar>(row, 0);
const uchar* srcRow = srcMat.ptr<uchar>(row, 0);
for(int col = 0 ; col<cols ; ++col)
{
dstMat[col] = localLUT[srcRow[col]];
}
modules/imgproc/src/colormap.cpp
Outdated
| unsigned char localLUT[256];//small performance improvement by bringing the full LUT locally first | ||
| _lut.copyTo(cv::Mat(256, 1, lut_type, localLUT)); |
There was a problem hiding this comment.
Usually we don't need to copy data in case of 1D arrays. They are contiguous by default (except exotic col-based ROI cases).
CV_Assert(lut.isContinuous());
const uchar* lutPtr = lut.ptr<uchar>(0);
use nstripes parameter of parallel_for_ assume _lut is continuous to bring faster pixel indexing optimize src/dst access by contiguous rows of pixels do not locally copy the LUT any more, it is no more relevant with the new optimizations
|
alalek
left a comment
There was a problem hiding this comment.
Well done! Thank you for contribution 👍
…mized Optimize cv::applyColorMap() for simple case * Optimize cv::applyColorMap() for simple case PR for 21640 For regular cv::Mat CV_8UC1 src, applying the colormap is simpler than calling the cv::LUT() mechanism. * add support for src as CV_8UC3 src as CV_8UC3 is handled with a BGR2GRAY conversion, the same optimized code being used afterwards * code style rely on cv::Mat.ptr() to index data * Move new implementation to ColorMap::operator() Changes as suggested by reviewer * style improvements suggsted by reviewer * typo * tune parallel work * better usage of parallel_for_ use nstripes parameter of parallel_for_ assume _lut is continuous to bring faster pixel indexing optimize src/dst access by contiguous rows of pixels do not locally copy the LUT any more, it is no more relevant with the new optimizations
proposal for #21640
For regular cv::Mat CV_8UC1 src, applying the colormap is simpler than calling the cv::LUT() mechanism.
Patch to opencv_extra has the same branch name.