Up to 50% longer inference time for the same ONNX model on the same hardware, compiled the same way. What can be the reason? #23223
Description
System Information
OpenCV version: 4.6.0
Operating System / Platform: Ubuntu 20.04 / Windows / WSL2 / Docker
Compiler & compiler version: GCC 9,11, MSVC 2017, 2019, Python 3.10
CUDA: 11.4, 11.8
Detailed description
A word of preface:
I am observing up to 50% longer inference times in one environment compared to another. Both environments run on the same hardware, and OpenCV has been compiled the same way in both. What is interesting is that the performance split does not follow the operating system.
For example: I wrote a C++ executable and ran it on a WSL2 image (Ubuntu 20.04/CUDA 11.8), where my inference rate is 25 FPS, and in a Docker image (Ubuntu 20.04/CUDA 11.8) where I compiled OpenCV exactly the same way and the inference rate is 15 FPS.
I've also tested this on Windows, this time with a Python script. I have two Python environments: in one I get 25 FPS, in the other 15 FPS.
So the problem is not in my code, and it does not depend on the operating system. Each time I've used the same OpenCV version with the same set of build options. I suspect that during compilation something in OpenCV is sometimes compiled differently.
I've dug deeper. I've profiled OpenCV, here are the results:
Faster environment:
ID  name                                              count  thr  t-min   t-max     t-median  t-avg   total      t-IPP %  t-OpenCL %
1   cv::dnn::dnn4_v20220524::Net::forward#net.cpp:93  241    1    34.539  1569.779  35.574    43.279  10430.354  0.000    0.000

Slower environment:
ID  name                                              count  thr  t-min   t-max     t-median  t-avg   total      t-IPP %  t-OpenCL %
1   cv::dnn::dnn4_v20220524::Net::forward#net.cpp:93  241    1    58.370  1596.572  60.892    67.375  16237.484  0.000    0.000
This is the major difference, so I know the problem lies within dnn. I looked at the source and discovered that I can time individual layers, so I did that.
Here are the top 20 worst layer times (in seconds) in the environment where inference is slower:
6.0574 :onnx_node!Slice_19
5.3924 :onnx_node!Slice_29
5.3768 :onnx_node!Slice_9
2.8331 :onnx_node!Slice_4
2.6924 :onnx_node!Slice_24
2.6597 :onnx_node!Slice_14
2.6464 :onnx_node!Slice_34
0.1466 :onnx_node!Concat_40
0.1194 :onnx_node!Slice_39
0.0923 :onnx_node!Mul_43
0.0827 :onnx_node!Mul_271
0.0702 :onnx_node!Mul_46
0.0666 :onnx_node!Concat_190
0.0564 :onnx_node!Mul_297
0.0534 :onnx_node!Mul_180
0.0517 :onnx_node!Mul_70
0.048 :onnx_node!Mul_125
0.0459 :onnx_node!Mul_52
0.045 :onnx_node!Mul_59
0.0449 :onnx_node!Mul_93
...
Total: 36.8912
And here are the times in the environment where it is faster:
5.7526 :onnx_node!Slice_19
3.2699 :onnx_node!Slice_29
3.2128 :onnx_node!Slice_9
2.785 :onnx_node!Slice_14
2.7736 :onnx_node!Slice_24
1.5678 :onnx_node!Slice_34
1.5364 :onnx_node!Slice_4
0.144 :onnx_node!Concat_40
0.1209 :onnx_node!Slice_39
0.0768 :onnx_node!Reshape_342
0.0626 :onnx_node!Mul_243
0.0615 :onnx_node!Mul_46
0.0584 :onnx_node!Mul_70
0.0562 :onnx_node!Reshape_361
0.0551 :onnx_node!Mul_180
0.0541 :onnx_node!Mul_297
0.0536 :onnx_node!Mul_43
0.0507 :onnx_node!Mul_125
0.0494 :onnx_node!Mul_52
0.0474 :onnx_node!Mul_271
...
Total: 27.696
I am not sure whether this is a bug, but the drop in performance is quite serious, and it would be good to know what can cause it so that it can be documented.
Steps to reproduce
This happens for any YOLOv5 model translated to ONNX. The difference can be observed on any model; the bigger the model, the bigger the difference. On my machine I can reproduce it every time by installing a fresh WSL2 image and a fresh Docker image, both based on Ubuntu 20.04. However, as said earlier, Docker is not the problem here, and neither is the operating system or the compiler.
Build settings:
cmake .. -D CMAKE_BUILD_TYPE=RELEASE \
-D WITH_IPP=OFF \
-D WITH_OPENGL=OFF \
-D WITH_QT=OFF \
-D CMAKE_INSTALL_PREFIX=/usr/local \
-D OPENCV_EXTRA_MODULES_PATH=../contrib/modules \
-D OPENCV_ENABLE_NONFREE=ON \
-D WITH_JASPER=OFF \
-D WITH_TBB=ON \
-D BUILD_JPEG=ON \
-D WITH_SIMD=ON \
-D WITH_FFMPEG=ON \
-D ENABLE_LIBJPEG_TURBO_SIMD=ON \
-D BUILD_DOCS=OFF \
-D BUILD_EXAMPLES=OFF \
-D BUILD_TESTS=OFF \
-D BUILD_PERF_TESTS=OFF \
-D BUILD_opencv_java=NO \
-D BUILD_opencv_python=NO \
-D BUILD_opencv_python2=NO \
-D BUILD_opencv_python3=NO \
-D BUILD_CUDA_STUBS=ON \
-D OPENCV_DNN_CUDA=ON \
-D WITH_CUDA=ON \
-D CUDA_ARCH_BIN=7.5 \
-D WITH_GTK=ON \
-D OPENCV_GENERATE_PKGCONFIG=ON
Benchmarking code:
#include <opencv2/opencv.hpp>
#include <iostream>
int main(int argc, char* argv[])
{
cv::Mat img(1080, 1920, CV_8UC3);
cv::randu(img, cv::Scalar(0, 0, 0), cv::Scalar(255, 255, 255));
cv::dnn::Net net = cv::dnn::readNet("/home/test/dev/yolov5m_based.onnx"); // Modify accordingly.
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
cv::Mat blob;
cv::dnn::blobFromImage(img, blob, 1. / 255., cv::Size(640, 640), cv::Scalar(), false, false);
net.setInput(blob);
std::vector<cv::Mat> outputs;
// Don't measure this, GPU needs to warm up.
for (int i = 0; i < 5; ++i)
net.forward(outputs, "output0");
auto c1 = cv::getTickCount();
for (int i = 0; i < 200; ++i)
net.forward(outputs, "output0");
auto c2 = cv::getTickCount();
std::cout << "TOTAL TIME: " << ((c2 - c1) / cv::getTickFrequency() * 1000) << std::endl; // milliseconds
return 0;
}
- You must have Nvidia GPU.
- Install WSL2 image (Ubuntu 20.04).
- Install CUDA 11.8 and cuDNN 8.7 (don't install the driver; install the WSL-specific CUDA packages).
- Build OpenCV with provided options (compiler does not matter).
- Build benchmarking program (compiler does not matter).
- Run program, on my machine TOTAL TIME: 7260.
- Install docker image based on Ubuntu 20.04 (you can do this from within WSL2!)
- Execute steps 2, 3 and 4.
- Run program, on my machine TOTAL TIME: 11061.
Of course there will be some fluctuation in the measured times between runs, but it will be very small.
You can obtain the model from here
Issue submission checklist
- I report the issue, it's not a question
- I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
- I updated to the latest OpenCV version and the issue is still there
- There is reproducer code and related data files (videos, images, onnx, etc)