Performance Loss from OpenCV 4.5.5 to 4.7.0 using CUDA backend

### System Information

OpenCV versions tested: 4.5.5, 4.7.0
Operating System / Platform: Ubuntu 18.04
Device: NVIDIA Jetson TX2 DevKit
CUDA version: 10.2
CUDNN version: 8.2.1

### Detailed description

Hi,

I was using OpenCV 4.5.5 with the CUDA backend on an NVIDIA Jetson TX2 DevKit with the specs listed above. A couple of days ago I decided to update to OpenCV 4.7.0 to check whether it would improve performance for the models I'm currently using. However, what I saw was a performance loss (longer execution time) for the majority of the models. Do you know what causes this loss of performance?

These are the execution times obtained with both versions of OpenCV:

<html>
<body>
<h2>Test 1</h2><ul><li>Device: TX2 DevKit</li><li>CUDA version: 10.2</li><li>CUDNN version: 8.2.1</li><li>OpenCV version: 4.7.0</li></ul> 

Metric | Model 1 | Model 2 | Model 3 | Model 4
-- | -- | -- | -- | --
Input Size | (112, 112) | (112, 112) | (112, 112) | (112, 112)
Model Architecture | Resnet100 | MobileFaceNet | Resnet100 | Resnet18
Jetson CPU (ms) | 702 | 20.5 | 699 | 167
Jetson GPU (ms) | **91.7** | **10.5** | **91.6** | **52.2**


</body>
</html>

<html>
<body>
<h2>Test 2</h2><ul><li>Device: TX2 DevKit</li><li>CUDA version: 10.2</li><li>CUDNN version: 8.2.1</li><li>OpenCV version: 4.5.5</li></ul> 

Metric | Model 1 | Model 2 | Model 3 | Model 4
-- | -- | -- | -- | --
Input Size | (112, 112) | (112, 112) | (112, 112) | (112, 112)
Model Architecture | Resnet100 | MobileFaceNet | Resnet100 | Resnet18
Jetson CPU (ms) | 1088 | 23.1 | 1096 | 257
Jetson GPU (ms) | **60.9** | **5.34** | **60.7** | **19.9**


</body>
</html>

**Note:** Both builds used the same OpenCV flags and requirements; the only thing that changed was the version of opencv and opencv_contrib. All execution times in the tables are in ms.

### Steps to reproduce

You can use this piece of code to reproduce this issue/loss of performance:

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

#include <opencv2/core.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>

int main(int argc, char** argv)
{
    // Expected arguments: image path, ONNX model path, model input width, model input height
    if (argc < 5)
    {
        std::cerr << "Usage: " << argv[0] << " <image> <model.onnx> <input_width> <input_height>" << std::endl;
        return 1;
    }

    std::string imagefilename = argv[1];
    std::string modelToTestOnnx = argv[2];
    int modelInputWidth = std::atoi(argv[3]);
    int modelInputHeight = std::atoi(argv[4]);

    cv::Size currSize(modelInputWidth, modelInputHeight);
    const unsigned int num_inferences = 100;

    cv::dnn::Net net = cv::dnn::readNetFromONNX(modelToTestOnnx);

    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    cv::Mat img = cv::imread(imagefilename, cv::IMREAD_ANYCOLOR);
    if (img.empty())
    {
        std::cerr << "Could not read image: " << imagefilename << std::endl;
        return 1;
    }

    // Resize to the model input size and build a single-image batch
    cv::Mat resized;
    cv::resize(img, resized, currSize);

    std::vector<cv::Mat> imgBatch = { resized };
    bool swaprbchannels = false;
    cv::Mat blob = cv::dnn::blobFromImages(imgBatch, 1.0f / 255.0f, cv::Size(), cv::Scalar(), swaprbchannels, false, CV_32F);

    net.setInput(blob);

    std::vector<cv::String> unconnectedOutLayerNames = net.getUnconnectedOutLayersNames();
    std::vector<cv::Mat> outputs;

    // First forward pass, timed separately from the steady-state runs below
    auto timeLoadModelPlusInference1 = std::chrono::high_resolution_clock::now();
    net.forward(outputs, unconnectedOutLayerNames);
    auto timeLoadModelPlusInference2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> ms_doubleTimeLoadModelPlusInference = timeLoadModelPlusInference2 - timeLoadModelPlusInference1;
    std::cout << "Execution time (load model + inference): " << ms_doubleTimeLoadModelPlusInference.count() << std::endl; // in ms

    // Average time over num_inferences consecutive forward passes
    auto time1 = std::chrono::high_resolution_clock::now();

    try
    {
        for (size_t i = 0; i < num_inferences; i++)
            net.forward(outputs, unconnectedOutLayerNames);
    }
    catch (std::exception& ex)
    {
        std::cout << ex.what() << std::endl;
    }

    auto time2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> ms_double = time2 - time1;
    std::cout << "Execution time inference only: " << ms_double.count() / num_inferences << std::endl; // in ms

    std::cout << "Outputs Size: " << outputs[0].size[0] << "x" << outputs[0].size[1] << std::endl;
    std::cout << "Outputs value: " << outputs[0] << std::endl;

    return 0;
}
```
 

### Issue submission checklist

- [X] I report the issue, it's not a question
- [X] I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
- [X] I updated to the latest OpenCV version and the issue is still there
- [X] There is reproducer code and related data files (videos, images, onnx, etc)
