Up to 50% longer inference time for the same ONNX model on the same hardware, compiled the same way. What can be the reason? #23223
Description
System Information
OpenCV version: 4.6.0
Operating System / Platform: Ubuntu 20.04 / Windows / WSL2 / Docker
Compiler & compiler version: GCC 9,11, MSVC 2017, 2019, Python 3.10
CUDA: 11.4, 11.8
Detailed description
A word of preface:
I am observing up to 50% longer inference times in one environment compared to another. Both environments run on the same hardware, and OpenCV has been compiled the same way in both. What is interesting is that the performance split does not follow the operating system.
For example: I wrote a C++ executable and ran it on a WSL2 image (Ubuntu 20.04/CUDA 11.8), where my inference rate is 25 FPS, and in a Docker image (Ubuntu 20.04/CUDA 11.8) where I compiled OpenCV exactly the same way and the inference rate is 15 FPS.
I've also tested this on Windows, this time with a Python script. I have two Python environments: in one I get 25 FPS, in the other 15 FPS.
So the problem is not in my code, and it does not depend on the operating system. Each time I've used the same OpenCV version with the same set of build options. I suspect that during compilation something in OpenCV is sometimes compiled differently.
I've dug deeper. I've profiled OpenCV, here are the results:
Faster environment:
ID  name                                              count  thr  t-min   t-max     t-median  t-avg   total      t-IPP %  t-OpenCL %
1   cv::dnn::dnn4_v20220524::Net::forward#net.cpp:93  241    1    34.539  1569.779  35.574    43.279  10430.354  0.000    0.000

Slower environment:
ID  name                                              count  thr  t-min   t-max     t-median  t-avg   total      t-IPP %  t-OpenCL %
1   cv::dnn::dnn4_v20220524::Net::forward#net.cpp:93  241    1    58.370  1596.572  60.892    67.375  16237.484  0.000    0.000
This is the major difference, so I know the problem lies within dnn. I looked at the source and discovered that I can time individual layers, so I did that.
Here are the top 20 worst layer times (in seconds) in the environment where inference is slower:
6.0574 :onnx_node!Slice_19
5.3924 :onnx_node!Slice_29
5.3768 :onnx_node!Slice_9
2.8331 :onnx_node!Slice_4
2.6924 :onnx_node!Slice_24
2.6597 :onnx_node!Slice_14
2.6464 :onnx_node!Slice_34
0.1466 :onnx_node!Concat_40
0.1194 :onnx_node!Slice_39
0.0923 :onnx_node!Mul_43
0.0827 :onnx_node!Mul_271
0.0702 :onnx_node!Mul_46
0.0666 :onnx_node!Concat_190
0.0564 :onnx_node!Mul_297
0.0534 :onnx_node!Mul_180
0.0517 :onnx_node!Mul_70
0.048 :onnx_node!Mul_125
0.0459 :onnx_node!Mul_52
0.045 :onnx_node!Mul_59
0.0449 :onnx_node!Mul_93
...
Total: 36.8912
And here are the times in the environment where it is faster:
5.7526 :onnx_node!Slice_19
3.2699 :onnx_node!Slice_29
3.2128 :onnx_node!Slice_9
2.785 :onnx_node!Slice_14
2.7736 :onnx_node!Slice_24
1.5678 :onnx_node!Slice_34
1.5364 :onnx_node!Slice_4
0.144 :onnx_node!Concat_40
0.1209 :onnx_node!Slice_39
0.0768 :onnx_node!Reshape_342
0.0626 :onnx_node!Mul_243
0.0615 :onnx_node!Mul_46
0.0584 :onnx_node!Mul_70
0.0562 :onnx_node!Reshape_361
0.0551 :onnx_node!Mul_180
0.0541 :onnx_node!Mul_297
0.0536 :onnx_node!Mul_43
0.0507 :onnx_node!Mul_125
0.0494 :onnx_node!Mul_52
0.0474 :onnx_node!Mul_271
...
Total: 27.696
I am not sure whether this is a bug, but the drop in performance is quite serious, and it would be good to know what can cause it so that it can be documented.
Steps to reproduce
This happens for any YOLOv5 model translated to ONNX. The difference can be observed on any model; the bigger the model, the bigger the difference. On my machine I can reproduce it every time by installing a fresh WSL2 image and a fresh Docker image, both based on Ubuntu 20.04. However, as said earlier, Docker is not the problem here, and neither is the operating system or the compiler.
Build settings:
cmake .. -D CMAKE_BUILD_TYPE=RELEASE \
-D WITH_IPP=OFF \
-D WITH_OPENGL=OFF \
-D WITH_QT=OFF \
-D CMAKE_INSTALL_PREFIX=/usr/local \
-D OPENCV_EXTRA_MODULES_PATH=../contrib/modules \
-D OPENCV_ENABLE_NONFREE=ON \
-D WITH_JASPER=OFF \
-D WITH_TBB=ON \
-D BUILD_JPEG=ON \
-D WITH_SIMD=ON \
-D WITH_FFMPEG=ON \
-D ENABLE_LIBJPEG_TURBO_SIMD=ON \
-D BUILD_DOCS=OFF \
-D BUILD_EXAMPLES=OFF \
-D BUILD_TESTS=OFF \
-D BUILD_PERF_TESTS=OFF \
-D BUILD_opencv_java=NO \
-D BUILD_opencv_python=NO \
-D BUILD_opencv_python2=NO \
-D BUILD_opencv_python3=NO \
-D BUILD_CUDA_STUBS=ON \
-D OPENCV_DNN_CUDA=ON \
-D WITH_CUDA=ON \
-D CUDA_ARCH_BIN=7.5 \
-D WITH_GTK=ON \
-D OPENCV_GENERATE_PKGCONFIG=ON
Benchmarking code:
#include <opencv2/opencv.hpp>
#include <iostream>
int main(int argc, char* argv[])
{
cv::Mat img(1080, 1920, CV_8UC3);
cv::randu(img, cv::Scalar(0, 0, 0), cv::Scalar(255, 255, 255));
cv::dnn::Net net = cv::dnn::readNet("/home/test/dev/yolov5m_based.onnx"); // Modify accordingly.
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
cv::Mat blob;
cv::dnn::blobFromImage(img, blob, 1. / 255., cv::Size(640, 640), cv::Scalar(), false, false);
net.setInput(blob);
std::vector<cv::Mat> outputs;
// Don't measure this, GPU needs to warm up.
for (int i = 0; i < 5; ++i)
net.forward(outputs, "output0");
auto c1 = cv::getTickCount();
for (int i = 0; i < 200; ++i)
net.forward(outputs, "output0");
auto c2 = cv::getTickCount();
std::cout << "TOTAL TIME: " << ((c2 - c1) / cv::getTickFrequency() * 1000) << std::endl; // milliseconds
return 0;
}
- You must have Nvidia GPU.
- Install WSL2 image (Ubuntu 20.04).
- Install CUDA 11.8 and cuDNN 8.7 (don't install the driver; install the WSL-specific CUDA packages).
- Build OpenCV with provided options (compiler does not matter).
- Build benchmarking program (compiler does not matter).
- Run program, on my machine TOTAL TIME: 7260.
- Install docker image based on Ubuntu 20.04 (you can do this from within WSL2!)
- Execute steps 2, 3 and 4.
- Run program, on my machine TOTAL TIME: 11061.
Of course there will be some fluctuation in the measured times between runs, but it will be very small.
You can obtain the model from here
Issue submission checklist
- I report the issue, it's not a question
- I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
- I updated to the latest OpenCV version and the issue is still there
- There is reproducer code and related data files (videos, images, onnx, etc)