Description
System information (version)
- OpenCV => 4.5.4
Internal Computational Graph
The importer's graph must be lowered into a new computational graph specific to the CUDA backend. This would enable better code design, less coupling, and better performance through the fine-grained control it offers.
Benefits:
- CUDA backend code can be removed from `dnn.cpp` (fusions)
- tensors are uncoupled from the `BackendWrapper` system
- graph scheduling
  - CUDA Graphs: +20% faster detection on GPU (CUDA >= 10.1) AlexeyAB/darknet#7444
  - https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
  - dnn: Implement graph (gapi) version of forward functions #17236
- YOLOv3 has three output blobs, and the CUDA backend starts the D2H transfer of each blob as soon as it is available (while the rest of the inference proceeds in parallel). The topological ordering used by the DNN module currently computes the biggest blob last; reordering to compute the biggest blob first would let its transfer overlap with more of the remaining work and reduce inference time.
- no irrelevant constraints on optimizations
- finer granularity of nodes: D2H, H2H, NCHW↔NHWC conversion nodes, etc.
  - these nodes are required to construct a CUDA graph explicitly
  - cuDNN currently converts between tensor formats internally, because the CUDA backend doesn't support NHWC while tensor core convolutions require NHWC
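A minimal sketch of the scheduling half of this idea. `LoweredNode` and the function name are hypothetical (not the actual backend API), and the blob sizes are merely illustrative of a 416×416 YOLOv3 input: output-producing nodes are reordered so that the largest blob is computed, and its D2H transfer started, first.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical lowered node: the CUDA backend graph would contain explicit
// transfer/conversion nodes (D2H, NCHW<->NHWC) alongside compute nodes.
struct LoweredNode {
    std::string name;       // e.g. "yolo_82:D2H"
    std::size_t blobBytes;  // size of the blob this node produces
};

// Schedule output-producing nodes so that the largest blob is computed first;
// its D2H transfer then overlaps with the rest of the inference.
std::vector<LoweredNode> scheduleOutputsLargestFirst(std::vector<LoweredNode> outputs) {
    std::stable_sort(outputs.begin(), outputs.end(),
                     [](const LoweredNode& a, const LoweredNode& b) {
                         return a.blobBytes > b.blobBytes;
                     });
    return outputs;
}
```

This only reorders the output subgraphs; a real implementation would derive the order from the lowered graph's dependencies rather than from a flat list.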
Kernels must be classes instead of functions
All kernels currently have a CPU wrapper that invokes them. In many cases the wrapper performs (mostly negligible) calculations to improve kernel performance: reducing tensor ranks, precomputing constants, etc. A class can cache these results across invocations; a free function cannot.
Benefits:
- amortizes the CPU cost of executing kernels
- allows configuration to be saved to files (perhaps required for autotuning in the future?)
- can be incorporated into the computational graph
- makes runtime compilation and autotuning easy to implement
- can cache performance information to drive runtime autotuning during forward passes
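A sketch of the kernel-as-class idea in plain C++. The class name, parameters, and launch-configuration math are illustrative only; a real implementation would launch a CUDA kernel with the cached `<<<grid, block>>>` configuration:

```cpp
#include <cstddef>

// Hypothetical kernel class: the work that today's free-function wrappers redo
// on every call (rank reduction, constant precomputation, launch configuration)
// is done once in the constructor and cached for every subsequent forward pass.
class BiasActKernel {
public:
    BiasActKernel(std::size_t numElements, float slope)
        : slope_(slope) {
        // Cached launch configuration: computed once, reused on every launch.
        block_ = 256;
        grid_  = (numElements + block_ - 1) / block_;
    }

    // In the real backend this would launch the CUDA kernel with the cached
    // configuration; here it only exposes the cached state.
    std::size_t grid()  const { return grid_; }
    std::size_t block() const { return block_; }
    float slope() const { return slope_; }

private:
    std::size_t grid_ = 0, block_ = 0;
    float slope_;
};
```

The same object is also a natural place to store autotuning results (e.g. the best block size measured so far) or to serialize the configuration to a file.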
Kernel arguments are stored in constant memory, which is as fast as register access when all threads in a warp access the same item (otherwise, the reads are serialized). A lot of precomputed information can therefore be packed into a struct and passed to the kernel. This frees up registers and improves performance.
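A host-side sketch of this argument-packing idea; `SliceParams`, its fields, and the helper are hypothetical. In CUDA, kernel arguments are passed by value through constant memory, and every thread in a warp reads the same stride/offset words, which is exactly the broadcast pattern that makes constant memory register-fast:

```cpp
#include <array>

// Hypothetical parameter pack for a slice kernel. Passing the whole struct by
// value places it in constant memory on the device, freeing the registers the
// kernel would otherwise spend recomputing strides and offsets per thread.
struct SliceParams {
    int rank;                     // effective rank after host-side rank reduction
    std::array<int, 4> inStride;  // precomputed input strides (innermost = 1)
    std::array<int, 4> offset;    // precomputed per-axis slice offsets
};

// Host-side precomputation: done once per layer (and cacheable in a kernel
// class), not once per launch.
SliceParams precomputeSliceParams(const std::array<int, 4>& shape) {
    SliceParams p{};
    p.rank = 4;
    p.inStride[3] = 1;
    for (int i = 2; i >= 0; --i)
        p.inStride[i] = p.inStride[i + 1] * shape[i + 1];
    return p;
}
```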
Issue submission checklist
- I report the issue, it's not a question
- I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc. and have not found a solution
- I updated to the latest OpenCV version and the issue is still there
- There is reproducer code and related data files: videos, images, onnx, etc