Description
System information (version)
- OpenCV => 4.5.4
Internal Computational Graph
The importer's graph must be lowered into a new computational graph specific to the CUDA backend. This would enable better code design, less coupling, and better performance through the fine-grained control it offers.
Benefits:
- CUDA backend code can be removed from `dnn.cpp` (fusions)
- tensors are uncoupled from the `BackendWrapper` system
- graph scheduling
  - CUDA Graphs: +20% faster detection on GPU (CUDA >= 10.1) AlexeyAB/darknet#7444
  - https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
  - dnn: Implement graph (gapi) version of forward functions #17236
- YOLOv3 has three output blobs, and the CUDA backend starts the D2H transfer of each blob as soon as it is available (while the rest of the inference proceeds in parallel). The topological ordering used by the DNN module currently computes the biggest blob last; reordering to compute the biggest blob first would let its transfer overlap with more of the remaining work and reduce inference time.
- no irrelevant constraints on optimizations
- finer granularity of nodes: D2H, H2H, NCHW↔NHWC conversion nodes, etc.
  - these nodes are required to construct a CUDA graph explicitly
  - cuDNN currently converts between tensor formats internally, because the CUDA backend doesn't support NHWC while tensor core convolutions require NHWC
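A minimal sketch of the scheduling half of this idea. `LoweredNode` and the function name are hypothetical (not the actual backend API), and the blob sizes are merely illustrative of a 416×416 YOLOv3 input: output-producing nodes are reordered so that the largest blob is computed, and its D2H transfer started, first.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical lowered node: the CUDA backend graph would contain explicit
// transfer/conversion nodes (D2H, NCHW<->NHWC) alongside compute nodes.
struct LoweredNode {
    std::string name;       // e.g. "yolo_82:D2H"
    std::size_t blobBytes;  // size of the blob this node produces
};

// Schedule output-producing nodes so that the largest blob is computed first;
// its D2H transfer then overlaps with the rest of the inference.
std::vector<LoweredNode> scheduleOutputsLargestFirst(std::vector<LoweredNode> outputs) {
    std::stable_sort(outputs.begin(), outputs.end(),
                     [](const LoweredNode& a, const LoweredNode& b) {
                         return a.blobBytes > b.blobBytes;
                     });
    return outputs;
}
```

This only reorders the output subgraphs; a real implementation would derive the order from the lowered graph's dependencies rather than from a flat list.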
Kernels must be classes instead of functions
All kernels currently have a CPU wrapper that invokes them. In many cases the wrapper performs (mostly negligible) calculations to improve kernel performance: reducing tensor ranks, precomputing constants, etc. A class can cache these results across invocations; a free function cannot.
Benefits:
- amortizes the CPU cost of executing kernels
- allows configuration to be saved to files (perhaps required for autotuning in the future?)
- can be incorporated into the computational graph
- makes runtime compilation and autotuning easy to implement
- can cache performance information to drive runtime autotuning during forward passes
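A sketch of the kernel-as-class idea in plain C++. The class name, parameters, and launch-configuration math are illustrative only; a real implementation would launch a CUDA kernel with the cached `<<<grid, block>>>` configuration:

```cpp
#include <cstddef>

// Hypothetical kernel class: the work that today's free-function wrappers redo
// on every call (rank reduction, constant precomputation, launch configuration)
// is done once in the constructor and cached for every subsequent forward pass.
class BiasActKernel {
public:
    BiasActKernel(std::size_t numElements, float slope)
        : slope_(slope) {
        // Cached launch configuration: computed once, reused on every launch.
        block_ = 256;
        grid_  = (numElements + block_ - 1) / block_;
    }

    // In the real backend this would launch the CUDA kernel with the cached
    // configuration; here it only exposes the cached state.
    std::size_t grid()  const { return grid_; }
    std::size_t block() const { return block_; }
    float slope() const { return slope_; }

private:
    std::size_t grid_ = 0, block_ = 0;
    float slope_;
};
```

The same object is also a natural place to store autotuning results (e.g. the best block size measured so far) or to serialize the configuration to a file.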
Kernel arguments are stored in constant memory, which is as fast as register access when all threads in a warp access the same item (otherwise, the reads are serialized). A lot of precomputed information can therefore be packed into a struct and passed to the kernel. This frees up registers and improves performance.
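A host-side sketch of this argument-packing idea; `SliceParams`, its fields, and the helper are hypothetical. In CUDA, kernel arguments are passed by value through constant memory, and every thread in a warp reads the same stride/offset words, which is exactly the broadcast pattern that makes constant memory register-fast:

```cpp
#include <array>

// Hypothetical parameter pack for a slice kernel. Passing the whole struct by
// value places it in constant memory on the device, freeing the registers the
// kernel would otherwise spend recomputing strides and offsets per thread.
struct SliceParams {
    int rank;                     // effective rank after host-side rank reduction
    std::array<int, 4> inStride;  // precomputed input strides (innermost = 1)
    std::array<int, 4> offset;    // precomputed per-axis slice offsets
};

// Host-side precomputation: done once per layer (and cacheable in a kernel
// class), not once per launch.
SliceParams precomputeSliceParams(const std::array<int, 4>& shape) {
    SliceParams p{};
    p.rank = 4;
    p.inStride[3] = 1;
    for (int i = 2; i >= 0; --i)
        p.inStride[i] = p.inStride[i + 1] * shape[i + 1];
    return p;
}
```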
Issue submission checklist
- I report the issue, it's not a question
- I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc. and have not found a solution
- I updated to the latest OpenCV version and the issue is still there
- There is reproducer code and related data files: videos, images, onnx, etc