restructuring CUDA backend to use its own computational graph and manage memory #20966

@YashasSamaga

Description

System information (version)
  • OpenCV => 4.5.4

Internal Computational Graph

The importer's graph must be lowered into a new computational graph that is specific to the CUDA backend. This would enable better code design, looser coupling, and better performance through the fine-grained control it offers.
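As a minimal sketch of what such a backend-owned graph could look like (the `Node`/`Graph` names and layout are hypothetical, not OpenCV's actual dnn API):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a CUDA-backend-specific computational graph.
// Each node records its producers and owns a backend-specific launch
// callable, giving the backend full control over scheduling and memory.
struct Node {
    std::string op;                   // e.g. "Conv2D", "ReLU"
    std::vector<std::size_t> inputs;  // indices of producer nodes
    std::function<void()> kernel;     // backend-specific kernel launch
};

class Graph {
public:
    std::size_t addNode(Node node) {
        nodes_.push_back(std::move(node));
        return nodes_.size() - 1;
    }

    // Execute nodes in insertion order (assumed topologically sorted).
    void forward() const {
        for (const auto& n : nodes_)
            if (n.kernel) n.kernel();
    }

private:
    std::vector<Node> nodes_;
};
```

Because the backend owns this graph, passes such as fusion or memory planning could rewrite it without touching the importer's representation.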

Benefits:

Kernels must be classes instead of functions

All the kernels currently have a CPU wrapper that invokes the kernel. In many cases, the wrapper performs calculations (mostly negligible) for reducing tensor ranks, precomputing constants, etc. to improve kernel performance. Unlike with a function, these results can be cached in a class.

Benefits:

  • amortizes the CPU cost of executing kernels
  • allows configuration to be saved to files (perhaps required for autotuning in the future?)
  • can be incorporated into the computational graph
  • makes runtime compilation and autotuning easy to implement
  • can cache performance information to drive runtime autotuning during forward passes
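For illustration, a kernel wrapper recast as a class might look like the following sketch (the class and member names are hypothetical; the launch is a plain host loop so the sketch stays self-contained):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a kernel wrapper as a class instead of a function.
// The rank reduction / constant precomputation that today's wrapper
// functions redo on every call is performed once in the constructor
// and cached, amortizing the CPU cost across forward passes.
class BiasAddKernel {
public:
    explicit BiasAddKernel(const std::vector<std::size_t>& shape) {
        total_ = 1;
        for (std::size_t d : shape) total_ *= d;  // cached once
    }

    // In the real backend this would launch a CUDA kernel; here it is
    // a plain loop for the sake of a runnable example.
    void run(float* data, float bias) const {
        for (std::size_t i = 0; i < total_; ++i)
            data[i] += bias;
    }

    std::size_t elementCount() const { return total_; }

private:
    std::size_t total_;  // precomputed, reused on every run()
};
```

A stateful wrapper like this is also what makes serializing configuration to files, or recording per-kernel timings for autotuning, straightforward to bolt on later.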

Kernel arguments are stored in constant memory, which is as fast as register access when all the threads in a warp access the same item (otherwise, the reads are serialized). A lot of precomputed information can be packed into a struct and passed to the kernel. This frees up registers and also improves performance.
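A hedged sketch of the idea (the struct layout is hypothetical): in CUDA, a struct passed by value to a `__global__` function lands in constant memory along with the other kernel arguments; here the "kernel" is a host function so the example runs without a GPU.

```cpp
#include <cstddef>

// Hypothetical parameter pack for an elementwise kernel. In the CUDA
// backend this struct would be passed by value to the __global__
// function, so all precomputed values sit in constant memory and every
// thread reads them at near-register speed.
struct EltwiseParams {
    std::size_t count;  // flattened element count (rank already reduced)
    float scale;        // constant folded on the host
    float shift;
};

// Stand-in for the kernel body: it reads everything from one struct,
// freeing what would otherwise be several per-thread registers.
inline void eltwise_kernel(float* data, EltwiseParams p) {
    for (std::size_t i = 0; i < p.count; ++i)
        data[i] = data[i] * p.scale + p.shift;
}
```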

Issue submission checklist
  • I report the issue; it's not a question
  • I checked the problem with documentation, FAQ, open issues,
    forum.opencv.org, Stack Overflow, etc. and have not found a solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files: videos, images, onnx, etc.
