
cuda4dnn: do not create temporary unused handles #17349

Merged: opencv-pushbot merged 1 commit into opencv:master from YashasSamaga:cuda4dnn-general-fixes on May 23, 2020
Conversation

@YashasSamaga (Contributor) commented May 22, 2020:

Layers that require cuDNN or cuBLAS capabilities hold handles as member objects. When such a layer is default constructed, a useless handle is created, only to be replaced by the actual handle later. Handle creation is costly, so these temporary handles waste time and resources.

Each cv::dnn::Net object now creates exactly one stream, one cuDNN handle and one cuBLAS handle.

This PR makes default-constructed handle objects not create any native handle; a special constructor must be used to actually create one.
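The idea of deferring handle creation can be sketched as follows. This is an illustrative mock, not OpenCV's actual internals: `Handle`, `create_t` and the counter stand in for a real wrapper around something like `cudnnHandle_t`, where creation/destruction would call `cudnnCreate`/`cudnnDestroy`.

```cpp
#include <cassert>

// Mocked stand-in for a costly native handle (e.g. cudnnHandle_t).
static int live_handles = 0;
static int create_native_handle() { return ++live_handles; }
static void destroy_native_handle(int) { --live_handles; }

class Handle {
public:
    Handle() noexcept : handle_(0) {}        // default ctor: creates NO native handle

    struct create_t {};                      // special tag type to request creation
    explicit Handle(create_t) : handle_(create_native_handle()) {}

    ~Handle() { if (handle_) destroy_native_handle(handle_); }

    Handle(const Handle&) = delete;          // handles are unique, not copyable
    Handle& operator=(const Handle&) = delete;

    explicit operator bool() const noexcept { return handle_ != 0; }

private:
    int handle_;
};

// Layers default-construct their handle members; nothing costly happens.
int countAfterDefaultConstruction() {
    Handle a, b, c;
    return live_handles;   // 0: no native handles were created
}
```

With this scheme, only the one handle explicitly created per `cv::dnn::Net` ever exists; default-constructed members stay empty until the real handle is moved in.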

| Model | init time (master) | init time (with this PR) |
| --- | --- | --- |
| ResNet 50 | 260ms | 170ms |
| Enet Cityscapes | 170ms | 30ms |
| YOLOv4 | 630ms | 440ms |
| Inception v2 Mask RCNN | 520ms | 340ms |

Others:

  • drop support for default streams (for better error detection)
  • do not allow moving empty handles into other handles (for better error detection)
  • fix a handle leak in move assignment (not a bug in practice yet, since the buggy move code isn't being used anywhere)
  • avoid creating shared pointers for empty UniqueHandle objects
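The move-semantics points above can be sketched with the same mocked handle type as before. The names (`UniqueHandle`, the asserts, the counter) are illustrative, not the actual cuda4dnn code: moving from an empty handle is treated as a logic error, and move assignment releases the currently owned handle before taking the new one, which is the leak being fixed.

```cpp
#include <cassert>
#include <utility>

// Mocked native handle bookkeeping (stands in for cudnnCreate/cudnnDestroy).
static int live = 0;
static int create_handle() { return ++live; }
static void destroy_handle(int) { --live; }

class UniqueHandle {
public:
    UniqueHandle() noexcept : h_(0) {}
    struct create_t {};
    explicit UniqueHandle(create_t) : h_(create_handle()) {}
    ~UniqueHandle() { if (h_) destroy_handle(h_); }

    UniqueHandle(const UniqueHandle&) = delete;
    UniqueHandle& operator=(const UniqueHandle&) = delete;

    // moving from an empty handle is a logic error (catches misuse early)
    UniqueHandle(UniqueHandle&& other) noexcept : h_(std::exchange(other.h_, 0)) {
        assert(h_ != 0 && "moving an empty handle is not allowed");
    }

    // move assignment must first release the handle it already owns,
    // otherwise that handle leaks
    UniqueHandle& operator=(UniqueHandle&& other) noexcept {
        assert(other.h_ != 0 && "moving an empty handle is not allowed");
        if (h_) destroy_handle(h_);
        h_ = std::exchange(other.h_, 0);
        return *this;
    }

    explicit operator bool() const noexcept { return h_ != 0; }

private:
    int h_;
};

// Demonstrates the leak fix: b's old handle is destroyed when a's moves in.
int liveAfterMoveAssign() {
    UniqueHandle a{UniqueHandle::create_t{}};
    UniqueHandle b{UniqueHandle::create_t{}};
    b = std::move(a);
    return live;   // 1; without the destroy in move assignment it would be 2
}
```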

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under OpenCV (BSD) License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Custom
buildworker:Custom=linux-4
build_image:Custom=ubuntu-cuda:18.04
buildworker:iOS=macosx-2

@YashasSamaga (Contributor, Author) commented:

I had planned on fixing getPerfProfile for the CUDA backend. Currently, the layerwise timings returned by getPerfProfile measure the time taken to dump the layer operation into the CUDA stream; they do not tell how much time the operation actually took on the device.

The only way to obtain proper getPerfProfile results is to place events before and after every layer in the CUDA stream. But this comes at a cost, which appears to be ~1% of inference time on average (I need to redo the measurements on an idle system). I am not sure what to do.

Many samples and tests use getPerfProfile; they all report wrong results when the CUDA backend is used.
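The event-based measurement described above can be sketched with the CUDA runtime API (`cudaEventRecord` / `cudaEventElapsedTime`). This is an illustrative sketch of the idea, not the actual cuda4dnn code; `enqueueLayer` is a hypothetical callback, and the snippet requires a CUDA device to run.

```cpp
#include <cuda_runtime.h>

// Measure how long the work enqueued between 'start' and 'stop' actually
// runs on the GPU, rather than how long the (asynchronous) enqueue call took.
float timeLayerMs(cudaStream_t stream, void (*enqueueLayer)(cudaStream_t)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);   // marker before the layer's kernels
    enqueueLayer(stream);             // asynchronous: returns immediately
    cudaEventRecord(stop, stream);    // marker after the layer's kernels

    cudaEventSynchronize(stop);       // wait until the layer has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU time between the markers

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;   // the value getPerfProfile would need to report per layer
}
```

The ~1% overhead mentioned above would come from recording and synchronizing two such events per layer on every forward pass.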

@alalek (Member) left a comment:


Thank you!



3 participants