
cuda4dnn: do not create temporary unused handles #17349

Merged: opencv-pushbot merged 1 commit into opencv:master from YashasSamaga:cuda4dnn-general-fixes on May 23, 2020
Conversation

@YashasSamaga (Contributor) commented May 22, 2020:

Layers that require cuDNN or cuBLAS capabilities hold handles as member objects. When such a layer is default constructed, a useless handle is created, only to be replaced by the actual handle later. Handle creation is costly, so these temporary handles waste time and resources.

Each cv::dnn::Net object now creates exactly one stream, one cuDNN handle and one cuBLAS handle.

This PR makes default-constructed handle objects not create any native handle; a special constructor must be used to actually create one.
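The idea of deferring handle creation can be sketched as follows. This is an illustrative mock, not OpenCV's actual internals: `Handle`, `create_t` and the counter stand in for a real wrapper around something like `cudnnHandle_t`, where creation/destruction would call `cudnnCreate`/`cudnnDestroy`.

```cpp
#include <cassert>

// Mocked stand-in for a costly native handle (e.g. cudnnHandle_t).
static int live_handles = 0;
static int create_native_handle() { return ++live_handles; }
static void destroy_native_handle(int) { --live_handles; }

class Handle {
public:
    Handle() noexcept : handle_(0) {}        // default ctor: creates NO native handle

    struct create_t {};                      // special tag type to request creation
    explicit Handle(create_t) : handle_(create_native_handle()) {}

    ~Handle() { if (handle_) destroy_native_handle(handle_); }

    Handle(const Handle&) = delete;          // handles are unique, not copyable
    Handle& operator=(const Handle&) = delete;

    explicit operator bool() const noexcept { return handle_ != 0; }

private:
    int handle_;
};

// Layers default-construct their handle members; nothing costly happens.
int countAfterDefaultConstruction() {
    Handle a, b, c;
    return live_handles;   // 0: no native handles were created
}
```

With this scheme, only the one handle explicitly created per `cv::dnn::Net` ever exists; default-constructed members stay empty until the real handle is moved in.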

| Model | init time (master) | init time (with this PR) |
| --- | --- | --- |
| ResNet 50 | 260ms | 170ms |
| Enet Cityscapes | 170ms | 30ms |
| YOLOv4 | 630ms | 440ms |
| Inception v2 Mask RCNN | 520ms | 340ms |

Others:

  • drop support for default streams (for better error detection)
  • do not allow moving empty handles into other handles (for better error detection)
  • fix a handle leak in move assignment (not a bug in practice yet, since the buggy move code isn't being used anywhere)
  • avoid creating shared pointers for empty UniqueHandle objects
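The move-semantics points above can be sketched with the same mocked handle type as before. The names (`UniqueHandle`, the asserts, the counter) are illustrative, not the actual cuda4dnn code: moving from an empty handle is treated as a logic error, and move assignment releases the currently owned handle before taking the new one, which is the leak being fixed.

```cpp
#include <cassert>
#include <utility>

// Mocked native handle bookkeeping (stands in for cudnnCreate/cudnnDestroy).
static int live = 0;
static int create_handle() { return ++live; }
static void destroy_handle(int) { --live; }

class UniqueHandle {
public:
    UniqueHandle() noexcept : h_(0) {}
    struct create_t {};
    explicit UniqueHandle(create_t) : h_(create_handle()) {}
    ~UniqueHandle() { if (h_) destroy_handle(h_); }

    UniqueHandle(const UniqueHandle&) = delete;
    UniqueHandle& operator=(const UniqueHandle&) = delete;

    // moving from an empty handle is a logic error (catches misuse early)
    UniqueHandle(UniqueHandle&& other) noexcept : h_(std::exchange(other.h_, 0)) {
        assert(h_ != 0 && "moving an empty handle is not allowed");
    }

    // move assignment must first release the handle it already owns,
    // otherwise that handle leaks
    UniqueHandle& operator=(UniqueHandle&& other) noexcept {
        assert(other.h_ != 0 && "moving an empty handle is not allowed");
        if (h_) destroy_handle(h_);
        h_ = std::exchange(other.h_, 0);
        return *this;
    }

    explicit operator bool() const noexcept { return h_ != 0; }

private:
    int h_;
};

// Demonstrates the leak fix: b's old handle is destroyed when a's moves in.
int liveAfterMoveAssign() {
    UniqueHandle a{UniqueHandle::create_t{}};
    UniqueHandle b{UniqueHandle::create_t{}};
    b = std::move(a);
    return live;   // 1; without the destroy in move assignment it would be 2
}
```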

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under OpenCV (BSD) License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Custom
buildworker:Custom=linux-4
build_image:Custom=ubuntu-cuda:18.04
buildworker:iOS=macosx-2

@YashasSamaga (Contributor, Author) commented:

I had planned on fixing getPerfProfile for the CUDA backend. Currently, the layerwise timings returned by getPerfProfile measure the time taken to dump the layer operation into the CUDA stream; they do not tell how much time the operation actually took on the device.

The only way to obtain proper getPerfProfile results is to place events before and after every layer in the CUDA stream. But this comes at a cost, which appears to be ~1% of inference time on average (I need to redo the measurements on an idle system). I am not sure what to do.

Many samples and tests use getPerfProfile; they all report wrong results when the CUDA backend is used.
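The event-based measurement described above can be sketched with the CUDA runtime API (`cudaEventRecord` / `cudaEventElapsedTime`). This is an illustrative sketch of the idea, not the actual cuda4dnn code; `enqueueLayer` is a hypothetical callback, and the snippet requires a CUDA device to run.

```cpp
#include <cuda_runtime.h>

// Measure how long the work enqueued between 'start' and 'stop' actually
// runs on the GPU, rather than how long the (asynchronous) enqueue call took.
float timeLayerMs(cudaStream_t stream, void (*enqueueLayer)(cudaStream_t)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);   // marker before the layer's kernels
    enqueueLayer(stream);             // asynchronous: returns immediately
    cudaEventRecord(stop, stream);    // marker after the layer's kernels

    cudaEventSynchronize(stop);       // wait until the layer has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU time between the markers

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;   // the value getPerfProfile would need to report per layer
}
```

The ~1% overhead mentioned above would come from recording and synchronizing two such events per layer on every forward pass.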

@alalek (Member) left a comment:


Thank you!



3 participants