cuda4dnn: improve host-device transfer performance#16230

Merged
opencv-pushbot merged 1 commit into opencv:master from YashasSamaga:cuda4dnn-fp-conversion on Jan 5, 2020

@YashasSamaga YashasSamaga commented Dec 24, 2019

This pull request:

  • performs fp conversions on GPU
  • eliminates the costly intermediate host memory allocation required for every transfer involving fp16 data
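
To picture the intermediate host allocation mentioned in the second bullet, here is a minimal host-only sketch of what a CPU conversion path could look like. It is not the cuda4dnn code, which this thread does not show: `float_to_half_bits` and `stage_as_fp16` are hypothetical names, and the sketch assumes fp16 means IEEE 754 binary16 with round-to-nearest-even.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: convert one fp32 value to IEEE 754 binary16 bits,
// rounding to nearest even.
std::uint16_t float_to_half_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bit pattern

    const std::uint32_t sign = (bits >> 16) & 0x8000u;
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;
    std::uint32_t mantissa = bits & 0x007FFFFFu;

    if (exponent == 0xFFu) {  // infinity or NaN
        return static_cast<std::uint16_t>(sign | 0x7C00u | (mantissa ? 0x0200u : 0u));
    }

    const int half_exponent = static_cast<int>(exponent) - 127 + 15;
    if (half_exponent >= 31) {  // too large for fp16: becomes infinity
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    }

    if (half_exponent <= 0) {  // result is subnormal or zero in fp16
        if (half_exponent < -10) return static_cast<std::uint16_t>(sign);
        mantissa |= 0x00800000u;               // restore the implicit leading 1
        const int shift = 14 - half_exponent;  // between 14 and 24
        std::uint32_t half_mantissa = mantissa >> shift;
        const std::uint32_t remainder = mantissa & ((1u << shift) - 1u);
        const std::uint32_t halfway = 1u << (shift - 1);
        if (remainder > halfway || (remainder == halfway && (half_mantissa & 1u))) {
            ++half_mantissa;
        }
        return static_cast<std::uint16_t>(sign | half_mantissa);
    }

    // Normal range: pack exponent and top 10 mantissa bits, then round.
    std::uint32_t half = (static_cast<std::uint32_t>(half_exponent) << 10) | (mantissa >> 13);
    const std::uint32_t remainder = mantissa & 0x1FFFu;
    if (remainder > 0x1000u || (remainder == 0x1000u && (half & 1u))) {
        ++half;  // a carry into the exponent field is the correct rounded result
    }
    return static_cast<std::uint16_t>(sign | half);
}

// Hypothetical CPU path: every transfer first allocates and fills a temporary
// fp16 buffer on the host; that buffer is what would then be copied to the device.
std::vector<std::uint16_t> stage_as_fp16(const float* src, std::size_t count) {
    std::vector<std::uint16_t> staging(count);  // the per-transfer host allocation
    for (std::size_t i = 0; i < count; ++i) {
        staging[i] = float_to_half_bits(src[i]);
    }
    return staging;
}
```

With the conversion done on the GPU, as the first bullet describes, the temporary `staging` buffer is no longer needed on the host.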

Test Size = number of 32-bit floats

Test Size: 100
	CPU Conversion H2D Time: 0.0088ms
	GPU Conversion H2D Time: 0.0148ms
	CPU Conversion D2H Time: 0.0077ms
	GPU Conversion D2H Time: 0.0098ms
Test Size: 1000
	CPU Conversion H2D Time: 0.0068ms
	GPU Conversion H2D Time: 0.0147ms
	CPU Conversion D2H Time: 0.0078ms
	GPU Conversion D2H Time: 0.0098ms
Test Size: 10000
	CPU Conversion H2D Time: 0.0125ms
	GPU Conversion H2D Time: 0.0187ms
	CPU Conversion D2H Time: 0.0140ms
	GPU Conversion D2H Time: 0.0123ms
Test Size: 100000
	CPU Conversion H2D Time: 0.0622ms
	GPU Conversion H2D Time: 0.0496ms
	CPU Conversion D2H Time: 0.0780ms
	GPU Conversion D2H Time: 0.0423ms
Test Size: 1000000
	CPU Conversion H2D Time: 0.4920ms
	GPU Conversion H2D Time: 0.3530ms
	CPU Conversion D2H Time: 0.6348ms
	GPU Conversion D2H Time: 0.3262ms
Test Size: 10000000
	CPU Conversion H2D Time: 8.5243ms
	GPU Conversion H2D Time: 3.5249ms
	CPU Conversion D2H Time: 7.6002ms
	GPU Conversion D2H Time: 3.1775ms

For small data sizes, converting on the GPU is slower than converting on the CPU and then transferring, but the difference is on the order of a few microseconds to a few tens of microseconds. For large data sizes, the GPU beats the CPU by a large margin (roughly 2.4x at 10,000,000 floats in both directions).
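
The crossover point and the large-size speedup can be read off the benchmark numbers with a little arithmetic. The sketch below hard-codes the timings reported above; the struct, enum, and function names are made up for illustration and are not part of OpenCV.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One row of the benchmark above: element count and the four timings in ms.
struct TransferTiming {
    std::size_t num_floats;
    double cpu_h2d_ms, gpu_h2d_ms, cpu_d2h_ms, gpu_d2h_ms;
};

// Timings copied verbatim from the benchmark output in this PR description.
const std::array<TransferTiming, 6> kTimings = {{
    {100,      0.0088, 0.0148, 0.0077, 0.0098},
    {1000,     0.0068, 0.0147, 0.0078, 0.0098},
    {10000,    0.0125, 0.0187, 0.0140, 0.0123},
    {100000,   0.0622, 0.0496, 0.0780, 0.0423},
    {1000000,  0.4920, 0.3530, 0.6348, 0.3262},
    {10000000, 8.5243, 3.5249, 7.6002, 3.1775},
}};

enum class Direction { HostToDevice, DeviceToHost };

// Ratio of CPU-conversion time to GPU-conversion time.
// A value above 1 means the GPU-conversion path was faster.
double speedup(const TransferTiming& t, Direction d) {
    const double cpu = (d == Direction::HostToDevice) ? t.cpu_h2d_ms : t.cpu_d2h_ms;
    const double gpu = (d == Direction::HostToDevice) ? t.gpu_h2d_ms : t.gpu_d2h_ms;
    assert(gpu > 0.0);
    return cpu / gpu;
}

// Smallest benchmarked size at which the GPU-conversion path wins,
// or 0 if it never does.
std::size_t first_gpu_win(Direction d) {
    for (const TransferTiming& t : kTimings) {
        if (speedup(t, d) > 1.0) return t.num_floats;
    }
    return 0;
}
```

On these numbers, `first_gpu_win` reports 100000 floats for host-to-device and 10000 floats for device-to-host, and `speedup` at 10000000 floats is a little above 2.4 for host-to-device and just under 2.4 for device-to-host. These are single measurements from one machine, so the exact crossover will likely differ on other hardware.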

force_builders=Custom,docs
buildworker:Custom=linux-4
docker_image:Custom=ubuntu-cuda:18.04

build_image:Custom Mac=openvino-2019r3.0
build_image:Custom Win=openvino-2019r3.0
test_opencl:Custom Win=OFF
test_modules:Custom Mac=dnn,java,python3

@YashasSamaga YashasSamaga force-pushed the cuda4dnn-fp-conversion branch from 11675c2 to f5fe63f Compare December 25, 2019 08:11
@YashasSamaga YashasSamaga changed the title cuda4dnn: improve host-device transfer performance [WIP] cuda4dnn: improve host-device transfer performance Dec 27, 2019
@YashasSamaga YashasSamaga force-pushed the cuda4dnn-fp-conversion branch 2 times, most recently from 23442dd to ab06364 Compare December 28, 2019 05:15
@YashasSamaga YashasSamaga changed the title [WIP] cuda4dnn: improve host-device transfer performance cuda4dnn: improve host-device transfer performance Dec 29, 2019
@YashasSamaga YashasSamaga force-pushed the cuda4dnn-fp-conversion branch from ab06364 to 01f97f1 Compare December 29, 2019 18:50

@alalek alalek left a comment

Looks good to me! Thank you 👍
