add CropAndResize layer for CUDA backend #16069
Conversation
Is there a test for the CropAndResize layer? I was not able to find one. Otherwise, the PR is ready and not WIP.
Have you tried Mask R-CNN with this patch? Last time, without CUDA CropAndResize, you used a 7700HQ and a GTX 1050, and your results for Inception v2 Mask R-CNN were:
Results below are from the same device configuration (the OS is Ubuntu 18.04 instead of Windows). The relative error is between the OCV CPU output and the CUDA output. The target used was … Differences between the previous setup and the current setup:
I am not sure why there is such a huge difference between the timings without CropAndResize back then and now. I am confused by the timing difference on the OCV CPU too. The compilers are different (MSVC 19.16 vs GCC 7.4), but it shouldn't be that dramatic?
Didn't you say that using the CPU fallback results in copying data back and forth between the GPU and the CPU? By the way: are there still many CPU fallbacks, or do the models listed above now run completely on the CUDA backend? It would be interesting to have some debug function that shows the backend forwarding path for a model, so that it's easy to see which layers are not on the same (CUDA) backend.
The only missing layer now is DetectionOutput (the input layer is skipped as it's a NOP for the models mentioned in this PR). Given the nature of the computations involved, I don't think it's worth porting it to the GPU fully. It might help to move part of DetectionOutput to the GPU and perform the final steps, such as NMS, on the CPU (the way it's done for the region layer). I'll have to dig through the code and think about it. Other optimizations are possible:
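For illustration, the "NMS on the CPU" step mentioned above can be sketched as plain greedy non-maximum suppression. This is a generic sketch with illustrative names and thresholds, not OpenCV's actual implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # suppress remaining boxes that overlap box i too much
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```

The appeal of keeping this step on the CPU is that it is branchy and data-dependent, which maps poorly to the GPU compared to the dense per-box score computation.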
This is what I get on Windows.
@YashasSamaga, thanks for this feature! There is no single-layer test, but there are tests for Faster R-CNNs with it.
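For readers unfamiliar with the layer, here is a minimal sketch of the computation CropAndResize performs: bilinearly sampling a normalized box from the input into a fixed-size output grid. The function name and box conventions below follow TensorFlow's op and are illustrative, not OpenCV's internal API (the sketch also assumes a single-channel image of at least 2x2):

```python
def crop_and_resize(image, box, out_h, out_w):
    """Bilinearly sample a crop of a 2D image into an out_h x out_w grid.

    `box` = (y1, x1, y2, x2) in normalized [0, 1] coordinates.
    """
    in_h, in_w = len(image), len(image[0])
    y1, x1, y2, x2 = box
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        # map output row i to a source y coordinate inside the box
        y = (y1 + (y2 - y1) * i / max(out_h - 1, 1)) * (in_h - 1)
        y0 = min(int(y), in_h - 2)
        dy = y - y0
        for j in range(out_w):
            x = (x1 + (x2 - x1) * j / max(out_w - 1, 1)) * (in_w - 1)
            x0 = min(int(x), in_w - 2)
            dx = x - x0
            # bilinear interpolation of the four neighbouring pixels
            top = image[y0][x0] * (1 - dx) + image[y0][x0 + 1] * dx
            bot = image[y0 + 1][x0] * (1 - dx) + image[y0 + 1][x0 + 1] * dx
            out[i][j] = top * (1 - dy) + bot * dy
    return out
```

Each output pixel is independent, which is why the layer parallelizes well on a GPU: one CUDA thread (or a few) per output element.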
@dkurt I just realized, after the notification about this PR's approval, that the optimizations added in PR #16097 to the resize kernel may be applicable here too. I would like to profile and check. If it doesn't cause trouble and you're OK with it, I can add a commit to this PR to optimize the kernel. Otherwise, I will make another PR with the optimization. Sorry for the last-minute change. EDIT: got around a 3.3x improvement, with around a 5.5 ms reduction in Inception v2 Mask R-CNN inference time on a GTX 1050. Should I add the commit here or make another PR?
@YashasSamaga, no problem. Let's push it here.
Force-pushed 04ac019 to ce13070
@dkurt @YashasSamaga I encountered a strange problem. When I tested with the following code, I found that the inference times of the model are different: time1 = 700 ms, time2 = 3 ms. And when I run this code on the CPU, the inference time is 300 ms. It's so weird. `vector<double> layersTimings2;` …
@JTzhuang The first forward pass has an initialization cost; you should ignore it. The subsequent forward passes should be faster.
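A generic way to separate the one-time initialization cost from steady-state latency is to run untimed warm-up passes first. This is a plain-Python sketch; `fn` is a stand-in for a call like `net.forward()`:

```python
import time

def timed_runs(fn, warmup=1, runs=5):
    """Call fn `warmup` times untimed, then return per-run times in ms."""
    for _ in range(warmup):
        fn()  # the first pass pays one-time initialization costs
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    return times
```

With the CUDA backend this matters even more than on the CPU, since the first `forward()` also initializes the GPU context and per-layer state.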
@YashasSamaga The first forward pass takes 800 ms, and the subsequent forward passes are about 700 ms.
@JTzhuang Please share the model you are using.
@YashasSamaga
@JTzhuang Sorry, I misread your initial post. The CUDA backend doesn't support getPerfProfile. It might be difficult to identify which operations correspond to which layer. CUDA allows profiling data to be annotated in the code using NVTX, but the CUDA backend currently doesn't have any profiler annotation capabilities. If you think this could be a useful feature, please open a feature request issue.
CPU: CPU output
CUDA: CUDA output
@YashasSamaga So you think the inference cost on the GPU is right? It's obvious that inference on the GPU is much slower than on the CPU.
Without …
With …
I tried to view your model in Netron. No layer seems to be using … Does your model have depthwise convolutions?
@YashasSamaga No. My network has two paths. One path is used to extract feature maps from the color image (color_input), and the other path extracts feature maps from the height image in the same way. ResNet-101 is used as the backbone in both paths, so there is no depthwise convolution in my model.
@YashasSamaga It's just like a Siamese network. I don't think it would get the same output for the two inputs without color_input set.
I added getPerfProfile support to the CUDA backend here. The first number in each line is the actual time it took for each layer. It appears that the CUDA backend is taking 42 ms to compute the outputs; the remaining ~500 ms is coming from somewhere else. This looks like a bug. Please open an issue.
I profiled using Nsight Systems. It looks like the CUDA backend is completely reinitialized for the second forward pass. The reinitialization is totally unnecessary; it's as if every pass were the first forward pass. I need to dig deeper and check, but I think the bug is not in the CUDA backend. It is probably in the general initialization logic.
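For context, `getPerfProfile` reports per-layer timings as tick counts. Converting them to milliseconds and per-layer fractions of the total might look like the sketch below; the tick frequency value used in the test is an illustrative stand-in for what `cv::getTickFrequency()` would return:

```python
def profile_to_ms(layer_ticks, tick_frequency):
    """Convert per-layer tick counts to (milliseconds, fraction-of-total)."""
    total = sum(layer_ticks)
    result = []
    for ticks in layer_ticks:
        ms = ticks / tick_frequency * 1000.0
        frac = ticks / total if total else 0.0
        result.append((ms, frac))
    return result
```

Summing the per-layer milliseconds and comparing against the wall-clock `forward()` time is exactly how the 42 ms vs ~500 ms gap above shows up: time spent outside the layers' compute does not appear in the profile.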
@YashasSamaga Thank you for your support. Waiting for your solution.
@JTzhuang Please open an issue so that this bug (if it is one) gets tracked.
@YashasSamaga I have opened a new issue.
…esize: add CropAndResize layer for CUDA backend
* add CropAndResize layer
* process multiple channels per iteration
This pull request adds CropAndResize support for the CUDA backend.
Timings (on GTX 1050):
^ more information can be found here