dnn: Implement graph (gapi) version of forward functions #17236
Yougmark wants to merge 1 commit into opencv:master
Conversation
Unroll the for loop of layers' forward function calls, using the G-API support for lambda functions provided in the CPU backend. Add a test that retrieves per-layer results.
This warning is not triggered by my patch. Thoughts? |
|
Hi! |
|
This is cool :) |
    {
        static GMatDesc outMeta(GMatDesc in, cv::Ptr<LayerData>)
        {
            return in;
Not sure if this method is correct...
The goal of outMeta is to let G-API know how to allocate the internal buffers to hold the results.
If your GLayer operation is a generic one, I assume the output dimensions change for DNN operators throughout the graph. I am wondering how it works now; it seems you need to update the returned GMatDesc based on the dimensions of the input (in) and the details of the layer (LayerData).
This is not relying on G-API to pass data between layers yet. I agree that it would be better implemented that way; please see my comments on the PR for further explanation of why I am not able to do that.
This implementation relies on the Net::Impl itself to pass the data (or data pointers) using its internal data structures. This GLayer operation only wraps the forward function call. I hope this answers your question.
    if (it == layers.end() || (it->second.id > lastld.id))
        break;
    i++;
}
Is it, hmm, correct?
What I see here is that we're building a linear pipeline of operators, but I assume real DL graphs are more complex than this?
I agree that more complex networks wouldn't be supported by this linear pipeline, but the original forward function calls the layers' forward functions in a linear way (in a for loop) as well. I suppose the dnn module hasn't been extended to those more complex networks yet.
I tested the existing networks in OpenCV by running opencv_dnn_test, and it seems okay so far: all network tests pass. This patch doesn't include code for testing the graph version of the forward functions in opencv_dnn_test.
The DNN module supports complex networks. Since the graphs are directed acyclic graphs (cyclic components like RNNs are single computation nodes), you can always find a topological ordering. This linear pipeline is just a topological ordering of the graph.
This works really well at least for the CUDA backend, since the individual operations in the pipeline fully saturate the GPU (in most cases) to an extent that you cannot extract much from concurrent execution (it's not possible in most cases since there is no spare capacity on the GPU). There are very few cases, like concatenation of small tensors, where individual concat kernels fail to fully utilize the GPU due to small work size, but parallelizing these has a negligible effect on the overall performance (and sometimes the overhead leads to a net loss).
There are also CUDA Graphs, which can be built from the computational graph. The CUDA runtime would take care of minimizing launch overhead, overlapping data transfers with computation, concurrent execution, and a whole lot of other optimizations. It's currently not possible to implement this since cuDNN doesn't support CUDA Graphs yet (there seems to be a plan to support them in cuDNN 8.0.x.x).
If cyclic subgraphs are treated as single nodes, this should indeed take care of all kinds of NNs. I wonder how it is implemented; are the layers fused into one?
You are right: in most cases, the gains from concurrent kernel execution (CKE) are negligible. Here are a few cases where CKE matters. High-end GPUs (e.g., TITAN X with 80 SMs) might still be underutilized. Also, when kernels are exiting, parts of the GPU can be idle. In addition, pipelined execution using G-API can enable concurrency between the CPU and GPU; for example, the CPU part that fetches new frames can be overlapped with GPU execution, although this isn't about CKE.
CUDA Graphs are interesting. It seems to me that it would be natural to add a G-API graph compiler that transforms G-API graphs into CUDA Graphs. However, this is far-future work, as you mentioned that cuDNN won't support CUDA Graphs until maybe 8.0.
Anyway, the pipelined execution of DNN using G-API needs further implementation, and I think I will continue working on it. This is my first time adding features to OpenCV, so any interest or guidance from the team (reviewers @dmatveev and @dkurt) would be a big motivation and help for me.
A side note: I experimented with graph scheduling in Darknet and have seen promising throughput improvements; that's why I think it is worthwhile (and even better) to implement the DNN+G-API combination in OpenCV. However, this patch by itself is not enough to show throughput benefits.
The cycles in an RNN can be expanded into N repetitive steps, but for efficiency the RNN (and all its derivatives) is implemented as a single node.
Currently, the CUDA backend only dumps operations into a stream (a pipeline) in forwardLayer, i.e., forwardLayer returns immediately after dumping the operations. All the waiting for the pipeline to finish execution happens at the end of forwardToLayer. The CUDA runtime optimizes the stream execution by minimizing launch overhead.
Currently, the pipeline is linear. I have attempted in the past to construct a graph using streams and events and let the CUDA runtime handle the scheduling of this graph. The overheads led to mostly no improvement or a net loss; my implementation was a rough proof-of-concept prototype. I eventually compared CUDA Graphs with a graph implemented using streams and events. The CUDA-Graph-based implementation fares better (with proxy kernels for cuDNN) than the manually constructed graph, but unfortunately cuDNN has issues with CUDA Graphs.
I read the code in your cuda4dnn-concurrent branch. If I understand correctly, per-layer streams are created to handle asynchronous pipelined execution, using events to enforce dependencies between layers, as you mentioned. I have a couple of questions. Would the processing of different inputs be serialized, since each layer has only one event recorder? For example, the processing of image (x+1) must wait for the processing of image (x) to finish at each layer. Another question is about forwardAsync: is the returned AsyncArray used to poll for results asynchronously? I'm not familiar with AsyncPromise. Thanks!
Can I see your evaluation results if you have some already?
I suppose using events supports generic networks, including non-linear ones, but
I wonder if an alternative approach using per-input streams would work better for linear networks, since events could be avoided. For non-linear networks, a graph scheduling framework like G-API would then be required to enforce dependencies between nodes/layers.
    {
        if (!ld->flag)
            this->forwardLayer(*ld);
    });
Hi, that's a good question. TL;DR: the current implementation (this patch) does not benefit from using G-API. I expect using G-API to improve DNN throughput in the future, with this patch as a first step.
I am thinking about this patch as the first step in supporting graph scheduling of DNN in OpenCV (I'll talk more about this below). Graph scheduling helps improve throughput through pipelining and parallel execution, which I'm not sure G-API has implemented (I haven't looked into it yet).
I have concerns regarding always using G-API when it's available. I suppose G-API comes with overheads compared to the for loop that iterates over the layers' forward function calls. That's why I didn't replace the original implementation.
More on supporting graph scheduling for DNN using G-API: I switched to this simpler approach as a starting point. I'm happy to hear and would appreciate any feedback from the team.
Thanks, |
|
@Yougmark, thanks! Can you please start with a POC and show the improvement in throughput? You may try to use the perf tests engine: https://github.com/opencv/opencv/wiki/HowToWritePerfTests + https://github.com/opencv/opencv/wiki/HowToUsePerfTests |
This patch doesn't improve throughput, so I'm not able to write a perf test to show that. Sorry if my previous reply caused any confusion. |
|
@Yougmark thanks for your extended reply. Please ask questions if you need help implementing your backends with G-API. |
@dmatveev Thanks! I'm looking into |
|
The discussions in #16306 might be relevant. |
|
@Yougmark I see you're pretty familiar with our framework internals and understand the concept of islands and their execution pretty well. That's cool, I really appreciate it! Let me summarize it a bit. We have two graph models, a
This structure is heterogeneous by default and we have two levels of execution:
There are two modes of execution defined at (2): the regular mode and the streaming mode.
The regular mode is so far serial, but @anton-potapov has his TBB-based executor coming in #17170. This executor will enable graph-level parallelism for the regular mode.
The streaming mode is threaded, as you see, and implements graph-level pipelining. The pipelining makes the most sense when different stages of your pipeline can run in parallel (that is, those are heterogeneous too). It shows improvement even for linear structures.
And here comes the main question: what kind of performance effect do you really want to achieve on your target system? |
|
Thanks for your summary, @dmatveev! It is very helpful. To answer your question,
I want to improve the throughput of neural networks that handle multiple data streams, by exploiting parallelism on GPUs through executing multiple kernels simultaneously, and intra-node parallelism on CPUs through multi-threading. So I need finer-grained pipelined execution between nodes, compared to the existing approach. Another goal of mine is less relevant here: I also want to use OpenCV in real-time systems that require managing computation workloads as real-time tasks, using the SCHED_DEADLINE scheduler in Linux or other real-time operating systems (RTOSs). One G-API node per real-time task is the granularity I want, which takes at least one thread to represent each node. These two goals are orthogonal, and I think the first might be more interesting to the OpenCV team. I'm going to work on the executor for a while. Do you use a mailing list or a chat app like Slack where I can ask questions during development? |
Thanks, it is clearer now. Sounds challenging, though :) Can modern GPUs execute multiple kernels simultaneously? I'm just not following the SOTA there.
The proper, and so the G-API-ish, way to do that is to write a new backend and map the graph structure to your own execution model. Then the TBB executor could really help, but you'll need to adapt it within your backend (by default it will be used at a higher level, to schedule islands). Having the stuff consolidated within your own backend also solves the memory management problem.
That's really interesting :) Please keep us informed of your progress. While this was not a goal G-API was designed for, it sounds as if it maps to the G-API model and nature pretty well. |
|
@dmatveev Thanks for your suggestions! I agree that a backend for my execution model is the way to go. I'm not familiar with TBB, though, and after skimming its documentation I have a couple of questions:
If not, I might need to implement non-busy waiting and persistent threads on my own. Regarding the question about simultaneous execution on GPUs: yes, you can look up concurrent kernel execution, which has been supported since NVIDIA's Fermi architecture; similar concurrency is supported on AMD GPUs as well. |
|
@Yougmark sorry, I've missed your comment. @anton-potapov can you have a look at the questions? |
|
@Yougmark Could you rebase the patch to fix the merge conflicts? |
|
@Yougmark @anton-potapov Have you had a chance to finish the PR? Is it still relevant? |
|
Hi! There is still no clear understanding of whether this implementation can give performance/memory optimizations. |
|
Closing this PR for now. |
Unroll the for loop of layers' forward function calls, using the gapi
support for lambda functions provided in the CPU backend.
Add a test that retrieves per-layer results.
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.