
dnn: Implement graph (gapi) version of forward functions#17236

Closed
Yougmark wants to merge 1 commit into opencv:master from Yougmark:gapi_forward_dnn

Conversation

@Yougmark

@Yougmark Yougmark commented May 7, 2020

Unroll the for loop of layers' forward function calls, using the gapi
support for lambda functions provided in the CPU backend.

Add a test that retrieves per-layer results.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under OpenCV (BSD) License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@Yougmark
Author

Yougmark commented May 7, 2020

57>C:\build\precommit_windows64\opencv\modules\dnn\src\dnn.cpp(355): warning C4244: 'argument': conversion from 'double' to 'int', possible loss of data [C:\build\precommit_windows64\build\modules\dnn\opencv_dnn.vcxproj]

This warning is not triggered by my patch. Thoughts?

@dkurt
Member

dkurt commented May 8, 2020

Hi!
Can you please describe what benefits using G-API in dnn gives us? Maybe it makes sense to make G-API optional and, if it's available, always use it without a separate method?

@dmatveev
Contributor

dmatveev commented May 8, 2020

This is cool :)

{
    static GMatDesc outMeta(GMatDesc in, cv::Ptr<LayerData>)
    {
        return in;
Contributor


Not sure if this method is correct...

The goal of outMeta is to let G-API know how to allocate the internal buffers to hold the results.

If your GLayer operation is a generic one, I assume the output dimensions change for DNN operators throughout the graph. I am wondering how it works now, but it seems you need to compute the returned GMatDesc based on the dimensions of the input (in) and the details about the layer (LayerData).

Author


This is not relying on G-API to pass data between layers, yet. I agree that it is better implemented that way. Please see my comments on the PR for further explanation of why I am not able to do that.

This implementation relies on the Net::Impl itself to pass the data (or data pointers) using its internal data structures. This GLayer operation only wraps the forward function call. I hope this answers your question.

if (it == layers.end() || (it->second.id > lastld.id))
    break;
i++;
}
Contributor


Is it, hmm, correct?

What I see here is that we're building a linear pipeline of operators, but I assume real DL graphs are more complex than this?

Author

@Yougmark Yougmark May 8, 2020


I agree that more complex networks wouldn't be supported by this linear pipeline, but the original forward function also calls the layers' forward functions linearly (in a for loop). I suppose the dnn module hasn't been extended to those more complex networks yet.

I tested the existing networks in OpenCV by running opencv_dnn_test, and it seems okay so far --- tests of all networks pass. This patch doesn't include the code for testing the graph version of forward functions in opencv_dnn_test.

Contributor

@YashasSamaga YashasSamaga May 9, 2020


The DNN module supports complex networks. Since the graphs are directed acyclic graphs (cyclic components like RNNs are single computation nodes), you can always find a topological sorting. This linear pipeline is just a topological sorting of the graph.

This works really well at least for the CUDA backend, since the individual operations in the pipeline fully saturate the GPU (in most cases) to an extent that you cannot extract much from concurrent execution (it's not possible in most cases since there is no extra space on the GPU). There are very few cases, like concatenation of small tensors, where individual concat kernels fail to fully utilize the GPU due to the small work size, but parallelizing these has a negligible effect on overall performance (and sometimes the overhead leads to a net loss).

There are also CUDA graphs, which can be built from the computational graph. The CUDA runtime would take care of optimizing launch overhead, overlapping data transfers with computation, concurrent execution, and a whole lot of other optimizations. It's currently not possible to implement this since cuDNN doesn't support CUDA graphs yet (there seems to be a plan to support them in cuDNN 8.0.x.x).

Author


If cyclic subgraphs are treated as single nodes, this should indeed take care of all kinds of NNs. I wonder how it is implemented; are layers fused into one?

You are right: in most cases, concurrent kernel execution (CKE) is negligible. Here I mention a few cases where CKE matters. High-end GPUs (e.g., TITAN X with 80 SMs) might still be underutilized. Also, when kernels are exiting, parts of the GPU can be idle. In addition, pipelined execution using G-API can enable concurrency between CPU and GPU; for example, the CPU part that fetches new frames can be overlapped with GPU execution, although this isn't about CKE.

CUDA graphs are interesting. It seems to me that it would be natural to add a G-API graph compiler that transforms G-API graphs into CUDA graphs. However, this is far-future work, since, as you mentioned, cuDNN won't support CUDA graphs until maybe 8.0.

Anyway, the pipelined execution of DNN using G-API needs further implementation, and I think I will continue working on it. This is my first time adding features to OpenCV, so any interest or guidance from the team (reviewers @dmatveev and @dkurt) would be a big motivation and help for me.

A side note: I experimented with graph scheduling in Darknet and saw promising throughput improvements; that's why I think it is worthwhile (and even better) to implement the DNN+G-API combination in OpenCV. However, this patch by itself is not enough to show throughput benefits.

Contributor


The cycles in an RNN can be expanded as N repetitive steps. For efficiency, an RNN (and all its derivatives) is better implemented as a single node.

Currently, the CUDA backend only dumps operations into a stream (a pipeline) in forwardLayer, i.e., forwardLayer returns immediately after dumping the operations. All the waiting for the pipeline to finish execution happens at the end of forwardToLayer. The CUDA runtime optimizes the stream execution by minimizing the launch overhead.

Currently, the pipeline is linear. I have attempted in the past to construct a graph using streams and events and let the CUDA runtime handle the scheduling of this graph. The overheads led to mostly no improvement or a net loss. My implementation was rough; it was an attempt at a proof-of-concept prototype. I eventually compared CUDA graphs with a graph implemented using streams and events. The CUDA graph-based implementation fares better (with proxy kernels for cuDNN) than the manually constructed graph, but unfortunately, cuDNN has issues with CUDA graphs.

Author


I read the code in your cuda4dnn-concurrent branch. If I understand correctly, per-layer streams are created to handle asynchronous pipelined execution, using events to enforce dependencies between layers, as you mentioned. I have a couple of questions. Would the processing of different inputs be serialized, since each layer has only one event recorder? For example, the processing of image (x+1) must wait for the processing of image (x) to finish at each layer. Another question is about forwardAsync: is the returned AsyncArray used to poll for results asynchronously? I'm not familiar with AsyncPromise. Thanks!

Can I see your evaluation results if you have some already?

I suppose using events supports generic networks, including non-linear ones. But I wonder if an alternative approach using per-input streams would work better for linear networks, as events can be avoided. Then, for non-linear networks, a graph scheduling framework like G-API is required to enforce node/layer dependencies.

{
    if (!ld->flag)
        this->forwardLayer(*ld);
});
Contributor


👍 . @TolyaTalamanov must be happy now. :D

@Yougmark
Author

Yougmark commented May 8, 2020

Hi!
Can you please describe what benefits using G-API in dnn gives us? Maybe it makes sense to make G-API optional and, if it's available, always use it without a separate method?

Hi,

That's a good question.

TL;DR: The current implementation (this patch) does not benefit from using G-API. I expect using G-API to improve DNN throughput in the future, with this patch as a first step.
G-API adds overhead compared to the for loop in the forwardToLayer function, so it might be good to keep both implementations separate for now.

I am thinking about this patch as the first step in supporting graph scheduling of DNN in OpenCV (I'll talk more about this below). Graph scheduling helps improve throughput through pipelining and parallel execution, which I'm not sure G-API has implemented (I haven't looked into GStreamingExecutor yet). So I think the answer to the first question is that using G-API in DNN has the potential for throughput improvement through graph scheduling. In particular, this is more applicable to DNNs that do not rely on historical results, for example, object detection DNNs.

I have concerns regarding always using G-API when it's available. I suppose G-API comes with overheads compared to the for loop that iterates over the layers' forward function calls. That's why I didn't replace the original Net::forward function, besides the intention of making it easier to get this PR accepted :).

More on supporting graph scheduling for DNN using G-API:
This for-loop-unrolling approach is not my original plan. I was thinking about adding dnn APIs to gapi, as with imgproc. I wanted to start with a CUDA backend because I have an NVIDIA GPU. However, the CUDABackendNode in the dnn module is not exported, which prevents its usage in the gapi module. Also, building a DNN using the existing import functions (such as readNetFromDarknet) would cause circular dependencies between the dnn and gapi modules. I figured this plan (of having a dnn API and a CUDA backend in gapi) won't work without modifying the dnn module a lot.

Therefore, I switched to this simpler approach as a starting point. I'm happy to hear and would appreciate any feedback from the team.

Thanks,
Ming

@dkurt
Member

dkurt commented May 8, 2020

@Yougmark, thanks! Can you please start with some POC and show the improvement in throughput? You may try to use perf tests engine: https://github.com/opencv/opencv/wiki/HowToWritePerfTests + https://github.com/opencv/opencv/wiki/HowToUsePerfTests

@Yougmark
Author

Yougmark commented May 8, 2020

@Yougmark, thanks! Can you please start with some POC and show the improvement in throughput? You may try to use perf tests engine: https://github.com/opencv/opencv/wiki/HowToWritePerfTests + https://github.com/opencv/opencv/wiki/HowToUsePerfTests

This patch doesn't improve throughput, so I'm not able to write a perf test to show that. Sorry if my previous reply caused any confusion.
Would you suggest that I wait until I can show throughput improvement to come back on this PR?

@dmatveev
Contributor

@Yougmark thanks for your extended reply. Please ask questions if you need help implementing your backends with G-API.

@Yougmark
Author

@Yougmark thanks for your extended reply. Please ask questions if you need help implementing your backends with G-API.

@dmatveev Thanks! I'm looking into GStreamingExecutor. It seems to support pipelined execution but at a one-island-per-stage granularity. For now, I still want to use GCPUBackend for decomposing DNNs into G-API graphs (using this patch), and to enable pipelined execution at a one-layer-per-stage (i.e., one-node-per-stage) granularity, I think I need to implement a new executor. Is this a fine way to move forward? Or would you suggest a different approach for decomposing DNN graphs or enabling finer-grained pipelined execution?

@YashasSamaga
Contributor

The discussions in #16306 might be relevant.

@dmatveev
Contributor

dmatveev commented May 15, 2020

@Yougmark I see you're pretty familiar with our framework internals and understand the concept of islands and their execution pretty well. That's cool, I really appreciate it!

Let me summarize it a bit. We have two graph models, a GModel and a GIslandModel:

  • GModel is the compiler's primary internal representation of the graph, we use it to validate the graph, understand its structure, assign it to backends, and so on.
  • GIslandModel is a projection from GModel where we fuse GModel nodes into clusters ("islands") based on their backend affinity.

This structure is heterogeneous by default and we have two levels of execution:

  1. The lower level, where backends execute their subgraphs (islands). Every backend is in charge of providing its own execution logic under the hood.
  2. The upper level, where the graph of islands (and so, backends) is executed. The upper level is backend-neutral. G-API implements common logic to schedule the graph of islands, and backends are asked to run their parts in their own manner when G-API asks them.

There are two modes of execution defined at (2): the regular mode (the product of compile()), and the streaming mode (the product of compileStreaming()).

The regular mode is so far serial, but @anton-potapov has his TBB-based executor coming in #17170. This executor will enable graph-level parallelism for the regular (GExecutor) path. Graph-level parallelism makes sense when we have branches in the graph which can be scheduled in parallel, which applies perfectly to an island model if there's more than one island. This parallelism wouldn't give much if our graph is linear (a->b->c).

The streaming mode is threaded, as you see, and implements graph-level pipelining. Pipelining makes the most sense when different stages of your pipeline can run in parallel (that is, they are heterogeneous too). It shows improvement even for linear structures like a->b->c if and only if a, b, and c can run in parallel with no effect on each other (e.g., decoding with a hardware decoder, then running inference on a GPU, then running post-processing/analysis/etc. on a CPU).

And here comes the main question: what kind of performance effect do you really want to achieve on your target system?

@Yougmark
Author

Thanks for your summary, @dmatveev! It is very helpful. To answer your question,

And here comes the main question: what kind of performance effect do you really want to achieve on your target system?

I want to improve the throughput of neural networks that handle multiple data streams, by exploring parallelism on GPUs through executing multiple kernels simultaneously, and intra-node parallelism on CPUs through multi-threading. So I need finer-grained pipelined execution between nodes, compared to the existing GStreamingExecutor that supports pipelined execution between islands. #17170 seems very relevant, and I'll look into that.

Another goal of mine is not as relevant: I also want to use OpenCV in real-time systems that require managing computation workloads as real-time tasks, using the SCHED_DEADLINE scheduler in Linux or other real-time operating systems (RTOSs). One G-API node per real-time task is the granularity I want. That requires at least one thread to represent each node.

These two goals are orthogonal, and I think the first might be more interesting to the OpenCV team. I'm going to work on the executor for a while. Do you use a mailing list or chat app like Slack where I can ask questions during development?

@dmatveev
Contributor

I want to improve the throughput of neural networks that handle multiple data streams, by exploring parallelism on GPUs through executing multiple kernels simultaneously, and intra-node parallelism on CPUs through multi-threading. So I need finer-grained pipelined execution between nodes, compared to the existing GStreamingExecutor that supports pipelined execution between islands. #17170 seems very relevant, and I'll look into that.

Thanks, it is clearer now. Sounds challenging, though :) Can modern GPUs execute multiple kernels simultaneously? I'm just not following the SOTA there. GIslandExecutor could still work for this if we forbid fusing the relevant (DNN) operations into the same island. It is not supported out of the box, but the island fusion pass can be extended easily for that. There are some downsides, though:

  1. GIslandExecutor will create its own thread for every island. If you don't let the network fuse into a single island, you'll get as many threads as your network has layers, and it will obviously die of oversubscription.
  2. Our islands interoperate with CPU data so far (that is, cv::Mat, std::vector<>, and so on), which may also kill performance if you're running a discrete GPU. This model assumes our data is always "in place" and never "remote". Support for exchanging remote data between islands is WIP (you might find some RMat PRs open here). This problem can be worked around if you start exchanging GOpaque<>s of whatever remote data between your "virtual" nodes and, e.g., keep the memory handles within, but changing this affects the API of every operation and is not quite portable (if you put a platform-specific handle into a GOpaque<>).

The proper, and so G-API-ish, way to do that is to write a new backend and map the graph structure to your own execution model. Then the TBB executor could really help, but you'll need to adapt it within your backend (by default it will be used at a higher level, to schedule islands). Having the stuff consolidated within your own backend also solves the memory management problem.

Another goal of mine is not as relevant: I also want to use OpenCV in real-time systems that require managing computation workloads as real-time tasks, using SCHED_DEADLINE scheduler in Linux or other real-time operating systems (RTOSs). One G-API node per real-time task is the granularity I want. That takes at least one thread to represent each node.

That's really interesting :) Please keep us informed of your progress. While this was not a goal G-API was designed to support, it sounds like it maps to the G-API model and nature pretty well.

@Yougmark
Author

@dmatveev Thanks for your suggestions! I agree that a backend for my execution model is the way to go. I'm not familiar with TBB, though, and after skimming the TBB docs, I have a couple of questions:

  • does TBB support non-busy (suspension-based) waiting? E.g., is the pop function of concurrent_bounded_queue or concurrent_queue non-busy, waiting for a signal that new data is available?
  • can I have persistent threads managed by OS schedulers, instead of tasks managed by the TBB task scheduler?

If not, I might need to implement non-busy waiting and persistent threads on my own.

Regarding the question about simultaneous execution on GPUs: yes, you can look up concurrent kernel execution, which has been supported since the NVIDIA Fermi architecture, and similar concurrency is supported on AMD GPUs as well.

@dmatveev
Contributor

dmatveev commented Jun 3, 2020

@Yougmark sorry I've missed your comment.

@anton-potapov can you have a look at the questions?

@asmorkalov
Contributor

@Yougmark Could you rebase the patch to fix the merge conflicts?
@anton-potapov Friendly reminder.

@asmorkalov
Contributor

@Yougmark @anton-potapov Do you have a chance to finish the PR? Is it still relevant?

@asmorkalov
Contributor

@Yougmark @dmatveev @dkurt Is the patch still relevant? Do you plan to work on it in the meantime?

@dkurt
Member

dkurt commented Nov 20, 2020

Hi! There is still no clear understanding of whether this implementation can give performance/memory optimizations.

@Yougmark
Author

Closing this PR for now.
