G-API: Fix streaming hangs for ConstSource & Improve troubleshooting logger messages#21567
G-API: Fix streaming hangs for ConstSource & Improve troubleshooting logger messages#21567alalek merged 4 commits intoopencv:4.xfrom
Conversation
| const cv::Rect &roi, | ||
| const IECallContext& ctx) { | ||
| if (ctx.uu.params.device_id.find("GPU") != std::string::npos) { | ||
| if (ctx.uu.params.device_id.find("GPU") != std::string::npos && |
There was a problem hiding this comment.
to delete: not from this issue
| #include <tuple> | ||
|
|
||
| #include <opencv2/imgproc.hpp> | ||
| //#include <opencv2/highgui.hpp> // CommandLineParser |
There was a problem hiding this comment.
to delete: not from this issue
| if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg)) | ||
| { | ||
| // (3) Wait until all passed task are done. | ||
| GAPI_LOG_INFO(nullptr, "Got EndOfStream - wait until all passed task are done"); |
There was a problem hiding this comment.
Troubleshooting & Logging BKMs:
INFO level in log here because pipeline is intending to stop: it's once-time event, not need to worry about performance due to printout here
| || cv::util::holds_alternative<Stop>(cmd)); | ||
| if (cv::util::holds_alternative<Stop>(cmd)) | ||
| { | ||
| GAPI_LOG_INFO(nullptr, "Early stop for: " << emitter->name() << " detected"); |
There was a problem hiding this comment.
Troubleshooting & Logging BKMs:
Log incoming event up right before process it (possible stuck on push())
| GAPI_ITT_STATIC_LOCAL_HANDLE(emitter_push_hndl, "emitter_push"); | ||
|
|
||
| // Now start emitting the data from the source to the pipeline. | ||
| GAPI_LOG_INFO(nullptr, "starting: " << emitter->name()); |
There was a problem hiding this comment.
Troubleshooting & Logging BKMs:
Log every busy-up worker tread right before starting: to link further logger messages by thread number with actual actor by name:
DEBUG(:1)@timestamp1 starting: VideoEmitter <--- mark thread :1 as VideoEmitter
DEBUG:2@timestamp2 blah blah blah
DEBUG:3@timestamp2 blah blah blah
...
DEBUG(:1)@timestamp2 blah blah blah <---belong to VideoEmitter
| void start() { | ||
| GAPI_LOG_INFO(nullptr, "try to start syncActorThread for emitters count: " << | ||
| m_synchronized_emitters.size()); | ||
| if (m_synchronized_emitters.size() != 0) { |
There was a problem hiding this comment.
Troubleshooting & Logging BKMs:
print out COUNT of object BEFORE condition serve TWO purposes:
-to check actual objects count :)
-to find out is the next condition true or false
So we use only one logger row and do not require to logging approaches for different if-branches
TolyaTalamanov
left a comment
There was a problem hiding this comment.
Honestly, from your description I didn't get how it's possible:
but stuck on extracting cmd from other empty queue in rewindToStop
What is the empty queue are u referring to?
|
|
||
| if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg)) | ||
| { | ||
| // (3) Wait until all passed task are done. |
There was a problem hiding this comment.
Why it's needed to remove comment?
There was a problem hiding this comment.
Because comment message is exposed in logger message below - no need to duplicate it in code
There was a problem hiding this comment.
I thought, that we've agreed that logging must be considered separately from this PR
| const cv::GCompileArgs &, | ||
| const std::vector<ade::NodeHandle> &); | ||
|
|
||
| const char* name() const override { return "GStreamingIntrinExecutable"; }; |
There was a problem hiding this comment.
What about backends that are placed in different submodules?
There was a problem hiding this comment.
+1. These quite invasive changes must be dropped from this PR.
There was a problem hiding this comment.
What's about the following proposals: https://github.com/opencv/opencv/pull/21567/files#r801291452 ?
There was a problem hiding this comment.
I wouldn't make hasty decisions besides it doesn't apply to the problem you're trying to fix
| // Obtain next value from the emitter | ||
| virtual bool pull(GRunArg &) = 0; | ||
| virtual ~GIslandEmitter() = default; | ||
| virtual const char* name() const = 0; |
There was a problem hiding this comment.
Adding the new pure virtual method might broke the code which is being developed in other submodules
There was a problem hiding this comment.
+1. All private modules will break after these changes.
There was a problem hiding this comment.
| void run(std::vector<InObj> &&input_objs, | ||
| std::vector<OutObj> &&output_objs) override; | ||
|
|
||
| const char* name() const override { return "GraphMetaExecutable"; }; |
There was a problem hiding this comment.
Can it be generic for all backend?
typeid(*this).name()
or something this :)
There was a problem hiding this comment.
Is it really required for this particular fix??
If it is about error handling and logging, I believe we need to introduce these changes somewhere later
There was a problem hiding this comment.
@TolyaTalamanov as it was discussed offline - it it possible to put non-pure-virtual function in base class but with some default value "unspecified_executable" which has no necessity in overriding in descendant classes an will not brake anything.
There was a problem hiding this comment.
@dmatveev
I see you point, but from my perspective logging & troubleshooting must serve purposes to resolve BUGs & simplify further RootCauseAnalysis:
It should be avoided to put log in spontaneous places spontaneously, thus current logger messages related to investigation & troubleshooting for existing problem. Also it allows to find & investigate further problem appearance too
These are the reasons why I put it here and why this PR is called Fix streaming hangs for ConstSource & Improve troubleshooting logger messages
There was a problem hiding this comment.
You can do that but please use a less invasive method instead.
@AsyaPronina was able to add ITT tracing for all islands uniformly what works fine across platforms.
Here it is https://github.com/opencv/opencv/blob/4.x/modules/gapi/src/compiler/gislandmodel.hpp#L297
| auto &q2 = ade::util::value(qit); | ||
| if (this_id == id2) continue; | ||
| // Will draining all queue & waiting for stop cmd for all queues except own queue | ||
| size_t expected_stop_count = std::count_if(in_queues.begin(), in_queues.end(), [] (const Q* ptr) { |
There was a problem hiding this comment.
Could you remind me how queue can be nullptr, please?
There was a problem hiding this comment.
| // Collector thread never handles the inputs directly | ||
| // (collector's input queues are always produced by | ||
| // islands in the graph). | ||
| GAPI_LOG_INFO(nullptr, "try on theory..."); |
There was a problem hiding this comment.
Didn't get the log msg
There was a problem hiding this comment.
Comment on top
// In theory, the CNST should never reach here.
// Collector thread never handles the inputs directly
// (collector's input queues are always produced by
// islands in the graph).
I have suspicions about comments in code starting from "In theory..." therefor I suppose a weak in program logic in this particular place - so it should be wrapped in logger at least.
I agree that "try on theory..." is a bad message and must be rephrased - @dmatveev , @TolyaTalamanov could you help me to produce a correct message here? because i do not understand "In theory..." comment
There was a problem hiding this comment.
therefor I suppose a weak in program logic in this particular place - so it should be wrapped in logger at least.
Let me explain what it means. The code is written here in assumption that something never happens. This assumption is indeed true at the point when the code is written, but things change and years later someone can change something and break old contracts. We've had these examples already
There was a problem hiding this comment.
so the point is not to tell in comments anything for sure. This shouldn't be relevant to the logs you write, though.
| out_iter->ready = true; | ||
| m_postIdx.erase(it); // Drop the link from the cache anyway | ||
| if (out_iter != m_postings[out_idx].begin()) | ||
| bool stop_triggered = false; |
There was a problem hiding this comment.
It should be under mutex as well, right?
There was a problem hiding this comment.
So, we only need this flag for logging?
There was a problem hiding this comment.
it should be not under mutex protection because it is not shared resource but stack-variable
There was a problem hiding this comment.
Yes, we do: Stop event is rare in comparison with regular event and require special processing - thus it is worth to know about it.
This slight refactoring is used to make printout GAPI_LOG_INFO without mutex protection to prevent pessimization
| got_cmd = q2->try_pop(cmd); | ||
| if (got_cmd && cv::util::holds_alternative<Stop>(cmd)) { | ||
| got_stop_count ++; | ||
| GAPI_LOG_INFO(nullptr, "got stop from id: " << id2 << |
There was a problem hiding this comment.
Do we really need such kind of logs?
There was a problem hiding this comment.
it exposes information about how many queues responded by stop and which are not stopped yet
dmatveev
left a comment
There was a problem hiding this comment.
It seems this PR has more changes than really needed
| void run(std::vector<InObj> &&input_objs, | ||
| std::vector<OutObj> &&output_objs) override; | ||
|
|
||
| const char* name() const override { return "GraphMetaExecutable"; }; |
There was a problem hiding this comment.
Is it really required for this particular fix??
If it is about error handling and logging, I believe we need to introduce these changes somewhere later
| if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg)) | ||
| { | ||
| // (3) Wait until all passed task are done. | ||
| GAPI_LOG_INFO(nullptr, "Got EndOfStream - wait until all passed task are done"); |
| // (3) Wait until all passed task are done. | ||
| GAPI_LOG_INFO(nullptr, "Got EndOfStream - wait until all passed task are done"); | ||
| m_reqPool->waitAll(); | ||
| GAPI_LOG_INFO(nullptr, "all done"); |
There was a problem hiding this comment.
so would it spam console by default here?
There was a problem hiding this comment.
As i supposed EOS is a rare message (isn't it?) and must have been faced only once.
Are my assumptions correct? If yes - no spam here, only countable output
There was a problem hiding this comment.
EOS may happen multiple times especially in start/stop scenarios.
| const cv::GCompileArgs &, | ||
| const std::vector<ade::NodeHandle> &); | ||
|
|
||
| const char* name() const override { return "GStreamingIntrinExecutable"; }; |
There was a problem hiding this comment.
+1. These quite invasive changes must be dropped from this PR.
| // Obtain next value from the emitter | ||
| virtual bool pull(GRunArg &) = 0; | ||
| virtual ~GIslandEmitter() = default; | ||
| virtual const char* name() const = 0; |
There was a problem hiding this comment.
+1. All private modules will break after these changes.
| GAPI_LOG_INFO(nullptr, "id: " << this_id << ", queues count: " << in_queues.size() << | ||
| ", expected stop msg count: " << expected_stop_count); | ||
| size_t got_stop_count = 0; | ||
| while(got_stop_count < expected_stop_count) { |
| auto &q2 = ade::util::value(qit); | ||
| if (this_id == id2) continue; | ||
|
|
||
| GAPI_LOG_INFO(nullptr, "drain next id: " << id2); |
There was a problem hiding this comment.
I believe these logs are cryptic for those who'll read it. Also, why drain?
There was a problem hiding this comment.
this logs would be read with
"rewindToStop - id: XXX, queues count: YYYY, expected stop msg count: ZZZ"
"rewindToStop - drain next id: 0"
"rewindToStop - drain next id: 1"
"rewindToStop - got stop from id: 1 - (1/ZZZ)"
"rewindToStop - drain next id: 2"
"rewindToStop - got stop from id: 2 - (2/ZZZ)"
"rewindToStop - drain next id: 3"
...
"rewindToStop - drain next id: 0"
"rewindToStop - got stop from id: 0 - (ZZZ/ZZZ)"
together
and represent information about which queue is stopped but which is not yet
(It must have got actual name per each queue instead of id number in further logging improvements)
| const auto &stop = cv::util::get<Stop>(m_cmd[id]); | ||
| if (stop.kind == Stop::Kind::CNST) | ||
| { | ||
| GAPI_LOG_INFO(nullptr, "Got `Stop::Kind::CNST`") |
There was a problem hiding this comment.
I believe we better introduce << for our enums instead. But, again, it should be introduced separately & reviewed well.
There was a problem hiding this comment.
agree,
not in this PR.
without << this message may be rephrased as "Got Stop from Constant Source" (is it correct?)
| // Collector thread never handles the inputs directly | ||
| // (collector's input queues are always produced by | ||
| // islands in the graph). | ||
| GAPI_LOG_INFO(nullptr, "try on theory..."); |
| @@ -801,46 +849,56 @@ class StreamingOutput final: public cv::gimpl::GIslandExecutable::IOutput | |||
| GAPI_ITT_STATIC_LOCAL_HANDLE(outputs_post_hndl, "StreamingOutput::post"); | |||
There was a problem hiding this comment.
I believe the fix was in rewind already. So why do we need these changes now?
There was a problem hiding this comment.
Sorry, could you clarify with fix you meant here? because you put comment on GAPI_ITT_STATIC_LOCAL_HANDLE(outputs_post_hndl, "StreamingOutput::post"); line without modifications
There was a problem hiding this comment.
The question is why do we need all these below changes
|
To get rid of further discussion - all troubleshooting info were reverted to original state. |
| // because it had got stop message before rewind was called | ||
| expected_stop_count--; | ||
| } | ||
| GAPI_LOG_DEBUG(nullptr, "id: " << this_id << ", queues count: " << in_queues.size() << |
There was a problem hiding this comment.
Not sure that it's can somehow help user
There was a problem hiding this comment.
This is still a debug message, we have some already so I'd not care about this one
There was a problem hiding this comment.
Got it, then @sivanov-work you can ignore these comments :)
G-API: Fix streaming hangs for ConstSource & Improve troubleshooting logger messages
Sometimes by using ConstSource (detected on scenario with Roi generator & VPL Source with VPP preproc) it is possible to get a stuck on end-of-pipeline:
-VideoEmitter resends Stop to ConstEmitter on end-of-stream
-Executor got stop from VideoEmitter and also tries to rewindQueue to extract Stop from other sources (ConstEmiter) but stuck on extracting cmd from other empty queue in rewindToStop
-Meantime ConstEmitter has got Stop from VideoEmitter and tries to push it on queue and got stuck, because this queue is not processed yet on rewindStop, which had stucked on pop from other empty queue
Proposed solution:
To modify rewindToStop to make it unblocking: It analyzed valid queues count (minus 1 as itself) and try to extract Stop Cmds in unblocking manner (try_pop) until received Stop count is not equal to expected queues number
Result:
sample VPL+Preproc+ConstSource doesn't hang anymore
UTs passed
Added troubleshooting info messages
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.
Build configuration