Track bundle processors that are pending creation and terminate SDK if creating a BP exceeds a timeout.#36518
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #36518 +/- ##
=============================================
- Coverage 55.08% 36.21% -18.87%
Complexity 1667 1667
=============================================
Files 1059 1059
Lines 165349 165388 +39
Branches 1195 1195
=============================================
- Hits 91075 59893 -31182
- Misses 72103 103324 +31221
Partials 2171 2171
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…f creating a BP exceeds a timeout.
|
R: @scwhittle |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
| state += 'ProcessBundleDescriptorId: %s\n' % bundle_id | ||
| state += "tracked thread: %s\n" % thread | ||
| state += "time since creation started: %.2f seconds\n" % ( | ||
| time.time() - creation_time) |
| else: | ||
| step_name_log = '' | ||
| stack_trace = self._get_stack_trace(sampler_info) | ||
| stack_trace = self._get_stack_trace(sampler_info.tracked_thread) |
There was a problem hiding this comment.
this was getattr(sampler_info, 'tracked_thread', None) before
are we guaranteed that this attr exists?
There was a problem hiding this comment.
yes, it is a namedtuple,
. I also got puzzled by it and dug up a pr from ~6yrs ago where this line of code was introduced and the rationale was 'just in case'.Also, above
we have unprotected access.| terminate_sdk_harness() | ||
|
|
||
| if (time_since_creation_ns > self.log_lull_timeout_ns and | ||
| self._passed_lull_timeout_since_last_log()): |
There was a problem hiding this comment.
I think there is a bug in _log_lull_sampler_info below
I think the _passed_lull_timeout_since_last_log() should be ordered after checking if time_since_transition > log_lull_timeout_ns
There was a problem hiding this comment.
yeah, reordering is a better logic. thanks.
* Track bundle processors that are pending creation and terminate SDK if creating a BP exceeds a timeout. * Rename the term * Remove unnecessary conditions. * add tests * Address comments * Also add a test for logging a lull in process.
This PR adds instrumentation to visualize more prominently stack traces for operations that are slow or stuck when initializing a DoFn instance and calling DoFn.setup().
Such slow operations will now print a 'processing is stuck' log entry periodically (aka processing lull logging), will show up in sdk_status responses (e.g. Dataflow debug capture) or
curl localhost:8081/sdk_status, and additionally the recently introduced--element_processing_timeoutoption, if specified, will also interrupt such slow operations and restart the sdk harness.