Skip to content

Track bundle processors that are pending creation and terminate SDK if creating a BP exceeds a timeout.#36518

Merged
tvalentyn merged 6 commits intoapache:masterfrom
tvalentyn:dofn_start
Oct 16, 2025
Merged

Track bundle processors that are pending creation and terminate SDK if creating a BP exceeds a timeout.#36518
tvalentyn merged 6 commits intoapache:masterfrom
tvalentyn:dofn_start

Conversation

@tvalentyn
Copy link
Copy Markdown
Contributor

@tvalentyn tvalentyn commented Oct 14, 2025

This PR adds instrumentation to visualize more prominently stack traces for operations that are slow or stuck when initializing a DoFn instance and calling DoFn.setup().

Such slow operations will now print a 'processing is stuck' log entry periodically (aka processing lull logging), will show up in sdk_status responses (e.g. Dataflow debug capture) or curl localhost:8081/sdk_status , and additionally the recently introduced --element_processing_timeout option, if specified, will also interrupt such slow operations and restart the sdk harness.

@codecov
Copy link
Copy Markdown

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 17.50000% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.21%. Comparing base (95dcaea) to head (7e28e04).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...python/apache_beam/runners/worker/worker_status.py 10.81% 33 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #36518       +/-   ##
=============================================
- Coverage     55.08%   36.21%   -18.87%     
  Complexity     1667     1667               
=============================================
  Files          1059     1059               
  Lines        165349   165388       +39     
  Branches       1195     1195               
=============================================
- Hits          91075    59893    -31182     
- Misses        72103   103324    +31221     
  Partials       2171     2171               
Flag Coverage Δ
python 40.51% <17.50%> (-40.50%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@tvalentyn tvalentyn marked this pull request as ready for review October 15, 2025 01:47
@tvalentyn
Copy link
Copy Markdown
Contributor Author

R: @scwhittle

@github-actions
Copy link
Copy Markdown
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

state += 'ProcessBundleDescriptorId: %s\n' % bundle_id
state += "tracked thread: %s\n" % thread
state += "time since creation started: %.2f seconds\n" % (
time.time() - creation_time)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call time.time() once?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

else:
step_name_log = ''
stack_trace = self._get_stack_trace(sampler_info)
stack_trace = self._get_stack_trace(sampler_info.tracked_thread)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was getattr(sampler_info, 'tracked_thread', None) before

are we guaranteed that this attr exists?

Copy link
Copy Markdown
Contributor Author

@tvalentyn tvalentyn Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is a namedtuple,

. I also got puzzled by it and dug up a pr from ~6yrs ago where this line of code was introduced and the rationale was 'just in case'.

Also, above

we have unprotected access.

terminate_sdk_harness()

if (time_since_creation_ns > self.log_lull_timeout_ns and
self._passed_lull_timeout_since_last_log()):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a bug in _log_lull_sampler_info below
I think the _passed_lull_timeout_since_last_log() should be ordered after checking if time_since_transition > log_lull_timeout_ns

Copy link
Copy Markdown
Contributor Author

@tvalentyn tvalentyn Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, reordering is a better logic. thanks.

@tvalentyn tvalentyn merged commit f07ccf3 into apache:master Oct 16, 2025
105 of 118 checks passed
@tvalentyn tvalentyn deleted the dofn_start branch October 16, 2025 17:10
damccorm pushed a commit that referenced this pull request Oct 20, 2025
* Track bundle processors that are pending creation and terminate SDK if creating a BP exceeds a timeout.

* Rename the term

* Remove unnecessary conditions.

* add tests

* Address comments

* Also add a test for logging a lull in process.
@gemini-code-assist gemini-code-assist bot mentioned this pull request Oct 20, 2025
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants