Fix silent failures in Python SDK worker if pickled main session is empty.#14706
Merged
robertwb merged 4 commits intoapache:masterfrom May 7, 2021
Merged
Fix silent failures in Python SDK worker if pickled main session is empty.#14706robertwb merged 4 commits intoapache:masterfrom
robertwb merged 4 commits intoapache:masterfrom
Conversation
Contributor
|
Run PythonLint PreCommit |
Contributor
|
Thanks, this looks like a much better error (and recovery). Hopefully @ihji 's work on modernizing the artifact distribution will fix the root cause. |
robertwb
approved these changes
May 4, 2021
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request fixes a subtle bug that occurs reliably for me on Google Cloud Dataflow. When using
save_main_session, Python SDK workers will attempt to download apickled_main_sessionfile from Google Cloud Storage, and will use that file to reload the__main__session. Pipelines that depend on this functionality expect names defined in__main__to be present at runtime, and will fail withNameErrors if the main session is not available.Unfortunately, roughly 0.5% of the time (in my experience), the Python SDK worker fails to download
pickled_main_sessionfrom GCS to the local worker machine, and an empty file is created in its place. This causes Dill to log anEOFError1, but then continues to load and run the user code, which then fails due to missing names. For small pipelines (under 20 workers) this almost never occurs; but larger pipelines with hundreds of workers often have at least one machine fail to load the__main__session due to this bug, and the subsequentNameErrorexceptions cause the entire batch job to fail.This patch does three things:
CorruptMainSessionException)Admittedly, I don't know if this is the right solution or if it will generalize to all Beam use cases, but I've found this patch to work effectively in my team's pipelines. If a corrupt main session is detected, the worker fails to boot, Cloud Dataflow restarts the container, and the pickled main session loads correctly on the second try. An alternative implementation might be to ensure that the pickled main session's file size matches the size provided by the remote filesystem.
R: @robertwb
R: @lukecwik
1 Example traceback from an empty main session file:
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
ValidatesRunnercompliance status (on master branch)Examples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.