Skip to content

Fix silent failures in Python SDK worker if pickled main session is empty.#14706

Merged
robertwb merged 4 commits intoapache:masterfrom
psobot:patch-1
May 7, 2021
Merged

Fix silent failures in Python SDK worker if pickled main session is empty.#14706
robertwb merged 4 commits intoapache:masterfrom
psobot:patch-1

Conversation

@psobot
Copy link
Copy Markdown
Contributor

@psobot psobot commented May 3, 2021

This pull request fixes a subtle bug that occurs reliably for me on Google Cloud Dataflow. When using save_main_session, Python SDK workers will attempt to download a pickled_main_session file from Google Cloud Storage, and will use that file to reload the __main__ session. Pipelines that depend on this functionality expect names defined in __main__ to be present at runtime, and will fail with NameErrors if the main session is not available.

Unfortunately, roughly 0.5% of the time (in my experience), the Python SDK worker fails to download pickled_main_session from GCS to the local worker machine, and an empty file is created in its place. This causes Dill to log an EOFError1, but then continues to load and run the user code, which then fails due to missing names. For small pipelines (under 20 workers) this almost never occurs; but larger pipelines with hundreds of workers often have at least one machine fail to load the __main__ session due to this bug, and the subsequent NameError exceptions cause the entire batch job to fail.

This patch does three things:

  • checks to see if the pickled main session file is empty
  • if so, raises a new type of exception (CorruptMainSessionException)
  • catches that exception, logs it, and then re-raises, causing the worker to fail to boot.

Admittedly, I don't know if this is the right solution or if it will generalize to all Beam use cases, but I've found this patch to work effectively in my team's pipelines. If a corrupt main session is detected, the worker fails to boot, Cloud Dataflow restarts the container, and the pickled main session loads correctly on the second try. An alternative implementation might be to ensure that the pickled main session's file size matches the size provided by the remote filesystem.

R: @robertwb
R: @lukecwik


1 Example traceback from an empty main session file:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 131, in main
    _load_main_session(semi_persistent_directory)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 236, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/pickler.py", line 318, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 368, in load_session
    module = unpickler.load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
EOFError: Ran out of input

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

ValidatesRunner compliance status (on master branch)

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- Build Status Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python --- Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Examples testing status on various runners

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- --- --- --- --- --- ---
Java --- Build Status
Build Status
Build Status
--- --- --- --- ---
Python --- --- --- --- --- --- ---
XLang --- --- --- --- --- --- ---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go Java Python
Build Status Build Status Build Status
Build Status
Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@codecov
Copy link
Copy Markdown

codecov bot commented May 3, 2021

Codecov Report

Merging #14706 (884f0cb) into master (d9da8a4) will decrease coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #14706      +/-   ##
==========================================
- Coverage   83.62%   83.60%   -0.03%     
==========================================
  Files         442      442              
  Lines       59246    59254       +8     
==========================================
- Hits        49545    49537       -8     
- Misses       9701     9717      +16     
Impacted Files Coverage Δ
.../python/apache_beam/io/gcp/datastore/v1new/util.py
...eam/runners/portability/fn_api_runner/fn_runner.py
...uild/srcs/sdks/python/apache_beam/io/gcp/pubsub.py
.../runners/interactive/testing/pipeline_assertion.py
...beam/testing/benchmarks/nexmark/queries/query10.py
.../py38/build/srcs/sdks/python/apache_beam/pvalue.py
..._beam/testing/benchmarks/nexmark/queries/query1.py
...s/sdks/python/apache_beam/runners/worker/logger.py
...hon/apache_beam/runners/worker/bundle_processor.py
...ld/srcs/sdks/python/apache_beam/transforms/util.py
... and 874 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d9da8a4...884f0cb. Read the comment docs.

@robertwb
Copy link
Copy Markdown
Contributor

robertwb commented May 4, 2021

Run PythonLint PreCommit

@robertwb
Copy link
Copy Markdown
Contributor

robertwb commented May 4, 2021

Thanks, this looks like a much better error (and recovery). Hopefully @ihji 's work on modernizing the artifact distribution will fix the root cause.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants