
Softens the GBEK determinism requirement#36495

Merged
damccorm merged 3 commits into master from users/damccorm/softenDeterminismRequirement
Oct 14, 2025

@damccorm
Contributor

This downgrades the determinism requirement for GBEK (GroupByEncryptedKey) coders from an error to a warning. This matches what GBK does today, which matters because users should be able to drop in a --gbek pipeline option and have things just work.
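As a rough illustration of the error-to-warning change described above, here is a minimal, Beam-free sketch. It is not the code in this PR: `SketchCoder`, `resolve_key_coder`, and the `strict` flag are hypothetical names invented for the example, and the warning text is abbreviated from the message quoted in the traceback in this description.

```python
import logging

_LOGGER = logging.getLogger("gbek_sketch")


class SketchCoder:
    """Hypothetical stand-in for a coder; this is NOT Beam's Coder class."""

    def __init__(self, encode_fn, deterministic):
        self._encode_fn = encode_fn
        self._deterministic = deterministic

    def is_deterministic(self):
        return self._deterministic

    def encode(self, value):
        return self._encode_fn(value)


def resolve_key_coder(coder, step_label, strict=False):
    """Return the coder to use for grouping.

    strict=True models the old behavior (raise on a non-deterministic coder);
    strict=False models the softened behavior (log a warning and carry on).
    """
    if coder.is_deterministic():
        return coder
    message = (
        f"The key coder for {step_label!r} is not deterministic. "
        "This may result in incorrect pipeline output."
    )
    if strict:
        raise TypeError(message)
    _LOGGER.warning(message)
    return coder


# A coder that makes no determinism promise (assumption for illustration only).
repr_coder = SketchCoder(lambda v: repr(v).encode("utf-8"), deterministic=False)

# Softened behavior: the same coder comes back and a warning is logged.
chosen = resolve_key_coder(repr_coder, "GroupByEncryptedKey")
```

With `strict=True` the same call raises `TypeError`, which corresponds to the failure mode shown in the traceback; with the default it only logs.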

Today, some of our built-in Beam transforms fail while the check is still an error. For example, without this change, testDataframeSum fails with:

java.lang.RuntimeException: Traceback (most recent call last):
  File "apache_beam/coders/coder_impl.py", line 540, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_special_deterministic
  File "apache_beam/coders/coder_impl.py", line 460, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 481, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 544, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_special_deterministic
TypeError: Unable to deterministically encode 'BlockManager
Items: Index(['b'], dtype='object')
Axis 1: Index([100], dtype='int64', name='a')
NumpyBlock: slice(0, 1, 1), 1 x 1, dtype: int32' of type '<class 'pandas.core.internals.managers.BlockManager'>', please provide a type hint for the input of 'GroupByEncryptedKey Group by encrypted keyThe key coder is not deterministic. This may result in incorrect pipeline output. This can be fixed by adding a type hint to the operation preceding the GroupByKey step, and for custom key classes, by writing a deterministic custom Coder. Please see the documentation for more details.'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1498, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 684, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1673, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "/usr/local/lib/python3.13/site-packages/apache_beam/transforms/util.py", line 444, in process
    encoded_value = self.value_coder.encode(v)
  File "/usr/local/lib/python3.13/site-packages/apache_beam/coders/coders.py", line 459, in encode
    return self.get_impl().encode(value)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "apache_beam/coders/coder_impl.py", line 237, in apache_beam.coders.coder_impl.StreamCoderImpl.encode
  File "apache_beam/coders/coder_impl.py", line 240, in apache_beam.coders.coder_impl.StreamCoderImpl.encode
  File "apache_beam/coders/coder_impl.py", line 1120, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 481, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 542, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_special_deterministic
TypeError: Unable to deterministically encode '     b
a     
100  3' of type '<class 'pandas.core.frame.DataFrame'>', please provide a type hint for the input of 'GroupByEncryptedKey Group by encrypted keyThe key coder is not deterministic. This may result in incorrect pipeline output. This can be fixed by adding a type hint to the operation preceding the GroupByKey step, and for custom key classes, by writing a deterministic custom Coder. Please see the documentation for more details.'

During handling of the above exception, another exception occurred:

I'd assume other dataframe tests fail similarly.
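The message quoted in the traceback warns that a non-deterministic key coder "may result in incorrect pipeline output". The sketch below shows that risk in plain Python under simplifying assumptions: grouping is modeled as bucketing by encoded key bytes, and JSON stands in for a coder. The helpers `group_by_encoded_key`, `order_sensitive_encode`, and `canonical_encode` are hypothetical and are not part of Beam or of this PR.

```python
import json
from collections import defaultdict


def group_by_encoded_key(pairs, encode_key):
    """Group values by the bytes that encode_key produces for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[encode_key(key)].append(value)
    return dict(groups)


def order_sensitive_encode(key):
    # Equal dicts built in a different insertion order encode to
    # different bytes, so this encoding is not deterministic per key.
    return json.dumps(key).encode("utf-8")


def canonical_encode(key):
    # Sorting the dict keys gives equal dicts identical bytes.
    return json.dumps(key, sort_keys=True).encode("utf-8")


# Two keys that compare equal but were built in a different order.
pairs = [({"a": 1, "b": 2}, "x"), ({"b": 2, "a": 1}, "y")]

split = group_by_encoded_key(pairs, order_sensitive_encode)
merged = group_by_encoded_key(pairs, canonical_encode)

print(len(split), len(merged))  # prints "2 1"
```

The order-sensitive encoding splits one logical key into two groups, while the canonical encoding keeps them together. That is the kind of silent mis-grouping the warning is meant to flag.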



@gemini-code-assist
Contributor

Summary of Changes

Hello @damccorm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request modifies the behavior of Group By Encrypted Key (GBEK) coders in Apache Beam. Previously, a non-deterministic key coder would cause a pipeline to fail with an error. This change softens that requirement, converting the error into a warning. This adjustment aims to improve usability, allowing existing transforms, particularly those involving data structures like Pandas DataFrames, to function correctly with GBEK without immediate pipeline termination, aligning GBEK's behavior with that of standard Group By Key (GBK) operations.

Highlights

  • GBEK Determinism Requirement: The determinism requirement for Group By Encrypted Key (GBEK) coders has been softened from an error to a warning. This means that if a key coder is not deterministic, the pipeline will now log a warning instead of failing.
  • Behavior Alignment: This change aligns GBEK's behavior with that of standard Group By Key (GBK) operations, where non-deterministic coders typically result in warnings rather than errors.
  • Improved Compatibility: The modification prevents pipeline failures with certain built-in Beam transforms, particularly those involving complex data types like Pandas DataFrames, which previously caused TypeError exceptions due to non-deterministic encoding.

@damccorm damccorm marked this pull request as draft October 13, 2025 19:51
@damccorm damccorm marked this pull request as ready for review October 13, 2025 20:01
@damccorm
Contributor Author

R: @claudevdm

@github-actions
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@damccorm damccorm merged commit 385271b into master Oct 14, 2025
127 of 144 checks passed
@damccorm damccorm deleted the users/damccorm/softenDeterminismRequirement branch October 14, 2025 15:39

2 participants