Skip to content

Remove constraint introduced in #30059#30186

Merged
Abacn merged 1 commit intoapache:masterfrom
Abacn:removeconstrant
Feb 1, 2024
Merged

Remove constraint introduced in #30059#30186
Abacn merged 1 commit intoapache:masterfrom
Abacn:removeconstrant

Conversation

@Abacn
Copy link
Copy Markdown
Contributor

@Abacn Abacn commented Feb 1, 2024

Fixes #30159

Unless we fully understand the limitation this kind of proactive restriction has chance to breaking existing use case.

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@Abacn
Copy link
Copy Markdown
Contributor Author

Abacn commented Feb 1, 2024

R: @lostluck @ahmedabu98

This would need a cherry pick because it breaks Python BigQueryIO storage write.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 1, 2024

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

Comment on lines -388 to -391
if (numStreams < 1) {
throw new IllegalStateException(
"numStreams must be set to a positive integer when input data is unbounded.");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check was added because any Python SDK (or Beam Yaml) user who uses the ReadFromBigQuery with Storage API and unbounded data on Dataflow (at least) will not have data written if numStreams is not set to a positive integer. Digging through the logs will show there are warnings about crash loopback due to 2 stateful DoFn's not being able to be fused, but this is not a clear indicator to the issue. This check prevents the job from launching in the first place to avoid user frustration.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runner-specific checks should be done by the runner itself (in this case, this should be stopped on the Dataflow side). We'd rather not restrict all runners.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is too broad and breaking exisiting tests has been working on direct runner and dataflow: https://github.com/apache/beam/runs/20839188440 https://github.com/apache/beam/runs/20815247023

On the other hand, there was a similar restriction added to Dataflow side also breaks users and have been rolled back: b/322741233

We were yet to fully understand the cause of "crash loopback ". Using GroupIntoBatches with autosharding alone won't cause issue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. when DataflowRunner V2 gets fixed, users will be able to set autosharding. If we restrict it on this Beam version, we unnecessarily mark the version as unusable for autosharding even with a fixed Runner V2.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both really good points. I just wanted there to be a way to fail earlier (unfortunately xlang doesn't pick up runner implementation from what I could see to check dataflow specifically, but sounds like that isn't feasible. LGTM

@Abacn Abacn merged commit 1d9f604 into apache:master Feb 1, 2024
@Abacn Abacn deleted the removeconstrant branch February 1, 2024 17:42
Abacn added a commit to Abacn/beam that referenced this pull request Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Failing Test]: PostCommit Python Xlang Gcp Direct and Dataflow failing due to #30059

3 participants