Remove constraint introduced in #30059#30186
Conversation
|
This would need a cherry pick because it breaks Python BigQueryIO storage write. |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
| if (numStreams < 1) { | ||
| throw new IllegalStateException( | ||
| "numStreams must be set to a positive integer when input data is unbounded."); | ||
| } |
There was a problem hiding this comment.
This check was added because any Python SDK (or Beam Yaml) user who uses the ReadFromBigQuery with Storage API and unbounded data on Dataflow (at least) will not have data written if numStreams is not set to a positive integer. Digging through the logs will show there are warnings about crash loopback due to 2 stateful DoFn's not being able to be fused, but this is not a clear indicator to the issue. This check prevents the job from launching in the first place to avoid user frustration.
There was a problem hiding this comment.
Runner-specific checks should be done by the runner itself (in this case, this should be stopped on the Dataflow side). We'd rather not restrict all runners.
There was a problem hiding this comment.
This check is too broad and breaking exisiting tests has been working on direct runner and dataflow: https://github.com/apache/beam/runs/20839188440 https://github.com/apache/beam/runs/20815247023
On the other hand, there was a similar restriction added to Dataflow side also breaks users and have been rolled back: b/322741233
We were yet to fully understand the cause of "crash loopback ". Using GroupIntoBatches with autosharding alone won't cause issue
There was a problem hiding this comment.
P.S. when DataflowRunner V2 gets fixed, users will be able to set autosharding. If we restrict it on this Beam version, we unnecessarily mark the version as unusable for autosharding even with a fixed Runner V2.
There was a problem hiding this comment.
Both really good points. I just wanted there to be a way to fail earlier (unfortunately xlang doesn't pick up runner implementation from what I could see to check dataflow specifically, but sounds like that isn't feasible. LGTM
Fixes #30159
Unless we fully understand the limitation this kind of proactive restriction has chance to breaking existing use case.
Please add a meaningful description for your change here
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.