Support custom avro DatumReader when reading from BigQuery#22718
Support custom avro DatumReader when reading from BigQuery#22718steveniemitz merged 21 commits intoapache:masterfrom
Conversation
|
Assigning reviewers. If you would like to opt out of this review, comment R: @kennknowles for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
|
can we plumb this down into BigQueryStorageAvroReader as well so it'll work with direct reads? |
There was a problem hiding this comment.
can we make the reader schema optional, and use the writer schema if its not set? Thats how AvroIO works IIRC.
There was a problem hiding this comment.
It seems like AvroIO derives it in case the class is of SpecificData subType. We could do the same but we would need to take the class type as input instead (don't think its cleaner though).
Also, readerSchema is needed by AvroSource in general, as it does validation based on that, if parserFn is not set,
as well as derives the AvroCoder based on the readerSchema. Since it gets the writer schema during runtime from the file metadata, we cannot assign reader schema as writer schema on the submitter side.
yeah, i was thinking of tackling that in another review. I could add in this one too to allow this for different read types. |
that's cool, you can do it in another review, just curious. |
|
Reminder, please take a look at this pr: @kennknowles @chamikaramj |
|
R: @chamikaramj |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
This reverts commit 80da4eff8bf674f3f27eee5733a9ec714dba1746.
This reverts commit b2b4737.
Codecov Report
@@ Coverage Diff @@
## master #22718 +/- ##
==========================================
+ Coverage 73.44% 83.40% +9.95%
==========================================
Files 718 480 -238
Lines 95680 66483 -29197
==========================================
- Hits 70273 55447 -14826
+ Misses 24096 11036 -13060
+ Partials 1311 0 -1311
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
|
Run Java PreCommit |
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Show resolved
Hide resolved
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
|
Run Java PreCommit |
|
Run Java_Examples_Dataflow PreCommit |
|
Run Java_Examples_Dataflow_Java11 PreCommit |
|
Run Java_Examples_Dataflow_Java17 PreCommit |
|
Run Java PreCommit |
|
Run Java PreCommit |
|
I ran and profiled a few dataflow jobs with and without my changes to test for performance regression. Overall the profiling data looks good for these new changes. |
|
I believe this PR broke Java Dataflow PostCommits: https://ci-beam.apache.org/job/beam_PostCommit_Java_DataflowV1/2173/testReport/. See #23541 |
Similar to 11479 this PR adds functionality to directly deserialize Avro records to the target class, via the user specified DatumReader, in BigQueryIO.
R: @pabloem @chamikaramj @steveniemitz
This fixes #22717
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.