[BEAM-7755] adding support for nested rows and arrays in BigQuery to Beam row conversion#9089
Conversation
|
Could you add a test, which will better illustrate what this change is about? I think you are trying to handle the case of |
|
@amaliujia hey I added a test. I will clean it up later but I wanted to have it done to show you the problem. It is the case you are talking about and that test fails without the change I made. I also want to point out that all the tests in that file there use a certain toBeamRow method and the one I found this issue on did not have tests for. Is the toBeamRow method that goes through Avro not recommended? |
kennknowles
left a comment
There was a problem hiding this comment.
Thanks for working on this. It is a major limitation! I left some comments. Basically to make it work more generally. It will not really be harder than what you have done, just a tiny refactor.
| ret.add( | ||
| convertAvroPrimitiveTypes( | ||
| beamField.getType().getCollectionElementType().getTypeName(), v)); | ||
| FieldType arrayElementType = beamField.getType().getCollectionElementType(); |
There was a problem hiding this comment.
This, and a little more below, can move out of the loop. Actually you can move it out of the method entirely, I think. You don't need to pass the whole Field but it should only depend on the type, right?
There was a problem hiding this comment.
+1
values actually share the same type because it is an ARRAY. At least collection element type can be removed out of loop
| GenericData.Record record = (GenericData.Record) v; | ||
| ret.add(toBeamRow(record, arrayElementType.getRowSchema(), options)); | ||
| } else { | ||
| ret.add(convertAvroPrimitiveTypes(elementTypeName, v)); |
There was a problem hiding this comment.
I think this if / else piece should just be a recursive call into convertAvroFormat. You will need to update it to only need the FieldType and not the whole Field. This should be easy, I think. I just read the code and it does not seem to actually need the whole Field.
There was a problem hiding this comment.
big +1.
Using convertAvroFormat will make sure all other types (not limited to ROW or primitive) can be handled. It that case this is changed to handle ARRAY<T>, not limited to ARRAY<ROW> or ARRAY<primitive type>
There was a problem hiding this comment.
The only spot that needed the entire Field was using its name, so unfortunately if we encounter an error we can't inform the user what Field name caused that error (but we can at least tell them the type)
amaliujia
left a comment
There was a problem hiding this comment.
Thanks for the update since last time.
| ret.add( | ||
| convertAvroPrimitiveTypes( | ||
| beamField.getType().getCollectionElementType().getTypeName(), v)); | ||
| FieldType arrayElementType = beamField.getType().getCollectionElementType(); |
There was a problem hiding this comment.
+1
values actually share the same type because it is an ARRAY. At least collection element type can be removed out of loop
| GenericData.Record record = (GenericData.Record) v; | ||
| ret.add(toBeamRow(record, arrayElementType.getRowSchema(), options)); | ||
| } else { | ||
| ret.add(convertAvroPrimitiveTypes(elementTypeName, v)); |
There was a problem hiding this comment.
big +1.
Using convertAvroFormat will make sure all other types (not limited to ROW or primitive) can be handled. It that case this is changed to handle ARRAY<T>, not limited to ARRAY<ROW> or ARRAY<primitive type>
...ogle-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java
Outdated
Show resolved
Hide resolved
|
Thanks to both of you for the feedback! I updated my branch with some changes and I added one more test as well. |
...o/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java
Outdated
Show resolved
Hide resolved
kennknowles
left a comment
There was a problem hiding this comment.
Very nice improvement! I will wait for the test results and then merge.
|
thanks!! @kennknowles |
Example implementation of repeated record conversion from BigQuery to beam row through avro. Sorry this is my first contribution so I am a bit new to this. I wanted to put this up as this worked on a test table with a REPEATED RECORD that I tried on and maybe just put it on the radar?
R: @lukecwik
R: @chamikaramj
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.