Skip to content

[BEAM-7755] adding support for nested rows and arrays in BigQuery to Beam row conversion#9089

Merged
kennknowles merged 7 commits intoapache:masterfrom
snallapa:repeated-row
Jul 26, 2019
Merged

[BEAM-7755] adding support for nested rows and arrays in BigQuery to Beam row conversion#9089
kennknowles merged 7 commits intoapache:masterfrom
snallapa:repeated-row

Conversation

@snallapa
Copy link
Copy Markdown
Contributor

@snallapa snallapa commented Jul 17, 2019

Example implementation of repeated record conversion from BigQuery to beam row through avro. Sorry this is my first contribution so I am a bit new to this. I wanted to put this up as this worked on a test table with a REPEATED RECORD that I tried on and maybe just put it on the radar?
R: @lukecwik
R: @chamikaramj


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@amaliujia
Copy link
Copy Markdown
Contributor

Could you add a test, which will better illustrate what this change is about?

I think you are trying to handle the case of ROW<ROW<ROW<..>>>

@snallapa
Copy link
Copy Markdown
Contributor Author

@amaliujia hey I added a test. I will clean it up later but I wanted to have it done to show you the problem. It is the case you are talking about and that test fails without the change I made. I also want to point out that all the tests in that file there use a certain toBeamRow method and the one I found this issue on did not have tests for. Is the toBeamRow method that goes through Avro not recommended?

Sahith Nallapareddy added 2 commits July 24, 2019 14:36
Copy link
Copy Markdown
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. It is a major limitation! I left some comments. Basically to make it work more generally. It will not really be harder than what you have done, just a tiny refactor.

ret.add(
convertAvroPrimitiveTypes(
beamField.getType().getCollectionElementType().getTypeName(), v));
FieldType arrayElementType = beamField.getType().getCollectionElementType();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, and a little more below, can move out of the loop. Actually you can move it out of the method entirely, I think. You don't need to pass the whole Field but it should only depend on the type, right?

Copy link
Copy Markdown
Contributor

@amaliujia amaliujia Jul 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

values actually share the same type because it is an ARRAY. At least collection element type can be removed out of loop

GenericData.Record record = (GenericData.Record) v;
ret.add(toBeamRow(record, arrayElementType.getRowSchema(), options));
} else {
ret.add(convertAvroPrimitiveTypes(elementTypeName, v));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this if / else piece should just be a recursive call into convertAvroFormat. You will need to update it to only need the FieldType and not the whole Field. This should be easy, I think. I just read the code and it does not seem to actually need the whole Field.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big +1.

Using convertAvroFormat will make sure all other types (not limited to ROW or primitive) can be handled. It that case this is changed to handle ARRAY<T>, not limited to ARRAY<ROW> or ARRAY<primitive type>

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only spot that needed the entire Field was using its name, so unfortunately if we encounter an error we can't inform the user what Field name caused that error (but we can at least tell them the type)

Copy link
Copy Markdown
Contributor

@amaliujia amaliujia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update since last time.

ret.add(
convertAvroPrimitiveTypes(
beamField.getType().getCollectionElementType().getTypeName(), v));
FieldType arrayElementType = beamField.getType().getCollectionElementType();
Copy link
Copy Markdown
Contributor

@amaliujia amaliujia Jul 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

values actually share the same type because it is an ARRAY. At least collection element type can be removed out of loop

GenericData.Record record = (GenericData.Record) v;
ret.add(toBeamRow(record, arrayElementType.getRowSchema(), options));
} else {
ret.add(convertAvroPrimitiveTypes(elementTypeName, v));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big +1.

Using convertAvroFormat will make sure all other types (not limited to ROW or primitive) can be handled. It that case this is changed to handle ARRAY<T>, not limited to ARRAY<ROW> or ARRAY<primitive type>

@snallapa
Copy link
Copy Markdown
Contributor Author

Thanks to both of you for the feedback! I updated my branch with some changes and I added one more test as well.

Copy link
Copy Markdown
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice improvement! I will wait for the test results and then merge.

@kennknowles kennknowles changed the title [BEAM-7755] adding repeated row implementation that worked on a simple case [BEAM-7755] adding support for nested rows and arrays in BigQuery to Beam row conversion Jul 26, 2019
@kennknowles kennknowles self-requested a review July 26, 2019 19:03
@snallapa
Copy link
Copy Markdown
Contributor Author

thanks!! @kennknowles

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants