ARROW-8289: [Rust] Parquet Arrow writer with nested support #7319
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also:
0dfae4f to cb740f8
sunchao
left a comment
Thanks @nevi-me, left some comments. I think we should think about how this code can evolve to handle other nested types (especially lists), and whether it will require significant changes and refactoring to achieve that goal.
IMO we can solve this in two steps: 1) have a top-level method to compute def/rep levels for all leaf arrays of a given arrow array, and then 2) use the column writer to write Parquet value/def/rep using the result from the previous step and the input leaf array.
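The hand-off between those two steps could look roughly like the sketch below. Everything in it is hypothetical and self-contained: `LeafLevels`, `MockColumnWriter` and `write_leaf` are illustration names, not types from the parquet crate, and it assumes struct-only nesting so that every level entry has a matching slot in the leaf array (lists break that assumption, since an empty list has levels but no items).

```rust
/// Step 1 output for one leaf column (hypothetical type, not the parquet crate's).
#[derive(Debug, Clone, PartialEq)]
pub struct LeafLevels {
    pub def_levels: Vec<i16>,
    pub rep_levels: Vec<i16>,
    pub max_def_level: i16,
}

/// Stand-in for a Parquet column writer; it only records what it was given.
#[derive(Debug, Default)]
pub struct MockColumnWriter {
    pub values: Vec<i32>,
    pub def_levels: Vec<i16>,
    pub rep_levels: Vec<i16>,
}

impl MockColumnWriter {
    pub fn write_batch(&mut self, values: &[i32], def: &[i16], rep: &[i16]) {
        self.values.extend_from_slice(values);
        self.def_levels.extend_from_slice(def);
        self.rep_levels.extend_from_slice(rep);
    }
}

/// Step 2: keep only the slots whose definition level reaches the maximum,
/// then pass values and levels to the column writer together.
pub fn write_leaf(writer: &mut MockColumnWriter, slots: &[i32], levels: &LeafLevels) {
    assert_eq!(slots.len(), levels.def_levels.len());
    let values: Vec<i32> = slots
        .iter()
        .zip(&levels.def_levels)
        .filter(|(_, def)| **def == levels.max_def_level)
        .map(|(v, _)| *v)
        .collect();
    writer.write_batch(&values, &levels.def_levels, &levels.rep_levels);
}
```

Keeping the level computation separate means the writing step never has to look at the nesting structure again; it only compares each level against the leaf's maximum.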
    struct ArrowWriter {
        writer: SerializedFileWriter<File>,
        rows: i64,
Maybe add a comment on what this is for?
Documented, and removed the rows field. @andygrove I think I copied it as is from your initial draft. Did you want to be able to check how many rows have been written? I think it ends up being the same as the batch's length
        "Parent definition levels must equal array length"
    );
    let levels = (0..len)
        .map(|index| (array.is_valid(index) as i16 + level) * parent_levels[index])
Hmm, is this correct? If I have a deeply nested struct and its parent is null, then its def level is 0?
I've rewritten this and fixed it
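For readers following along, here is one way the propagation can be written so that a null ancestor keeps the level at which nesting stopped instead of collapsing to 0. This is a sketch of the general Parquet rule, not the code that was actually committed; `child_def_levels` and `parent_max` are hypothetical names, and validity is modelled as a plain `&[bool]` rather than an Arrow bitmap.

```rust
/// Definition levels for a nullable child, given its parent's levels.
/// `parent_max` is the level a fully defined parent slot carries.
pub fn child_def_levels(
    parent_levels: &[i16],
    parent_max: i16,
    child_valid: &[bool],
) -> Vec<i16> {
    assert_eq!(
        parent_levels.len(),
        child_valid.len(),
        "Parent definition levels must equal array length"
    );
    parent_levels
        .iter()
        .zip(child_valid)
        .map(|(&parent, &valid)| {
            if parent < parent_max {
                // An ancestor is null: keep the level where the nesting stopped,
                // whatever the child's own validity says.
                parent
            } else {
                // Parent is defined: add one level only if this slot is valid too.
                parent_max + valid as i16
            }
        })
        .collect()
}
```

With three nullable levels `c`, `c.e` and `c.e.f`, a row where `c` is null gets level 0 all the way down, a row where `c` is present but `e` is null gets 1 for `f`, and only a row where all three are valid reaches 3.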
        A: arrow::datatypes::ArrowNumericType,
        T::T: From<A::Native>,
    {
        let mut values = Vec::with_capacity(array.len() - array.null_count());
Not sure if this is the best way to handle this. For one, we can skip the for-loop if all array elements are not-null. Also perhaps we should use array builder for this? Eventually we can have a write_arrow_batch method in column/writer.rs.
Yes, I agree that it's better to avoid this function if there are no nulls, I'll look into this.
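A rough sketch of the fast path being discussed: when there are no nulls the values can be copied wholesale, and the per-element filter only runs otherwise. `non_null_values` is a hypothetical helper working on plain slices; an Arrow array would expose its null count directly, whereas this sketch has to count the `false` entries itself.

```rust
/// Collect the non-null values of a column.
/// `validity` of `None` means the column has no nulls at all.
pub fn non_null_values<T: Copy>(values: &[T], validity: Option<&[bool]>) -> Vec<T> {
    match validity {
        // No validity information: every slot is a value, copy in one go.
        None => values.to_vec(),
        Some(valid) => {
            assert_eq!(values.len(), valid.len());
            let null_count = valid.iter().filter(|v| !**v).count();
            if null_count == 0 {
                // Validity present but all set: same fast path.
                return values.to_vec();
            }
            // Slow path: size the output exactly, then keep valid slots only.
            let mut out = Vec::with_capacity(values.len() - null_count);
            out.extend(
                values
                    .iter()
                    .zip(valid)
                    .filter(|(_, ok)| **ok)
                    .map(|(v, _)| *v),
            );
            out
        }
    }
}
```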
FWIW, this is the approach the C++ code takes (supports arbitrary nesting). One thing to note which is currently a bug in C++ is once rep/def levels are computed for any anything
Small comment about the API, something I found necessary was being able to pass in props to the writer interface: urbanlogiq@5e08cf6 The idea here is so that we can set the created_by attributes and compression, as an example.
Oh! One more :) This one uses the ParquetWriter trait instead of std::fs::File: urbanlogiq@e15851e (our primary use case is writing parquet files to memory, because they immediately get persisted to object storage like AWS S3, so we never deal with local files)
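To illustrate the shape of both suggestions (caller-supplied properties and a writer that is not tied to `std::fs::File`), here is a minimal sketch. It uses `std::io::Write` as a stand-in for the ParquetWriter trait mentioned above, and `SketchWriter`, `write_chunk` and the toy byte layout are all hypothetical; nothing here is the API from the linked commits.

```rust
use std::io::{self, Write};

/// Hypothetical writer that is generic over its sink instead of owning a File.
pub struct SketchWriter<W: Write> {
    sink: W,
    created_by: String,
}

impl<W: Write> SketchWriter<W> {
    /// `created_by` stands in for the writer properties a caller may want to set.
    pub fn new(sink: W, created_by: &str) -> Self {
        Self { sink, created_by: created_by.to_string() }
    }

    /// Write a toy length-prefixed payload; a real writer would emit Parquet pages.
    pub fn write_chunk(&mut self, bytes: &[u8]) -> io::Result<()> {
        self.sink.write_all(&(bytes.len() as u32).to_le_bytes())?;
        self.sink.write_all(bytes)
    }

    /// Write a toy footer carrying `created_by`, flush, and return the sink.
    pub fn close(mut self) -> io::Result<W> {
        self.sink.write_all(self.created_by.as_bytes())?;
        self.sink.flush()?;
        Ok(self.sink)
    }
}
```

Because `Vec<u8>` implements `Write`, the same writer can target memory for an upload to object storage or a `File` for local output, with no change to the writing code.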
@maxburke there's been some interest from other people on this PR. I haven't been able to continue working on it because, when I have a bit of free time, I've been looking at the IPC/integration issues (Rust doesn't work with 0.15+ files). Please feel free to push changes against this PR, or to open a PR against my fork with upstream changes. There's also someone who reached out to me on Twitter asking how they can continue with this.
cb740f8 to 0c3aad7
@emkornfield not sure if I understand this part, I'll try to create a nested batch with a few levels, and have one record have the top level be nested. Would this cover the case above? I might also be limited by https://issues.apache.org/jira/browse/ARROW-5408 for now
@sunchao @andygrove (CC @wesm @kszucs @emkornfield) in the past few months we haven't had enough review bandwidth on Rust's Parquet implementation (mostly relying on Chao for non-trivial reviews), and given the amount of work needed for an Arrow writer + the interest so far (I think a few people are already using this fork), I'd like to propose:
In terms of this PR, I think I've gotten arbitrary nesting covered, but there's a lot more work that we can now divide more easily so others can contribute better. I'm also unsure of how to test deeply nested arrays directly in the code (I had to use Spark because the Arrow reader doesn't yet support that). I'll also bring this up on the mailing list for wider visibility
@nevi-me Sounds good to me. Thanks.
* compute def and rep levels for all arrays in a batch at once
* support arbitrary nesting of lists and structs
* add doc comments on structs and functions
9d70836 to 2573302
There are two bugs in C++ (one with an open PR). The first bug is if you have a schema like The second bug we have with no PR. if you have a schema |
    ArrowDataType::FixedSizeBinary(_) => unimplemented!(),
    ArrowDataType::LargeBinary => unimplemented!(),
    ArrowDataType::List(_) | ArrowDataType::LargeList(_) => {
        // a list can either be nested or flat. If it is flat, def and rep lengths will be the length of the list's items
This isn't true. You can have many "empty" lists that have no items; each empty list still adds a def and rep level.
Correct, this comment is incorrect. I wrote it before I tried to write an empty list. I'll remove it.
The arrow_writer_list test covers this scenario
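The point about empty lists can be made concrete with a small sketch. It assumes a nullable list of non-null items, so definition level 0 means a null list, 1 an empty list and 2 a present item, with a maximum repetition level of 1. `list_levels` is a hypothetical helper over plain offsets and a `&[bool]` validity slice, and it assumes null lists span zero items in the offsets.

```rust
/// Definition and repetition levels for a nullable list of non-null items.
/// `offsets` has one more entry than there are lists, as in Arrow list arrays.
pub fn list_levels(offsets: &[usize], list_valid: &[bool]) -> (Vec<i16>, Vec<i16>) {
    assert_eq!(offsets.len(), list_valid.len() + 1);
    let mut def = Vec::new();
    let mut rep = Vec::new();
    for (i, &valid) in list_valid.iter().enumerate() {
        let len = offsets[i + 1] - offsets[i];
        if !valid {
            // Null list: one level entry, no item.
            def.push(0);
            rep.push(0);
        } else if len == 0 {
            // Empty list: still one level entry, no item.
            def.push(1);
            rep.push(0);
        } else {
            for j in 0..len {
                def.push(2);
                // The first item starts a new record, the rest continue it.
                rep.push(if j == 0 { 0 } else { 1 });
            }
        }
    }
    (def, rep)
}
```

For the lists `[[1, 2], [], null, [3]]` this yields five level entries for only three items, which is why the level vectors cannot simply be sized from the item count.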
In C++ I covered at least partial testing by testing rep and def level generation directly
+1 on the approach as well. Thanks @nevi-me for all the efforts!
* Preliminary fleshing out of writers for binary/string types
* Add test for writing string and binary data to parquet from arrow
* Incorporate feedback
Merged as 80a9c02, not sure why the merge tool didn't close the issue
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.

___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet
a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Max Burke <max@urbanlogiq.com>
Co-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: Max Burke <maxburke@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
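As a worked example of how definition levels relate to the `test_complex.parquet` schema in the description, the sketch below computes the maximum definition level of each leaf by adding one for every nullable field on the path from the root. `Field` here is a hypothetical stand-in, not Arrow's `Field`, and only struct nesting is modelled (lists would also add repetition levels).

```rust
/// Hypothetical schema node: a leaf has no children, a struct has some.
pub struct Field {
    pub name: &'static str,
    pub nullable: bool,
    pub children: Vec<Field>,
}

impl Field {
    pub fn leaf(name: &'static str, nullable: bool) -> Self {
        Field { name, nullable, children: vec![] }
    }
    pub fn group(name: &'static str, nullable: bool, children: Vec<Field>) -> Self {
        Field { name, nullable, children }
    }
}

/// Append `(dotted path, max definition level)` for every leaf under `field`.
pub fn leaf_max_def_levels(
    field: &Field,
    parent_level: i16,
    prefix: &str,
    out: &mut Vec<(String, i16)>,
) {
    // Each nullable field on the path contributes one definition level.
    let level = parent_level + field.nullable as i16;
    let path = if prefix.is_empty() {
        field.name.to_string()
    } else {
        format!("{}.{}", prefix, field.name)
    };
    if field.children.is_empty() {
        out.push((path, level));
    } else {
        for child in &field.children {
            leaf_max_def_levels(child, level, &path, out);
        }
    }
}
```

Applied to the schema above, `a` (not null) needs no definition levels at all, `b` and `c.d` have a maximum of 1, and `c.e.f` has a maximum of 2 because both `e` and `f` are nullable while `c` is not.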
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. 
CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>