-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Fix RowConverter panic when encoding DictionaryArrays in StructArray / ListArray
#7627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @ding-young
I am sorry I don't fully understand what is wrong and what this PR is proposing.
Although RowConverter flattens data type on converting columns into rows, it builds array with original SortField which contains unflattened types when converting rows back into columns.
I believe the intention is to build arrays with the original DataTypes.
Specifically, I expect the RowConverter to have the property that if you convert Arrays to Rows and then those Rows back to Arrays, the arrays are the same that went in
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
|
👋 @ding-young what is the status of this PR? do you mind if I give it a shot? it is causing errors in |
|
Hi @LiaCastaneda :) I just ran the reproducer test, and it passes without error on the fix-row-converter branch. However, this PR takes the approach of correcting the data type, rather than preserving the original dictionary encoding. So #7627 (comment) the resulting type is flattened, like mentioned here #7169 (comment). I can easily follow up on the minor review comments in a day, but currently I don’t have time to rewrite the decoding logic to preserve the original encoding like @alamb suggested. What kind of solution were you thinking of? If your main goal is just to suppress the warning and run the reproducer query, then I can quickly finish up this PR and ping you again. But if you’re planning to work on a solution that preserves the original dictionary encoding, feel free to take it over! Let me know what you think :) |
|
Guiding myself on @alamb 's comment, I understand the correct solution is to preserve Dict encoding so I was looking into following that approach. |
|
I'm sketching a solution for this, and I'm wondering if we want to keep the exact same encoding, one thing that I think we could do is manually build a Dictionary using a DictionaryBuilder or perform a cast using Update: I tried the cast approach here https://github.com/apache/arrow-rs/pull/8067/files and I believe the encoding would be the same, if not the test I included would not pass. I will try to open the PR later today or tomorrow |
830a059 to
ed4c20e
Compare
|
If #8067 (comment) that's the case, would you @tustvold mind taking a look when you get a chance? Plus, let me know if there’s anything else I can help with :) @LiaCastaneda . |
|
Thanks for readressing the PR @ding-young 🙇♀️ |
210922c to
847703e
Compare
ding-young
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the unit tests to compare each value and check if the data (logically) remains the same. thanks @gabotechs @alamb
Yes, I think keeping the values logically the same (with the same datatype) is what is desired |
|
I am starting to check this PR out |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ding-young and @LiaCastaneda
After some more review, I think this PR follows the guidance of @tustvold on #7165 (comment)
all dictionaries are expanded on encode, so I would expect convert_rows to product a List of Int32.
I am sorry I was confused earlier about the intended design. I left a few comments, but I think this PR is very close.
I also think it is important to improve the documentation to clarify the expected behavior in this case. I have created a proposed PR here:
arrow-row/src/lib.rs
Outdated
| vec![Box::new(builder)], | ||
| ); | ||
| let dict_builder = struct_builder | ||
| .field_builder::<PrimitiveDictionaryBuilder<Int32Type, Int32Type>>(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please change this so it uses different types for the keys and values -- otherwise it is hard to keep track of the key and values in the code below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the dictionary type to Dictionary(Int32, Utf8) to improve readability. Let me know if anything in the test is still confusing or hard to follow.
DictionaryArrays in StructArrays and ListArrays
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ding-young and @LiaCastaneda
DictionaryArrays in StructArrays and ListArraysDictionaryArrays in StructArray / ListArray
# Which issue does this PR close? - related to #7627 - Related to #4811 # Rationale for this change It was not clear to me what the expected behavior for round trip through row converter was for DictionaryArrays, so let's document what @tustvold says here: #8067 (comment) > I think the issue is that Datafusion is not handling the fact that row encoding "hydrates" dictionaries. It should be updated to understand that List<Dictionary<...>> will be converted to List<...>, much like it already handles this for the non-nested case. Converting back to a dictionary is expensive, and likely pointless, not to mention a breaking change. # What changes are included in this PR? Document expected behavior with english comments and doc test # Are these changes tested? Yes (doctests) # Are there any user-facing changes? More docs, no behavior change

Which issue does this PR close?
RowConverter::convert_rowspanics when decodingList(Dictionary)#7165RowConverter::convert_rowscan return an invalid array ifStructcontainsDictionary#7169Rationale for this change
Although
RowConverterflattens data type on converting columns into rows, it builds array with originalSortFieldwhich contains unflattened types when converting rows back into columns. Therefore, output array has inconsistent data type although it is actually flattened.What changes are included in this PR?
When decoding columns, instead of using original
SortField, it uses new field with updated data type of childArrayData, which is flattened.I've also considered alternative approaches like recursively modify all the
fieldsonconvert_rowsorconvert_raw, but considering that we already visit each field recursively, I just corrected the field indecode_columnfor simplicity.I'd be happy to hear about the feedback on correctness of this pr and any other suggestions.
Are there any user-facing changes?