Fix RowConverter panic when encoding `DictionaryArray`s in `StructArray` / `ListArray` #7627

ding-young · 2025-06-08T15:45:24Z

Which issue does this PR close?

Closes RowConverter::convert_rows panics when decoding List(Dictionary) #7165
Closes RowConverter::convert_rows can return an invalid array if Struct contains Dictionary #7169

Rationale for this change

Although RowConverter flattens data types when converting columns into rows, it builds the output array with the original SortField, which still contains the unflattened types, when converting rows back into columns. As a result, the output array declares a data type that is inconsistent with its actual, flattened contents.

What changes are included in this PR?

When decoding columns, instead of using the original SortField, it uses a new field whose data type is taken from the child ArrayData, which is flattened.

I also considered alternative approaches, such as recursively modifying all the fields in convert_rows or convert_raw, but since we already visit each field recursively, I just corrected the field in decode_column for simplicity.

I'd be happy to hear feedback on the correctness of this PR and any other suggestions.
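To make "flattened" concrete, here is a minimal sketch of how a nested type changes once every dictionary is replaced by its value type. The `DataType` enum and the `flatten` function are simplified, hypothetical stand-ins written for illustration; they are not arrow's `DataType` and not the code in this PR, which reads the type from the decoded child data instead of recomputing it.

```rust
/// Simplified, hypothetical stand-in for arrow's `DataType` (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    /// (key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
    List(Box<DataType>),
    /// (field name, field type) pairs
    Struct(Vec<(String, DataType)>),
}

/// Recursively replace every dictionary type with its value type,
/// descending into list and struct children.
fn flatten(dt: &DataType) -> DataType {
    match dt {
        DataType::Dictionary(_, value) => flatten(value),
        DataType::List(child) => DataType::List(Box::new(flatten(child))),
        DataType::Struct(fields) => DataType::Struct(
            fields
                .iter()
                .map(|(name, child)| (name.clone(), flatten(child)))
                .collect(),
        ),
        // Non-nested, non-dictionary types are returned unchanged.
        other => other.clone(),
    }
}
```

Under these assumptions, a `List(Dictionary(Int32, Utf8))` flattens to `List(Utf8)`, which is the type the decoded array should report.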

Are there any user-facing changes?

arrow-row/src/lib.rs

alamb

Thank you for the contribution @ding-young

I am sorry I don't fully understand what is wrong and what this PR is proposing.

Although RowConverter flattens data type on converting columns into rows, it builds array with original SortField which contains unflattened types when converting rows back into columns.

I believe the intention is to build arrays with the original DataTypes.

Specifically, I expect the RowConverter to have the property that if you convert Arrays to Rows and then those Rows back to Arrays, the arrays are the same as the ones that went in.

arrow-row/src/lib.rs

alamb · 2025-06-16T10:21:39Z

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

LiaCastaneda · 2025-08-06T11:38:18Z

👋 @ding-young what is the status of this PR? Do you mind if I give it a shot? It is causing errors in DataFusion: apache/datafusion#17012

ding-young · 2025-08-06T12:59:53Z

Hi @LiaCastaneda :)

I just ran the reproducer test, and it passes without error on the fix-row-converter branch.

However, this PR takes the approach of correcting the data type rather than preserving the original dictionary encoding. So the resulting type is flattened (#7627 (comment)), as mentioned here: #7169 (comment).

I can easily follow up on the minor review comments in a day, but currently I don’t have time to rewrite the decoding logic to preserve the original encoding like @alamb suggested.

What kind of solution were you thinking of? If your main goal is just to suppress the warning and run the reproducer query, then I can quickly finish up this PR and ping you again. But if you’re planning to work on a solution that preserves the original dictionary encoding, feel free to take it over!

Let me know what you think :)

LiaCastaneda · 2025-08-06T13:11:59Z

Going by @alamb's comment, I understand the correct solution is to preserve the Dict encoding, so I was looking into following that approach.

LiaCastaneda · 2025-08-06T14:16:05Z

I'm sketching a solution for this, and I'm wondering whether we want to keep the exact same encoding. One thing we could do is manually build a Dictionary using a DictionaryBuilder, or perform a cast using arrow_cast::cast, which does the same under the hood. This would keep the DataType, but I'm not sure it would have the exact same encoding as the original Dict Array. If we want to keep the exact same Dict encoding, then we would probably need to keep the original DictionaryArray values somewhere before or while encoding the Array to rows.

Update: I tried the cast approach here https://github.com/apache/arrow-rs/pull/8067/files and I believe the encoding would be the same; otherwise the test I included would not pass. I will try to open the PR later today or tomorrow.
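The distinction between "same DataType" and "same encoding" can be illustrated with a small sketch. `DictColumn`, `hydrate`, and `reencode` are hypothetical toy names, and assigning keys in first-seen order is an assumption made for this example; it is not a description of how `arrow_cast::cast` or a `DictionaryBuilder` behaves internally.

```rust
use std::collections::HashMap;

/// Toy dictionary-encoded string column (hypothetical stand-in for a `DictionaryArray`).
#[derive(Debug, PartialEq)]
struct DictColumn {
    /// `keys[i]` is an index into `values`.
    keys: Vec<usize>,
    /// Dictionary values; may contain entries that no key refers to.
    values: Vec<String>,
}

/// Expand ("hydrate") the dictionary into its plain logical values.
fn hydrate(col: &DictColumn) -> Vec<String> {
    col.keys.iter().map(|&k| col.values[k].clone()).collect()
}

/// Re-encode plain values into a dictionary, assigning keys in first-seen order.
fn reencode(flat: &[String]) -> DictColumn {
    let mut index: HashMap<&str, usize> = HashMap::new();
    let mut values = Vec::new();
    let mut keys = Vec::with_capacity(flat.len());
    for v in flat {
        // Key that will be assigned if `v` has not been seen before.
        let next = values.len();
        let key = *index.entry(v.as_str()).or_insert_with(|| {
            values.push(v.clone());
            next
        });
        keys.push(key);
    }
    DictColumn { keys, values }
}
```

In this toy model, a column with values `["a", "b", "unused"]` and keys `[1, 0, 1]` hydrates to `["b", "a", "b"]`, and re-encoding that gives values `["b", "a"]` with keys `[0, 1, 0]`: the logical contents match, but the physical dictionary does not. Recovering the original keys and values would require keeping them around, as the comment above suggests.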

ding-young · 2025-08-07T03:15:55Z

If that's the case (#8067 (comment)), would you mind taking a look when you get a chance, @tustvold?

Also, let me know if there's anything else I can help with, @LiaCastaneda :)

LiaCastaneda · 2025-08-07T06:39:51Z

Thanks for readdressing the PR @ding-young 🙇‍♀️

ding-young

I updated the unit tests to compare each value and check if the data (logically) remains the same. thanks @gabotechs @alamb

arrow-row/src/lib.rs

alamb · 2025-08-18T18:29:52Z

I updated the unit tests to compare each value and check if the data (logically) remains the same. thanks @gabotechs @alamb

Yes, I think keeping the values logically the same (with the same datatype) is what is desired

alamb · 2025-08-18T18:30:09Z

I am starting to check this PR out

alamb

Thank you @ding-young and @LiaCastaneda

After some more review, I think this PR follows the guidance of @tustvold on #7165 (comment)

all dictionaries are expanded on encode, so I would expect convert_rows to produce a List of Int32.

I am sorry I was confused earlier about the intended design. I left a few comments, but I think this PR is very close.

I also think it is important to improve the documentation to clarify the expected behavior in this case. I have created a proposed PR here:

#8168

arrow-row/src/lib.rs

alamb · 2025-08-18T18:44:47Z

arrow-row/src/lib.rs

+            vec![Box::new(builder)],
+        );
+        let dict_builder = struct_builder
+            .field_builder::<PrimitiveDictionaryBuilder<Int32Type, Int32Type>>(0)


Can we please change this so it uses different types for the keys and values -- otherwise it is hard to keep track of the key and values in the code below

I changed the dictionary type to Dictionary(Int32, Utf8) to improve readability. Let me know if anything in the test is still confusing or hard to follow.

arrow-row/src/list.rs

alamb

Thank you @ding-young and @LiaCastaneda

@tustvold

Which issue does this PR close?

- related to #7627
- Related to #4811

Rationale for this change

It was not clear to me what the expected behavior for round trip through row converter was for DictionaryArrays, so let's document what @tustvold says here: #8067 (comment)

> I think the issue is that Datafusion is not handling the fact that row encoding "hydrates" dictionaries. It should be updated to understand that List<Dictionary<...>> will be converted to List<...>, much like it already handles this for the non-nested case. Converting back to a dictionary is expensive, and likely pointless, not to mention a breaking change.

What changes are included in this PR?

Document expected behavior with English comments and doc test

Are these changes tested?

Yes (doctests)

Are there any user-facing changes?

More docs, no behavior change

github-actions bot added the arrow Changes to the arrow crate label Jun 8, 2025

alamb changed the title ~~Update field with child data type in decode_columns~~ Fix RowConverter panic with for deeply nested structures Jun 9, 2025

alamb reviewed Jun 12, 2025

alamb reviewed Jun 12, 2025

alamb marked this pull request as draft June 16, 2025 10:21

nirnayroy mentioned this pull request Aug 5, 2025

Query grouping by column with datatype List<Dictionary<(),()>> is failing apache/datafusion#17012

Closed

LiaCastaneda mentioned this pull request Aug 6, 2025

Fix RowConverter roundtrip for List of Dictionary #8067

Closed

update field with flattend data type in decode_columns

ed4c20e

ding-young force-pushed the fix-row-converter branch from 830a059 to ed4c20e Compare August 7, 2025 01:39

alamb marked this pull request as ready for review August 7, 2025 16:42

improve unit test to compare array contents

847703e

ding-young force-pushed the fix-row-converter branch from 210922c to 847703e Compare August 18, 2025 05:46

ding-young commented Aug 18, 2025

alamb mentioned this pull request Aug 18, 2025

arrow-row: Document dictionary handling #8168

Merged

alamb reviewed Aug 18, 2025

alamb changed the title ~~Fix RowConverter panic with for deeply nested structures~~ Fix RowConverter panic when encoding DictionaryArrays in StructArrays and ListArrays Aug 18, 2025

refactor test code

90c7139

ding-young requested a review from alamb August 19, 2025 05:43

alamb approved these changes Aug 19, 2025

alamb changed the title ~~Fix RowConverter panic when encoding DictionaryArrays in StructArrays and ListArrays~~ Fix RowConverter panic when encoding DictionaryArrays in StructArray / ListArray Aug 19, 2025

alamb merged commit 19b4458 into apache:main Aug 20, 2025
14 checks passed
