Skip to content

Conversation

@ding-young
Copy link
Contributor

@ding-young ding-young commented Jun 8, 2025

Which issue does this PR close?

Rationale for this change

Although RowConverter flattens data type on converting columns into rows, it builds array with original SortField which contains unflattened types when converting rows back into columns. Therefore, output array has inconsistent data type although it is actually flattened.

What changes are included in this PR?

When decoding columns, instead of using original SortField, it uses new field with updated data type of child ArrayData, which is flattened.

I've also considered alternative approaches like recursively modify all the fields on convert_rows or convert_raw, but considering that we already visit each field recursively, I just corrected the field in decode_column for simplicity.

I'd be happy to hear about the feedback on correctness of this pr and any other suggestions.

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 8, 2025
@alamb alamb changed the title Update field with child data type in decode_columns Fix RowConverter panic with for deeply nested structures Jun 9, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @ding-young

I am sorry I don't fully understand what is wrong and what this PR is proposing.

Although RowConverter flattens data type on converting columns into rows, it builds array with original SortField which contains unflattened types when converting rows back into columns.

I believe the intention is to build arrays with the original DataTypes.

Specifically, I expect the RowConverter to have the property that if you convert Arrays to Rows and then those Rows back to Arrays, the arrays are the same that went in

@alamb alamb marked this pull request as draft June 16, 2025 10:21
@alamb
Copy link
Contributor

alamb commented Jun 16, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@LiaCastaneda
Copy link

👋 @ding-young what is the status of this PR? do you mind if I give it a shot? it is causing errors in DataFusion apache/datafusion#17012

@ding-young
Copy link
Contributor Author

Hi @LiaCastaneda :)

I just ran the reproducer test, and it passes without error on the fix-row-converter branch.
image

However, this PR takes the approach of correcting the data type, rather than preserving the original dictionary encoding. So #7627 (comment) the resulting type is flattened, like mentioned here #7169 (comment).

I can easily follow up on the minor review comments in a day, but currently I don’t have time to rewrite the decoding logic to preserve the original encoding like @alamb suggested.

What kind of solution were you thinking of? If your main goal is just to suppress the warning and run the reproducer query, then I can quickly finish up this PR and ping you again. But if you’re planning to work on a solution that preserves the original dictionary encoding, feel free to take it over!

Let me know what you think :)

@LiaCastaneda
Copy link

Guiding myself on @alamb 's comment, I understand the correct solution is to preserve Dict encoding so I was looking into following that approach.

@LiaCastaneda
Copy link

LiaCastaneda commented Aug 6, 2025

I'm sketching a solution for this, and I'm wondering if we want to keep the exact same encoding, one thing that I think we could do is manually build a Dictionary using a DictionaryBuilder or perform a cast using arrow_cast::cast which does the same under the hood. This would keep the DataType but i'm not sure it will have the exact same encoding as the original Dict Array. If we want to keep the exact same Dict encoding then we would probably need to keep the original DictionaryArray values somewhere before/during encoding the Array to rows.

Update: I tried the cast approach here https://github.com/apache/arrow-rs/pull/8067/files and I believe the encoding would be the same, if not the test I included would not pass. I will try to open the PR later today or tomorrow

@ding-young
Copy link
Contributor Author

If #8067 (comment) that's the case, would you @tustvold mind taking a look when you get a chance?

Plus, let me know if there’s anything else I can help with :) @LiaCastaneda .

@LiaCastaneda
Copy link

Thanks for readressing the PR @ding-young 🙇‍♀️

@alamb alamb marked this pull request as ready for review August 7, 2025 16:42
Copy link
Contributor Author

@ding-young ding-young left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the unit tests to compare each value and check if the data (logically) remains the same. thanks @gabotechs @alamb

@alamb
Copy link
Contributor

alamb commented Aug 18, 2025

I updated the unit tests to compare each value and check if the data (logically) remains the same. thanks @gabotechs @alamb

Yes, I think keeping the values logically the same (with the same datatype) is what is desired

@alamb
Copy link
Contributor

alamb commented Aug 18, 2025

I am starting to check this PR out

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ding-young and @LiaCastaneda

After some more review, I think this PR follows the guidance of @tustvold on #7165 (comment)

all dictionaries are expanded on encode, so I would expect convert_rows to product a List of Int32.

I am sorry I was confused earlier about the intended design. I left a few comments, but I think this PR is very close.

I also think it is important to improve the documentation to clarify the expected behavior in this case. I have created a proposed PR here:

vec![Box::new(builder)],
);
let dict_builder = struct_builder
.field_builder::<PrimitiveDictionaryBuilder<Int32Type, Int32Type>>(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please change this so it uses different types for the keys and values -- otherwise it is hard to keep track of the key and values in the code below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the dictionary type to Dictionary(Int32, Utf8) to improve readability. Let me know if anything in the test is still confusing or hard to follow.

@alamb alamb changed the title Fix RowConverter panic with for deeply nested structures Fix RowConverter panic when encoding DictionaryArrays in StructArrays and ListArrays Aug 18, 2025
@ding-young ding-young requested a review from alamb August 19, 2025 05:43
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ding-young and @LiaCastaneda

@alamb alamb changed the title Fix RowConverter panic when encoding DictionaryArrays in StructArrays and ListArrays Fix RowConverter panic when encoding DictionaryArrays in StructArray / ListArray Aug 19, 2025
@alamb alamb merged commit 19b4458 into apache:main Aug 20, 2025
14 checks passed
alamb added a commit that referenced this pull request Aug 21, 2025
# Which issue does this PR close?

- related to #7627
- Related to #4811

# Rationale for this change

It was not clear to me what the expected behavior for round trip through
row converter was for DictionaryArrays, so let's document what @tustvold
says here:
#8067 (comment)

> I think the issue is that Datafusion is not handling the fact that row
encoding "hydrates" dictionaries. It should be updated to understand
that List<Dictionary<...>> will be converted to List<...>, much like it
already handles this for the non-nested case. Converting back to a
dictionary is expensive, and likely pointless, not to mention a breaking
change.


# What changes are included in this PR?

Document expected behavior with english comments and doc test

# Are these changes tested?

Yes (doctests)

# Are there any user-facing changes?

More docs, no behavior change
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RowConverter::convert_rows can return an invalid array if Struct contains Dictionary RowConverter::convert_rows panics when decoding List(Dictionary)

4 participants