Skip to content

Conversation

@liamzwbao
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Implement array shredding into List/LargeList/ListView/LargeListView to close the gaps in shred_variant. Part of the changes lay the groundwork for adding variant_get support for list types in a follow-up.

Are these changes tested?

Yes

Are there any user-facing changes?

New shredding types supported

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Nov 12, 2025
@liamzwbao liamzwbao marked this pull request as ready for review November 12, 2025 23:53
@liamzwbao
Copy link
Contributor Author

Hi @scovich, you might be interested

Comment on lines 286 to 291
enum ArrayVariantToArrowRowBuilder<'a> {
List(VariantToListArrowRowBuilder<'a, i32>),
LargeList(VariantToListArrowRowBuilder<'a, i64>),
ListView(VariantToListViewArrowRowBuilder<'a, i32>),
LargeListView(VariantToListViewArrowRowBuilder<'a, i64>),
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to move this into variant_to_arrow, and variant_get could get the support out of the box. Given the size of this PR, I would do that as a followup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!

@scovich
Copy link
Contributor

scovich commented Nov 17, 2025

Hi @scovich, you might be interested

Definitely interested, but a bit overbooked last week. Hoping I can get to it this week.

# Conflicts:
#	parquet-variant-compute/src/shred_variant.rs
match value {
Variant::List(list) => {
self.value_builder.append_null();
self.typed_value_builder.append_value(list)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double checking -- if I try to shred as List<i32> and I encounter a variant array [..., "hi", ...], the bad entry will either become NULL or cause an error, depending on cast options?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the hi will be located in typed_value.value

Copy link
Contributor Author

@liamzwbao liamzwbao Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the hi will be located in typed_value.value

that's right. For example, an variant array [1, 2, 3, 1, "two", Variant::Null] would become

typed_value: [1, 2, 3, 1, null, null]
value: [null, null, null, null, "two", Variant::Null]

Can refer to test_array_shredding_as_list as an example

Comment on lines 313 to 316
List(VariantToListArrowRowBuilder<'a, i32>),
LargeList(VariantToListArrowRowBuilder<'a, i64>),
ListView(VariantToListViewArrowRowBuilder<'a, i32>),
LargeListView(VariantToListViewArrowRowBuilder<'a, i64>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we introduce a ListLikeArrayBuilder trait (**) that encapsulates the (minimal) differences between these four types, so that ArrayVariantToArrowRowBuilder becomes a generic struct instead of an enum?

(**) c.f. StringLikeArrayBuilder that serves the same purpose for strings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick analysis suggests the trait needs:

  • An associated type: type Offset: OffsetSizeTrait
  • A constructor: fn try_new(...) -> Result<Self>
  • Helper functions to support append_null and append_value (nulls, offsets, etc)
  • A finisher: fn finish(self) -> Result<ArrayRef>

Two trait implementations (one for lists and one for list views), both generic over Offset

And from there, the outer builder should be able to implement its own logic just once instead of four times.

Double check tho -- the above is a very rough sketch. The goal is to minimize boilerplate and duplication, using a careful selection of trait methods that capture the essential differences between lists and list views.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to refactor the code using ListLikeArrayBuilder, but it doesn’t help much (draft PR). Plus, it would also force the builder into a boxed dyn trait, adding overhead versus the enum approach.

I think the current impl actaully follow what we did in varaint_to_arrow, the code structure looks like below

Variant (VariantToShreddedVariantRowBuilder)
├─ Primitive (VariantToShreddedPrimitiveVariantRowBuilder)
│  ├─ Int/Float group (PrimitiveVariantToArrowRowBuilder::VariantToPrimitiveArrowRowBuilder):
│  │    Null, Boolean, Int8/16/32/64, UInt8/16/32/64,
│  │    Float16/32/64, Decimal32/64/128/256
│  ├─ Time-like (PrimitiveVariantToArrowRowBuilder::VariantToTimestampArrowRowBuilder):
│  │    Date32, Time64(us), Timestamp(us/ns) [tz or ntz]
│  └─ String/Binary (PrimitiveVariantToArrowRowBuilder::VariantToStringArrowRowBuilder/VariantToBinaryArrowRowBuilder):
│       Utf8, Utf8View, Binary, BinaryView
├─ Array (VariantToShreddedArrayVariantRowBuilder)
│  ├─ List          (ArrayVariantToArrowRowBuilder::VariantToListArrowRowBuilder)
│  ├─ LargeList     (ArrayVariantToArrowRowBuilder::VariantToListArrowRowBuilder)
│  ├─ ListView      (ArrayVariantToArrowRowBuilder::VariantToListViewArrowRowBuilder)
│  └─ LargeListView (ArrayVariantToArrowRowBuilder::VariantToListViewArrowRowBuilder)
│      └─ each element uses make_variant_to_shredded_variant_arrow_row_builder recursively
└─ Object (VariantToShreddedObjectVariantRowBuilder)

StringLikeArrayBuilder and BinaryLikeArrayBuilder mainly simplify VariantToStringArrowRowBuilder and VariantToBinaryArrowRowBuilder but still need dispatch from PrimitiveVariantToArrowRowBuilder. In theory, ListLikeArrayBuilder could simplify nulls, offsets, and sizes via GenericList(View)Builder, but we’d need a lot of adapters, because element_builder is Box<VariantToShreddedVariantRowBuilder<'a>> and doesn’t implement ArrayBuilder directly to make the trait work in our case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we unify the VariantToListViewArrowRowBuilder and VariantToListArrowRowBuilder here if introducing ListLikeArrayBuilder, like PrimitiveVariantToArrowRowBuilder(VariantToPrimitiveArrowRowBuilder<'a, T: PrimitiveFromVariant>) ?

Copy link
Contributor

@scovich scovich Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringLikeArrayBuilder and BinaryLikeArrayBuilder mainly simplify VariantToStringArrowRowBuilder and VariantToBinaryArrowRowBuilder but still need dispatch from PrimitiveVariantToArrowRowBuilder. In theory, ListLikeArrayBuilder could simplify nulls, offsets, and sizes via GenericList(View)Builder, but we’d need a lot of adapters, because element_builder is Box<VariantToShreddedVariantRowBuilder<'a>> and doesn’t implement ArrayBuilder directly to make the trait work in our case.

I think the problem is the draft PR still left the hierarchical builder thing in place? Double enum dispatch? I thought the other -like traits eliminated the double dispatch. The top-level enum has an enum variant for each concrete type, but each one is just instantiating a different version of the same generic builder?

(I wouldn't have expected the inner builder to be related to this refactor)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emmm, I don't think there is a double enum dispatch in this PR, this is the only place where we match the DataType to specific builders. ArrayVariantToArrowRowBuilder is on the same level as PrimitiveVariantToArrowRowBuilder and ObjectVariantToArrowRowBuilder

In PrimitiveVariantToArrowRowBuilder (top-level enum):

String(VariantToStringArrowBuilder<'a, StringBuilder>),
LargeString(VariantToStringArrowBuilder<'a, LargeStringBuilder>),
StringView(VariantToStringArrowBuilder<'a, StringViewBuilder>),
Binary(VariantToBinaryArrowRowBuilder<'a, BinaryBuilder>),
LargeBinary(VariantToBinaryArrowRowBuilder<'a, LargeBinaryBuilder>),
BinaryView(VariantToBinaryArrowRowBuilder<'a, BinaryViewBuilder>),

it dispatches the builders for each concrete types.

And this PR did the same thing here in ArrayVariantToArrowRowBuilder, where this top-level enum dispatches 4 different list builders for each concrete list types.

BTW, I updated the draft PR to make ArrayVariantToArrowRowBuilder fairly slim, but it basically moves the match data_type from this builder to the caller side and does not reduce duplicate much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we unify the VariantToListViewArrowRowBuilder and VariantToListArrowRowBuilder here if introducing ListLikeArrayBuilder, like PrimitiveVariantToArrowRowBuilder(VariantToPrimitiveArrowRowBuilder<'a, T: PrimitiveFromVariant>) ?

Indeed we can if we make it the same way like StringLikeArrayBuilder/BinaryLikeArrayBuilder, then we don't need to maintain nulls, offsets, and sizes here. But it needs some adapters because element_builder is Box<VariantToShreddedVariantRowBuilder<'a>> and doesn’t implement ArrayBuilder directly to make the trait work in our case. I may use another PR to see if we can easily simplify the code here

match value {
Variant::List(list) => {
self.value_builder.append_null();
self.typed_value_builder.append_value(list)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the hi will be located in typed_value.value

@liamzwbao liamzwbao force-pushed the issue-8830-shred-list-variant branch from 25f9b5e to ca9fd49 Compare November 28, 2025 01:27
@liamzwbao
Copy link
Contributor Author

This is ready for review now. PTAL @scovich @klion26 @sdf-jkl, thanks!

Copy link
Member

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from my side, maybe we need to double-check the ListLikeArrayBuilder suggestion.

Comment on lines 313 to 316
List(VariantToListArrowRowBuilder<'a, i32>),
LargeList(VariantToListArrowRowBuilder<'a, i64>),
ListView(VariantToListViewArrowRowBuilder<'a, i32>),
LargeListView(VariantToListViewArrowRowBuilder<'a, i64>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we unify the VariantToListViewArrowRowBuilder and VariantToListArrowRowBuilder here if introducing ListLikeArrayBuilder, like PrimitiveVariantToArrowRowBuilder(VariantToPrimitiveArrowRowBuilder<'a, T: PrimitiveFromVariant>) ?

@liamzwbao
Copy link
Contributor Author

Hi @scovich, would appreciate if you could take another look!

@liamzwbao liamzwbao requested a review from scovich December 11, 2025 18:57
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General approach LGTM.

Two restructurings to consider (see comments):

  • Avoid #[cfg(test)] clutter in ListLikeArray trait
  • Eliminate duplication of the list builder vs. list view builder

Comment on lines 462 to 469
struct VariantToListViewArrowRowBuilder<'a, O: OffsetSizeTrait + ArrowNativeTypeOp> {
field: FieldRef,
offsets: Vec<O>,
sizes: Vec<O>,
element_builder: Box<VariantToShreddedVariantRowBuilder<'a>>,
nulls: NullBufferBuilder,
current_offset: O,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super similar to the variant list builder. AFAICT, the only meaningful difference is how offsets and sizes are handled? And for this specific use case, we could compute both from the normal offsets array, e.g.:

offsets: [0, a, b, ..., y, z]

translates in view form to

offsets: [0, a, b, ..., y]
sizes: [a-0, b-a, ..., z-y]

Thus, a single implementation (generic over const IS_VIEW: bool) could have a uniform implementation for everything except the final array build?

if IS_VIEW {
    let n = self.offsets.len();
    let sizes = Vec::with_capacity(n - 1)
    for i in 1..n {
        sizes.push(self.offsets[i] - self.offsets[i-1]);
    }
    self.offsets.pop(); // drop the now-unneeded extra entry
    // create a ListViewArray from offsets and sizes
} else {
    // create a ListArray from offsets
}

The only downside is allocating one extra element of capacity in case of a list view... but that seems highly unlikely to matter in practice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, we can merge these 2 builders with the new const

@liamzwbao liamzwbao force-pushed the issue-8830-shred-list-variant branch from 65d3637 to 6552d09 Compare December 12, 2025 19:06
@liamzwbao liamzwbao force-pushed the issue-8830-shred-list-variant branch from 6552d09 to dd9885d Compare December 12, 2025 19:14
@liamzwbao
Copy link
Contributor Author

liamzwbao commented Dec 12, 2025

Thank you for your advice, @scovich! I have updated the PR.

Also cc @alamb as we are close to a final review for this PR

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! A few code readability nits to consider before merge.

@alamb
Copy link
Contributor

alamb commented Dec 15, 2025

Thank you @liamzwbao @klion26 @scovich and @sdf-jkl -- this is quite a team effort.

@alamb alamb merged commit e49c2ed into apache:main Dec 15, 2025
17 checks passed
@liamzwbao liamzwbao deleted the issue-8830-shred-list-variant branch January 7, 2026 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support array shredding into List/LargeList/ListView/LargeListView

5 participants