-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Support array shredding into List/LargeList/ListView/LargeListView
#8831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @scovich, you might be interested |
| enum ArrayVariantToArrowRowBuilder<'a> { | ||
| List(VariantToListArrowRowBuilder<'a, i32>), | ||
| LargeList(VariantToListArrowRowBuilder<'a, i64>), | ||
| ListView(VariantToListViewArrowRowBuilder<'a, i32>), | ||
| LargeListView(VariantToListViewArrowRowBuilder<'a, i64>), | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to move this into variant_to_arrow, and variant_get could get the support out of the box. Given the size of this PR, I would do that as a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great!
Definitely interested, but a bit overbooked last week. Hoping I can get to it this week. |
# Conflicts: # parquet-variant-compute/src/shred_variant.rs
| match value { | ||
| Variant::List(list) => { | ||
| self.value_builder.append_null(); | ||
| self.typed_value_builder.append_value(list)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checking -- if I try to shred as List<i32> and I encounter a variant array [..., "hi", ...], the bad entry will either become NULL or cause an error, depending on cast options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the hi will be located in typed_value.value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the
hiwill be located intyped_value.value
that's right. For example, an variant array [1, 2, 3, 1, "two", Variant::Null] would become
typed_value: [1, 2, 3, 1, null, null]
value: [null, null, null, null, "two", Variant::Null]Can refer to test_array_shredding_as_list as an example
| List(VariantToListArrowRowBuilder<'a, i32>), | ||
| LargeList(VariantToListArrowRowBuilder<'a, i64>), | ||
| ListView(VariantToListViewArrowRowBuilder<'a, i32>), | ||
| LargeListView(VariantToListViewArrowRowBuilder<'a, i64>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we introduce a ListLikeArrayBuilder trait (**) that encapsulates the (minimal) differences between these four types, so that ArrayVariantToArrowRowBuilder becomes a generic struct instead of an enum?
(**) c.f. StringLikeArrayBuilder that serves the same purpose for strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick analysis suggests the trait needs:
- An associated type:
type Offset: OffsetSizeTrait - A constructor:
fn try_new(...) -> Result<Self> - Helper functions to support
append_nullandappend_value(nulls, offsets, etc) - A finisher:
fn finish(self) -> Result<ArrayRef>
Two trait implementations (one for lists and one for list views), both generic over Offset
And from there, the outer builder should be able to implement its own logic just once instead of four times.
Double check tho -- the above is a very rough sketch. The goal is to minimize boilerplate and duplication, using a careful selection of trait methods that capture the essential differences between lists and list views.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to refactor the code using ListLikeArrayBuilder, but it doesn’t help much (draft PR). Plus, it would also force the builder into a boxed dyn trait, adding overhead versus the enum approach.
I think the current impl actaully follow what we did in varaint_to_arrow, the code structure looks like below
Variant (VariantToShreddedVariantRowBuilder)
├─ Primitive (VariantToShreddedPrimitiveVariantRowBuilder)
│ ├─ Int/Float group (PrimitiveVariantToArrowRowBuilder::VariantToPrimitiveArrowRowBuilder):
│ │ Null, Boolean, Int8/16/32/64, UInt8/16/32/64,
│ │ Float16/32/64, Decimal32/64/128/256
│ ├─ Time-like (PrimitiveVariantToArrowRowBuilder::VariantToTimestampArrowRowBuilder):
│ │ Date32, Time64(us), Timestamp(us/ns) [tz or ntz]
│ └─ String/Binary (PrimitiveVariantToArrowRowBuilder::VariantToStringArrowRowBuilder/VariantToBinaryArrowRowBuilder):
│ Utf8, Utf8View, Binary, BinaryView
├─ Array (VariantToShreddedArrayVariantRowBuilder)
│ ├─ List (ArrayVariantToArrowRowBuilder::VariantToListArrowRowBuilder)
│ ├─ LargeList (ArrayVariantToArrowRowBuilder::VariantToListArrowRowBuilder)
│ ├─ ListView (ArrayVariantToArrowRowBuilder::VariantToListViewArrowRowBuilder)
│ └─ LargeListView (ArrayVariantToArrowRowBuilder::VariantToListViewArrowRowBuilder)
│ └─ each element uses make_variant_to_shredded_variant_arrow_row_builder recursively
└─ Object (VariantToShreddedObjectVariantRowBuilder)
StringLikeArrayBuilder and BinaryLikeArrayBuilder mainly simplify VariantToStringArrowRowBuilder and VariantToBinaryArrowRowBuilder but still need dispatch from PrimitiveVariantToArrowRowBuilder. In theory, ListLikeArrayBuilder could simplify nulls, offsets, and sizes via GenericList(View)Builder, but we’d need a lot of adapters, because element_builder is Box<VariantToShreddedVariantRowBuilder<'a>> and doesn’t implement ArrayBuilder directly to make the trait work in our case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify the VariantToListViewArrowRowBuilder and VariantToListArrowRowBuilder here if introducing ListLikeArrayBuilder, like PrimitiveVariantToArrowRowBuilder(VariantToPrimitiveArrowRowBuilder<'a, T: PrimitiveFromVariant>) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringLikeArrayBuilderandBinaryLikeArrayBuildermainly simplifyVariantToStringArrowRowBuilderandVariantToBinaryArrowRowBuilderbut still need dispatch fromPrimitiveVariantToArrowRowBuilder. In theory,ListLikeArrayBuildercould simplifynulls,offsets, andsizesviaGenericList(View)Builder, but we’d need a lot of adapters, becauseelement_builderis Box<VariantToShreddedVariantRowBuilder<'a>> and doesn’t implementArrayBuilderdirectly to make the trait work in our case.
I think the problem is the draft PR still left the hierarchical builder thing in place? Double enum dispatch? I thought the other -like traits eliminated the double dispatch. The top-level enum has an enum variant for each concrete type, but each one is just instantiating a different version of the same generic builder?
(I wouldn't have expected the inner builder to be related to this refactor)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emmm, I don't think there is a double enum dispatch in this PR, this is the only place where we match the DataType to specific builders. ArrayVariantToArrowRowBuilder is on the same level as PrimitiveVariantToArrowRowBuilder and ObjectVariantToArrowRowBuilder
In PrimitiveVariantToArrowRowBuilder (top-level enum):
arrow-rs/parquet-variant-compute/src/variant_to_arrow.rs
Lines 66 to 71 in d212713
| String(VariantToStringArrowBuilder<'a, StringBuilder>), | |
| LargeString(VariantToStringArrowBuilder<'a, LargeStringBuilder>), | |
| StringView(VariantToStringArrowBuilder<'a, StringViewBuilder>), | |
| Binary(VariantToBinaryArrowRowBuilder<'a, BinaryBuilder>), | |
| LargeBinary(VariantToBinaryArrowRowBuilder<'a, LargeBinaryBuilder>), | |
| BinaryView(VariantToBinaryArrowRowBuilder<'a, BinaryViewBuilder>), |
it dispatches the builders for each concrete types.
And this PR did the same thing here in ArrayVariantToArrowRowBuilder, where this top-level enum dispatches 4 different list builders for each concrete list types.
BTW, I updated the draft PR to make ArrayVariantToArrowRowBuilder fairly slim, but it basically moves the match data_type from this builder to the caller side and does not reduce duplicate much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify the
VariantToListViewArrowRowBuilderandVariantToListArrowRowBuilderhere if introducingListLikeArrayBuilder, likePrimitiveVariantToArrowRowBuilder(VariantToPrimitiveArrowRowBuilder<'a, T: PrimitiveFromVariant>) ?
Indeed we can if we make it the same way like StringLikeArrayBuilder/BinaryLikeArrayBuilder, then we don't need to maintain nulls, offsets, and sizes here. But it needs some adapters because element_builder is Box<VariantToShreddedVariantRowBuilder<'a>> and doesn’t implement ArrayBuilder directly to make the trait work in our case. I may use another PR to see if we can easily simplify the code here
| match value { | ||
| Variant::List(list) => { | ||
| self.value_builder.append_null(); | ||
| self.typed_value_builder.append_value(list)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the hi will be located in typed_value.value
25f9b5e to
ca9fd49
Compare
klion26
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from my side, maybe we need to double-check the ListLikeArrayBuilder suggestion.
| List(VariantToListArrowRowBuilder<'a, i32>), | ||
| LargeList(VariantToListArrowRowBuilder<'a, i64>), | ||
| ListView(VariantToListViewArrowRowBuilder<'a, i32>), | ||
| LargeListView(VariantToListViewArrowRowBuilder<'a, i64>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify the VariantToListViewArrowRowBuilder and VariantToListArrowRowBuilder here if introducing ListLikeArrayBuilder, like PrimitiveVariantToArrowRowBuilder(VariantToPrimitiveArrowRowBuilder<'a, T: PrimitiveFromVariant>) ?
|
Hi @scovich, would appreciate if you could take another look! |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General approach LGTM.
Two restructurings to consider (see comments):
- Avoid
#[cfg(test)]clutter in ListLikeArray trait - Eliminate duplication of the list builder vs. list view builder
| struct VariantToListViewArrowRowBuilder<'a, O: OffsetSizeTrait + ArrowNativeTypeOp> { | ||
| field: FieldRef, | ||
| offsets: Vec<O>, | ||
| sizes: Vec<O>, | ||
| element_builder: Box<VariantToShreddedVariantRowBuilder<'a>>, | ||
| nulls: NullBufferBuilder, | ||
| current_offset: O, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super similar to the variant list builder. AFAICT, the only meaningful difference is how offsets and sizes are handled? And for this specific use case, we could compute both from the normal offsets array, e.g.:
offsets: [0, a, b, ..., y, z]
translates in view form to
offsets: [0, a, b, ..., y]
sizes: [a-0, b-a, ..., z-y]
Thus, a single implementation (generic over const IS_VIEW: bool) could have a uniform implementation for everything except the final array build?
if IS_VIEW {
let n = self.offsets.len();
let sizes = Vec::with_capacity(n - 1)
for i in 1..n {
sizes.push(self.offsets[i] - self.offsets[i-1]);
}
self.offsets.pop(); // drop the now-unneeded extra entry
// create a ListViewArray from offsets and sizes
} else {
// create a ListArray from offsets
}The only downside is allocating one extra element of capacity in case of a list view... but that seems highly unlikely to matter in practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, we can merge these 2 builders with the new const
# Conflicts: # parquet-variant-compute/src/shred_variant.rs
65d3637 to
6552d09
Compare
6552d09 to
dd9885d
Compare
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! A few code readability nits to consider before merge.
|
Thank you @liamzwbao @klion26 @scovich and @sdf-jkl -- this is quite a team effort. |
Which issue does this PR close?
List/LargeList/ListView/LargeListView#8830.Rationale for this change
What changes are included in this PR?
Implement array shredding into
List/LargeList/ListView/LargeListViewto close the gaps inshred_variant. Part of the changes lay the groundwork for addingvariant_getsupport for list types in a follow-up.Are these changes tested?
Yes
Are there any user-facing changes?
New shredding types supported