-
Notifications
You must be signed in to change notification settings - Fork 1.1k
ArrayData Enumeration for Primitive, Binary and UTF8 #3749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
arrow-buffer/src/buffer/boolean.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow-data/src/data/bytes.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This two-level formulation is somewhat arcane, but allows BytesArrayData to be typed solely on actual physical types, without needing a logical ByteArrayType.
arrow-data/src/data/types.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mirrors what the eventual ArrayData enumeration will look like, with each of these variants corresponding to an ArrayData variant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only used for validating physical type when constructing typed ArrayData like PrimitiveArrayData.new does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I anticipate it being used in other safe array data constructors, but yes this is the only current use-case
07c79dc to
c1527b5
Compare
|
|
||
| /// Cast [`BytesArrayData`] to [`ArrayDataBytesOffset`] | ||
| fn upcast<B: Bytes + ?Sized>( | ||
| v: BytesArrayData<Self, B>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What difference between ArrayDataBytes and BytesArrayData? Seems ArrayDataBytes is the data of bytes types (binary and str). But BytesArrayData?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it after reading the enums and struct BytesArrayData below. Maybe a descriptive comment at the top helpful.
| pub trait Primitive: private::PrimitiveSealed + ArrowNativeType { | ||
| const VARIANT: PrimitiveType; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this trait and the const need document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add some high-level documentation of the ArrayData "pattern" when I switch ArrayData over (and make these public)
| assert!( | ||
| matches!(physical, PhysicalType::Primitive(p) if p == T::VARIANT), | ||
| "Illegal physical type for PrimitiveArrayData, expected {:?} got {:?}", | ||
| T::VARIANT, | ||
| physical | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assert!( | |
| matches!(physical, PhysicalType::Primitive(p) if p == T::VARIANT), | |
| "Illegal physical type for PrimitiveArrayData, expected {:?} got {:?}", | |
| T::VARIANT, | |
| physical | |
| ); | |
| assert!( | |
| matches!(physical, PhysicalType::Primitive(p) if p == T::VARIANT), | |
| "Illegal physical type for PrimitiveArrayData of datatype {:?}, expected {:?} got {:?}", | |
| data_type, | |
| T::VARIANT, | |
| physical | |
| ); |
| pub enum OffsetType { | ||
| Small, | ||
| Large, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, as this is for physical type, should it be something like Int32, Int64?
arrow-data/src/data/types.rs
Outdated
| DataType::Binary => Self::Bytes(OffsetType::Small, BytesType::Binary), | ||
| DataType::FixedSizeBinary(_) => Self::FixedSizeBinary, | ||
| DataType::LargeBinary => Self::Bytes(OffsetType::Large, BytesType::Binary), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to keep variable length binary close.
| DataType::Binary => Self::Bytes(OffsetType::Small, BytesType::Binary), | |
| DataType::FixedSizeBinary(_) => Self::FixedSizeBinary, | |
| DataType::LargeBinary => Self::Bytes(OffsetType::Large, BytesType::Binary), | |
| DataType::Binary => Self::Bytes(OffsetType::Small, BytesType::Binary), | |
| DataType::LargeBinary => Self::Bytes(OffsetType::Large, BytesType::Binary), | |
| DataType::FixedSizeBinary(_) => Self::FixedSizeBinary, |
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good start.
|
Benchmark runs are scheduled for baseline = 96791ea and contender = dae7a71. dae7a71 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Part of #1799
Relates to #1176
Closes #1802
Rationale for this change
This starts the process of fleshing out what the elements of the
ArrayDataenumeration will consist of, with implementations for primitive and byte arrays. I wanted to get something up for review to get feedback on the general direction, and to also avoid a single massive PR.What changes are included in this PR?
This PR adds enumerations
ArrayDataPrimitiveandArrayDataBytes, that consist of the enumerable variants of the genericPrimitiveArrayDataandBytesArrayData.These provide for an idiomatic way to enumerate the possible array types, with the view that this would eventually replace the current trait object approach.
Additionally
downcast/downcast_ref/Fromare provided to facilitate going from the enumeration to/from the generic form. This is critical to allowing types such asPrimitiveArrayto be wrappers aroundPrimitiveArrayDatawhilst being constructable from anArrayDataenumeration containingArrayDataPrimitive.I anticipate functionality gradually being moved onto these ArrayData abstractions, to allow code-sharing with arrow2, and to reduce monomorphisation overheads for PrimitiveArray, but I've purposefully kept them simple for now
Are there any user-facing changes?