All `arrow2` arrays are defined roughly as follows:
```rust
pub struct Array<T> {
    data_type: DataType,
    values: Buffer<T>,
    validity: Option<Bitmap>,
}
```
When you clone/slice/index an `Array`, you get another `Array` in roughly O(1), thanks to both the values buffer and the validity bitmap being refcounted behind the scenes:
```rust
pub struct Buffer<T> {
    data: Arc<Bytes<T>>,
    offset: usize,
    length: usize,
}

pub struct Bitmap {
    bytes: Arc<Bytes<u8>>,
    offset: usize,
    length: usize,
    unset_bits: usize,
}
```
Well... not really. It turns out the `DataType` is *not* refcounted, and it can get huge: it's a massive heap-recursive enum, potentially filled with strings and the like.
Say you have a `ListArray` that contains a bunch of `StructArray`s (i.e. a column of component data), and you want to extract references to the individual `StructArray`s in that list (i.e. the individual `DataCell`s): each of these arrays now carries a full copy of the `StructArray`'s schema.
For tiny `DataCell`s (which are very common in Rerun), the overhead is enormous.
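The cost difference can be sketched with a hypothetical mini `DataType` (a toy stand-in, not the real `arrow2` enum): cloning the enum deep-copies every field-name string and nested vector, while wrapping it in an `Arc` reduces the copy to a refcount bump.

```rust
use std::sync::Arc;

// Toy stand-in for a heap-recursive schema enum; the real arrow2
// DataType has many more variants, all similarly heap-heavy.
#[derive(Clone)]
enum DataType {
    UInt8,
    Struct(Vec<(String, DataType)>), // field names + child types on the heap
    List(Box<DataType>),
}

fn main() {
    // A list of structs with many named fields: the schema alone is sizeable.
    let fields: Vec<(String, DataType)> = (0..128)
        .map(|i| (format!("field_{i}"), DataType::UInt8))
        .collect();
    let schema = DataType::List(Box::new(DataType::Struct(fields)));

    // Without refcounting, every extracted child array deep-copies the whole
    // schema: all 128 field-name strings get reallocated, per clone.
    let per_cell_copy = schema.clone();
    drop(per_cell_copy);

    // Refcounted, the same "copy" is a pointer bump, just like Buffer/Bitmap.
    let shared = Arc::new(schema);
    let cheap = Arc::clone(&shared);
    assert_eq!(Arc::strong_count(&shared), 2);
    drop(cheap);
}
```

This is why the deep clone hurts most for tiny cells: the per-clone cost is proportional to the size of the schema, not the size of the data.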