Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I think this crate has pretty good stories for operating on individual columns, either by downcasting to a concrete type, or invoking a dyn kernel.
The stories for multi-column operations are substantially weaker, with patchy support for common multi-column operations such as sorts, groupings, aggregations, and reassembly. We have some pieces, such as MutableArrayData and DynComparator, but they are neither especially performant, as they make extensive use of dynamic dispatch at the row level, nor easy to use.
Describe the solution you'd like
Having a first-class row representation will not only allow us to implement more performant versions of existing kernels such as lexsort, but also provide a pretty compelling primitive to downstreams with which to implement more advanced operations such as streaming merges, joins, aggregates, etc... There is also precedent, with the C++ arrow library providing its own row format.
Goals
- Each row should be encoded as a single sequence of bytes
- Comparison of the byte arrays should be sufficient to establish ordering of the rows
- It should be possible to convert a selection of rows back to arrays
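To make the goals above concrete, here is a minimal sketch of an order-preserving row encoding for two nullable `i32` columns. It is an illustration only, not the proposed format: the function names (`encode_i32`, `decode_i32`, `encode_rows`, `decode_rows`), the one-byte validity marker, and the choice to sort nulls first are all assumptions made for this example.

```rust
/// Append one nullable i32 to `out` so that bytewise comparison matches numeric order.
/// Layout (an assumption of this sketch): 1 validity byte (0 = null, 1 = valid),
/// then 4 big-endian bytes with the sign bit flipped. Nulls therefore sort first.
fn encode_i32(value: Option<i32>, out: &mut Vec<u8>) {
    match value {
        None => out.extend_from_slice(&[0u8; 5]),
        Some(v) => {
            out.push(1);
            // Flipping the sign bit maps i32::MIN..=i32::MAX onto 0..=u32::MAX in order,
            // and big-endian puts the most significant byte first.
            out.extend_from_slice(&((v as u32) ^ 0x8000_0000).to_be_bytes());
        }
    }
}

/// Inverse of `encode_i32` for a 5-byte slice.
fn decode_i32(bytes: &[u8]) -> Option<i32> {
    if bytes[0] == 0 {
        return None;
    }
    let raw = u32::from_be_bytes([bytes[1], bytes[2], bytes[3], bytes[4]]);
    Some((raw ^ 0x8000_0000) as i32)
}

/// Encode two equal-length columns into one byte sequence per row.
fn encode_rows(a: &[Option<i32>], b: &[Option<i32>]) -> Vec<Vec<u8>> {
    a.iter()
        .zip(b)
        .map(|(x, y)| {
            let mut row = Vec::with_capacity(10);
            encode_i32(*x, &mut row);
            encode_i32(*y, &mut row);
            row
        })
        .collect()
}

/// Convert a selection of rows back into two columns.
fn decode_rows(rows: &[Vec<u8>]) -> (Vec<Option<i32>>, Vec<Option<i32>>) {
    rows.iter()
        .map(|r| (decode_i32(&r[0..5]), decode_i32(&r[5..10])))
        .unzip()
}
```

With this layout, sorting the `Vec<Vec<u8>>` returned by `encode_rows` with plain `sort()` is a lexicographic sort over both columns, with no per-row dynamic dispatch, and `decode_rows` recovers the columns in sorted order. Variable-length types such as strings would need a more careful scheme than this fixed-width sketch covers.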
Non-Goals
- Support introspection or mutation of the row values
- Provide a stable encoding for FFI, IO, etc...
- Provide an "optimal" encoding; the aim is rather a reasonable out-of-the-box baseline for common use-cases
Describe alternatives you've considered
We could extend the row format in DataFusion; however, this would limit its benefits to DataFusion. I think a row-oriented representation is such a fundamental primitive that it makes sense to include it in arrow-rs, so that it can be used both in its kernels and by downstreams that don't make use of DataFusion.
Additional context