Add MutableArrayData::extend_ranges #1229

@tustvold

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

MutableArrayData is created with one or more ArrayData and can be used to copy rows from the source arrays to a destination array. It does this by constructing the following two boxed closures for each source array, which copy a range of values from that array's null mask and data, respectively.

type ExtendNullBits<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>;
type Extend<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize, usize) + 'a>;

It also constructs a single closure of the following type:

type ExtendNulls = Box<dyn Fn(&mut _MutableArrayData, usize)>;

This can be used to append null values to the in-progress array.

Users don't call these boxed functions directly, but instead call MutableArrayData::extend or MutableArrayData::extend_nulls, which in turn call the appropriate functions.
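As a rough illustration of the dispatch pattern described above, here is a minimal std-only sketch. It is a toy model, not the arrow crate's code: `ToyMutable` and `freeze` as written here are hypothetical stand-ins, a source array is modelled as a slice of `Option<i32>` (so the null mask and the data are folded into one closure), and the closure signature is simplified.

```rust
// Toy model only: this is not the arrow crate's implementation.
// A "source array" is a slice of Option<i32>; None plays the role of a null.

/// Copies `len` values starting at `start` from one captured source.
type Extend<'a> = Box<dyn Fn(&mut Vec<Option<i32>>, usize, usize) + 'a>;

/// Hypothetical stand-in for MutableArrayData.
pub struct ToyMutable<'a> {
    out: Vec<Option<i32>>,
    extend_values: Vec<Extend<'a>>,
}

impl<'a> ToyMutable<'a> {
    /// Build one boxed closure per source, each capturing its own slice.
    pub fn new(sources: &[&'a [Option<i32>]]) -> Self {
        let extend_values = sources
            .iter()
            .map(|&src| {
                Box::new(move |out: &mut Vec<Option<i32>>, start: usize, len: usize| {
                    out.extend_from_slice(&src[start..start + len]);
                }) as Extend<'a>
            })
            .collect();
        Self { out: Vec::new(), extend_values }
    }

    /// Copy `sources[index][start..end]`; one dynamic call per range.
    pub fn extend(&mut self, index: usize, start: usize, end: usize) {
        (self.extend_values[index])(&mut self.out, start, end - start);
    }

    /// Append `len` nulls to the in-progress output.
    pub fn extend_nulls(&mut self, len: usize) {
        self.out.extend(std::iter::repeat(None).take(len));
    }

    /// Finish building and return the output.
    pub fn freeze(self) -> Vec<Option<i32>> {
        self.out
    }
}
```

In this model every call to `extend` goes through exactly one boxed closure call, regardless of how many values the range covers.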

This works really well for kernels such as concat, which call MutableArrayData with large ranges. However, it performs poorly in kernels such as take and filter, where the contiguous ranges may be very small and each range requires its own call through the boxed functions.

Edit: The take kernel in fact has custom implementations for each array type, likely because using MutableArrayData would be painfully slow. Perhaps with this we could unify the implementations 🤔

Describe the solution you'd like

Modify the signatures of these functions to take a slice of ranges, and add:

MutableArrayData::extend_ranges(&mut self, index: usize, ranges: &[Range<usize>])

This will not only amortise the cost of calling the extend functions across many ranges, but will also allow implementations to do more performant gather operations where possible.
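To make the amortisation argument concrete, here is a small std-only sketch of what a range-slice variant could look like. Everything in it is an assumption for illustration: `ToyMutable`, `ExtendRanges`, and `build_extend_ranges` are hypothetical names, the destination is a plain `Vec<i32>` with no null handling, and the real change would have to cover every array type.

```rust
use std::ops::Range;

// Toy model only: the destination is a plain Vec<i32>, nulls are ignored.

/// Hypothetical closure type: copies a whole batch of ranges per call.
type ExtendRanges<'a> = Box<dyn Fn(&mut Vec<i32>, &[Range<usize>]) + 'a>;

/// Build the batch-copy closure for one source slice.
pub fn build_extend_ranges<'a>(src: &'a [i32]) -> ExtendRanges<'a> {
    Box::new(move |out: &mut Vec<i32>, ranges: &[Range<usize>]| {
        // Knowing all ranges up front allows a single reservation.
        let total: usize = ranges.iter().map(|r| r.len()).sum();
        out.reserve(total);
        for r in ranges {
            out.extend_from_slice(&src[r.clone()]);
        }
    })
}

/// Hypothetical stand-in for MutableArrayData.
pub struct ToyMutable<'a> {
    out: Vec<i32>,
    extend_ranges: Vec<ExtendRanges<'a>>,
}

impl<'a> ToyMutable<'a> {
    pub fn new(sources: &[&'a [i32]]) -> Self {
        let extend_ranges = sources.iter().map(|&s| build_extend_ranges(s)).collect();
        Self { out: Vec::new(), extend_ranges }
    }

    /// One dynamic call for the whole slice of ranges,
    /// instead of one dynamic call per range.
    pub fn extend_ranges(&mut self, index: usize, ranges: &[Range<usize>]) {
        (self.extend_ranges[index])(&mut self.out, ranges);
    }

    pub fn freeze(self) -> Vec<i32> {
        self.out
    }
}
```

The loop inside the closure is statically dispatched, so the boxed call is paid once per batch. A specialised implementation could replace that loop with a gather tailored to the element type.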

Additional context

The Filter returned by build_filter, which is used when filtering a record batch with more than one column, already computes a Vec of ranges, so passing them to extend_ranges would be effectively free.
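For readers unfamiliar with how a filter yields ranges, the sketch below converts a boolean mask into the contiguous runs of selected rows. It is a simplified illustration: `mask_to_ranges` is a hypothetical helper operating on `&[bool]`, and it makes no claim about how build_filter is actually implemented.

```rust
use std::ops::Range;

/// Hypothetical helper: collapse a boolean mask into the half-open
/// ranges of consecutive `true` positions.
pub fn mask_to_ranges(mask: &[bool]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    // Start index of the run currently being scanned, if any.
    let mut start: Option<usize> = None;
    for (i, &keep) in mask.iter().enumerate() {
        match (keep, start) {
            // A run of selected rows begins here.
            (true, None) => start = Some(i),
            // The current run ended at the previous index.
            (false, Some(s)) => {
                ranges.push(s..i);
                start = None;
            }
            // Either still inside a run or still between runs.
            _ => {}
        }
    }
    // Close a run that extends to the end of the mask.
    if let Some(s) = start {
        ranges.push(s..mask.len());
    }
    ranges
}
```

A sparse or alternating mask produces many short ranges, which is the case where passing the whole Vec in a single call matters most.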

Labels: enhancement (any new improvement worthy of an entry in the changelog)
