-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
DataFusion uses BooleanBuffer in several places to create Null buffers. I thought there was a clever optimization for handling data with no nulls which I filed in arrow-rs
However, @tustvold pointed out that NullBufferBuilder has exactly the optimization described:
I looked at the DataFusion codebase and found we have several examples of using BooleanBufferBuilder rather than NullBufferBuilder:
https://github.com/search?q=repo%3Aapache%2Fdatafusion%20BooleanBufferBuilder&type=code
It even has a reimplementation of the NullBufferBuilder optimization 🤦 :
datafusion/datafusion/physical-plan/src/aggregates/group_values/null_builder.rs
Lines 20 to 32 in 63b94c8
| /// Builder for an (optional) null mask | |
| /// | |
| /// Optimized for avoid creating the bitmask when all values are non-null | |
| #[derive(Debug)] | |
| pub(crate) enum MaybeNullBufferBuilder { | |
| /// seen `row_count` rows but no nulls yet | |
| NoNulls { row_count: usize }, | |
| /// have at least one null value | |
| /// | |
| /// Note this is an Arrow *VALIDITY* buffer (so it is false for nulls, true | |
| /// for non-nulls) | |
| Nulls(BooleanBufferBuilder), | |
| } |
Describe the solution you'd like
I would like to switch DataFusion to using NullBufferBuilder instead of BooleanBufferBuilder as much as possible
Note that until the following PR is availble, this will involve adding an explicit dependency on arrow_buffer
Describe alternatives you've considered
No response
Additional context
No response