chore(rust/sedona-spatial-join): More accurate batch in-memory size estimation #515
Conversation
Pull request overview
This PR improves memory usage estimation accuracy for spatial join operations by implementing custom functions to calculate the in-memory sizes of Arrow record batches and arrays, replacing the built-in `get_array_memory_size` method, which can produce inaccurate results for spilled batches.
Changes:
- Added a new utility module `arrow_utils.rs` with custom memory size estimation functions
- Updated memory size calculation methods to return `Result<usize>` instead of `usize` to handle potential errors
- Modified all call sites to propagate errors from the new estimation functions
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| rust/sedona-spatial-join/src/utils/arrow_utils.rs | Implements custom memory size estimation functions for RecordBatch and Array types, including special handling for BinaryView and Utf8View types |
| rust/sedona-spatial-join/src/utils.rs | Exports the new arrow_utils module |
| rust/sedona-spatial-join/src/operand_evaluator.rs | Updates in_mem_size() to return Result and use new estimation function |
| rust/sedona-spatial-join/src/index/spatial_index_builder.rs | Updates add_batch() to handle Result return type and propagate errors |
| rust/sedona-spatial-join/src/index/spatial_index.rs | Updates test code to handle new error-returning add_batch() |
| rust/sedona-spatial-join/src/index/build_side_collector.rs | Updates memory size calculation to handle Result type |
| rust/sedona-spatial-join/src/evaluated_batch.rs | Updates in_mem_size() to return Result and use new estimation function |
```rust
/// Estimate the in-memory size of a given Arrow array. This function estimates the
/// size as if the underlying buffers were copied to somewhere else and not shared.
pub(crate) fn get_array_memory_size(array: &ArrayRef) -> Result<usize> {
    let array_data = array.to_data();
    let size = count_array_data_memory_size(&array_data)?;
    Ok(size)
}
```
It is already using get_slice_memory_size under the hood. However, get_slice_memory_size does not take the buffers of binary/utf8 views into account; count_array_data_memory_size is meant to fix that.
```rust
let mut result: usize = 0;
let array_data_type = array_data.data_type();

if matches!(array_data_type, DataType::BinaryView | DataType::Utf8View) {
```
This seems like it is missing a branch for non BinaryView types. I imagine we aren't trying to ignore the memory occupied by non-view types?
Non-view types were already taken care of by get_slice_memory_size.
Got it! I suggested a comment so that future me also doesn't think that 🙂
A comment was added to get_array_data_memory_size.
```rust
// | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 |
// |-----------|-----------|------------|-------------|
// | length    | prefix    | buf. index | offset      |
let views = &array_data.buffer::<u128>(0)[..array_data.len()];
```
Should this take into account the offset?
Suggested change:
```diff
- let views = &array_data.buffer::<u128>(0)[..array_data.len()];
+ let views = &array_data.buffer::<u128>(0)[array_data.offset()..array_data.len()];
```
This does not look correct to me. ArrayData::buffer has already taken offset into consideration: https://github.com/apache/arrow-rs/blob/57.2.0/arrow-data/src/data.rs#L580-L588
I'll add the test cases for buffers not starting at 0 anyway.
```rust
    "Long string that is definitely longer than 12 bytes",
    "Another long string to make buffer larger",
]);
let sliced = array.slice(0, 2);
```
We can also test array.slice(2, 1) to test the non-zero offset case
array.slice(2, 1) is invalid since the array contains only 2 rows. I have tested array.slice(1, 1) and array.slice(2, 0) instead.
```rust
let array_ref: ArrayRef = Arc::new(struct_array);
let size = get_array_memory_size(&array_ref).unwrap();
// 83 (StringView) + 1 (Boolean values) = 84
assert_eq!(size, 84);
```
I think array_ref.slice(0, 1) will give the wrong answer with the current code (the slice is not projected into the children)
What do you think about this test case? (Also OK to punt with a comment in the appropriate location that we are not calculating correct memory usage of sliced lists/structs?)
I have added array_ref.slice(0, 1) test below and it gave correct result.
Ah, arrow-rs propagates the slice for a StructArray on slice(): https://docs.rs/arrow-array/57.2.0/src/arrow_array/array/struct_array.rs.html#337
It doesn't do this for list arrays: https://docs.rs/arrow-array/57.2.0/src/arrow_array/array/list_array.rs.html#396
...and possibly not for some other types like dictionaries, list views, or run-end-encoded. I think it's OK to punt on all of that but perhaps open an issue and link to it here in case this comes up.
I'd like to leave get_array_data_memory_size as is and log a ticket. #519
And also added an XFAIL test case for list array slices.
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
I have added the requested tests and also renamed the private function
paleolimbot left a comment:
Apologies for the nits on offset/length (I still have nightmares from implementing nested offsets in nanoarrow). I think this is still not right for some cases but feel free to file an issue and follow up later.
This patch improves the accuracy of memory usage estimation by implementing our own functions for estimating the in-memory sizes of record batches and Arrow arrays.

The rationale is similar to apache/datafusion#13377. If we don't roll our own memory usage estimation function but call `RecordBatch::get_array_memory_size` instead, we'll get insanely inaccurate numbers for spilled batches read using `arrow::ipc::reader::StreamReader`.

Future work: use the memory pool API of arrow-rs for more accurate memory usage accounting. See apache/arrow-rs#8137.