Skip to content

Conversation

@Kontinuation
Copy link
Member

This patch improves the accuracy of memory usage estimation by implementing our own functions for estimating the in-memory sizes of record batches and arrow arrays.

The rationale is similar to apache/datafusion#13377. If we don't roll our own memory usage estimation function but call RecordBatch::get_array_memory_size instead, we'll get insanely inaccurate numbers for spilled batches read using arrow::ipc::reader::StreamReader.

Future work: use the memory pool API of arrow-rs for more accurate memory usage accounting. See apache/arrow-rs#8137.

@Kontinuation Kontinuation requested a review from Copilot January 14, 2026 15:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves memory usage estimation accuracy for spatial join operations by implementing custom functions to calculate in-memory sizes of Arrow record batches and arrays, replacing the built-in get_array_memory_size method which can produce inaccurate results for spilled batches.

Changes:

  • Added new utility module arrow_utils.rs with custom memory size estimation functions
  • Updated memory size calculation methods to return Result<usize> instead of usize to handle potential errors
  • Modified all call sites to propagate errors from the new estimation functions

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
rust/sedona-spatial-join/src/utils/arrow_utils.rs Implements custom memory size estimation functions for RecordBatch and Array types, including special handling for BinaryView and Utf8View types
rust/sedona-spatial-join/src/utils.rs Exports the new arrow_utils module
rust/sedona-spatial-join/src/operand_evaluator.rs Updates in_mem_size() to return Result and use new estimation function
rust/sedona-spatial-join/src/index/spatial_index_builder.rs Updates add_batch() to handle Result return type and propagate errors
rust/sedona-spatial-join/src/index/spatial_index.rs Updates test code to handle new error-returning add_batch()
rust/sedona-spatial-join/src/index/build_side_collector.rs Updates memory size calculation to handle Result type
rust/sedona-spatial-join/src/evaluated_batch.rs Updates in_mem_size() to return Result and use new estimation function

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Kontinuation Kontinuation marked this pull request as ready for review January 14, 2026 16:34
Comment on lines 36 to 42
/// Estimate the in-memory size of a given Arrow array. This function estimates the
/// size as if the underlying buffers were copied to somewhere else and not shared.
pub(crate) fn get_array_memory_size(array: &ArrayRef) -> Result<usize> {
let array_data = array.to_data();
let size = count_array_data_memory_size(&array_data)?;
Ok(size)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already using get_slice_memory_size under the hood. However, get_slice_memory_size does not take buffers of binary/utf8 views into account, count_array_data_memory_size is meant to fix that.

let mut result: usize = 0;
let array_data_type = array_data.data_type();

if matches!(array_data_type, DataType::BinaryView | DataType::Utf8View) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it is missing a branch for non BinaryView types. I imagine we aren't trying to ignore the memory occupied by non-view types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-view types were already taken care by get_slice_memory_size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I suggested a comment so that future me also doesn't think that 🙂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment was added to get_array_data_memory_size.

// | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 |
// |------------|------------|------------|-------------|
// | length | prefix | buf. index | offset |
let views = &array_data.buffer::<u128>(0)[..array_data.len()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this take into account the offset?

Suggested change
let views = &array_data.buffer::<u128>(0)[..array_data.len()];
let views = &array_data.buffer::<u128>(0)[array_data.offset()..array_data.len()];

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look correct to me. ArrayData::buffer has already taken offset into consideration: https://github.com/apache/arrow-rs/blob/57.2.0/arrow-data/src/data.rs#L580-L588

I'll add the test cases for buffers not starting at 0 anyway.

"Long string that is definitely longer than 12 bytes",
"Another long string to make buffer larger",
]);
let sliced = array.slice(0, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also test array.slice(2, 1) to test the non-zero offset case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array.slice(2, 1) is invalid since the array contains only 2 rows. I have tested array.slice(1, 1) and array.slice(2, 0) instead.

Comment on lines +145 to +148
let array_ref: ArrayRef = Arc::new(struct_array);
let size = get_array_memory_size(&array_ref).unwrap();
// 83 (StringView) + 1 (Boolean values) = 84
assert_eq!(size, 84);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think array_ref.slice(0, 1) will give the wrong answer with the current code (the slice is not projected into the children)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about this test case? (Also OK to punt with a comment in the appropriate location that we arenot calculating correct memory usage of sliced lists/structs?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added array_ref.slice(0, 1) test below and it gave correct result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, arrow-rs propagates the slice for a StructArray on slice(): https://docs.rs/arrow-array/57.2.0/src/arrow_array/array/struct_array.rs.html#337

It doesn't do this for list arrays: https://docs.rs/arrow-array/57.2.0/src/arrow_array/array/list_array.rs.html#396

...and possibly not for some other types like dictionaries, list views, or run-end-encoded. I think it's OK to punt on all of that but perhaps open an issue and link to it here in case this comes up.

Copy link
Member Author

@Kontinuation Kontinuation Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to leave get_array_data_memory_size as is and log a ticket. #519

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also added a XFAIL test case for list array slices.

let mut result: usize = 0;
let array_data_type = array_data.data_type();

if matches!(array_data_type, DataType::BinaryView | DataType::Utf8View) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I suggested a comment so that future me also doesn't think that 🙂

Kontinuation and others added 3 commits January 15, 2026 12:35
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
@Kontinuation
Copy link
Member Author

I have added the requested tests and also renamed the private function count_array_data_memory_size to get_array_data_memory_size since the original name looked awful.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the nits on offset/length (I still have nightmares from implementing nested offsets in nanoarrow). I think this is still not right for some cases but feel free to file an issue and follow up later.

Comment on lines +145 to +148
let array_ref: ArrayRef = Arc::new(struct_array);
let size = get_array_memory_size(&array_ref).unwrap();
// 83 (StringView) + 1 (Boolean values) = 84
assert_eq!(size, 84);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, arrow-rs propagates the slice for a StructArray on slice(): https://docs.rs/arrow-array/57.2.0/src/arrow_array/array/struct_array.rs.html#337

It doesn't do this for list arrays: https://docs.rs/arrow-array/57.2.0/src/arrow_array/array/list_array.rs.html#396

...and possibly not for some other types like dictionaries, list views, or run-end-encoded. I think it's OK to punt on all of that but perhaps open an issue and link to it here in case this comes up.

@Kontinuation Kontinuation merged commit 01f56a9 into apache:main Jan 16, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants