chore(rust/sedona-spatial-join): More accurate batch in-memory size estimation #515
Conversation
Pull request overview
This PR improves memory usage estimation accuracy for spatial join operations by implementing custom functions to calculate the in-memory sizes of Arrow record batches and arrays, replacing the built-in `get_array_memory_size` method, which can produce inaccurate results for spilled batches.
Changes:
- Added a new utility module `arrow_utils.rs` with custom memory size estimation functions
- Updated memory size calculation methods to return `Result<usize>` instead of `usize` to handle potential errors
- Modified all call sites to propagate errors from the new estimation functions
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| rust/sedona-spatial-join/src/utils/arrow_utils.rs | Implements custom memory size estimation functions for RecordBatch and Array types, including special handling for BinaryView and Utf8View types |
| rust/sedona-spatial-join/src/utils.rs | Exports the new arrow_utils module |
| rust/sedona-spatial-join/src/operand_evaluator.rs | Updates in_mem_size() to return Result and use new estimation function |
| rust/sedona-spatial-join/src/index/spatial_index_builder.rs | Updates add_batch() to handle Result return type and propagate errors |
| rust/sedona-spatial-join/src/index/spatial_index.rs | Updates test code to handle new error-returning add_batch() |
| rust/sedona-spatial-join/src/index/build_side_collector.rs | Updates memory size calculation to handle Result type |
| rust/sedona-spatial-join/src/evaluated_batch.rs | Updates in_mem_size() to return Result and use new estimation function |
```rust
/// Estimate the in-memory size of a given Arrow array. This function estimates the
/// size as if the underlying buffers were copied to somewhere else and not shared.
pub(crate) fn get_array_memory_size(array: &ArrayRef) -> Result<usize> {
    let array_data = array.to_data();
    let size = count_array_data_memory_size(&array_data)?;
    Ok(size)
}
```
It is already using get_slice_memory_size under the hood. However, get_slice_memory_size does not take the buffers of binary/utf8 views into account; count_array_data_memory_size is meant to fix that.
```rust
let mut result: usize = 0;
let array_data_type = array_data.data_type();

if matches!(array_data_type, DataType::BinaryView | DataType::Utf8View) {
```
This seems like it is missing a branch for non BinaryView types. I imagine we aren't trying to ignore the memory occupied by non-view types?
Non-view types were already taken care of by get_slice_memory_size.
Got it! I suggested a comment so that future me also doesn't think that 🙂
A comment was added to get_array_data_memory_size.
```rust
// | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 |
// |-----------|-----------|------------|-------------|
// | length    | prefix    | buf. index | offset      |
let views = &array_data.buffer::<u128>(0)[..array_data.len()];
```
Should this take into account the offset?
Suggested change:
```diff
- let views = &array_data.buffer::<u128>(0)[..array_data.len()];
+ let views = &array_data.buffer::<u128>(0)[array_data.offset()..array_data.len()];
```
This does not look correct to me. ArrayData::buffer has already taken offset into consideration: https://github.com/apache/arrow-rs/blob/57.2.0/arrow-data/src/data.rs#L580-L588
I'll add the test cases for buffers not starting at 0 anyway.
```rust
    "Long string that is definitely longer than 12 bytes",
    "Another long string to make buffer larger",
]);
let sliced = array.slice(0, 2);
```
We can also test array.slice(2, 1) to test the non-zero offset case
array.slice(2, 1) is invalid since the array contains only 2 rows. I have tested array.slice(1, 1) and array.slice(2, 0) instead.
```rust
let array_ref: ArrayRef = Arc::new(struct_array);
let size = get_array_memory_size(&array_ref).unwrap();
// 83 (StringView) + 1 (Boolean values) = 84
assert_eq!(size, 84);
```
I think array_ref.slice(0, 1) will give the wrong answer with the current code (the slice is not projected into the children)
What do you think about this test case? (Also OK to punt with a comment in the appropriate location that we are not calculating correct memory usage of sliced lists/structs?)
I have added array_ref.slice(0, 1) test below and it gave correct result.
Ah, arrow-rs propagates the slice for a StructArray on slice(): https://docs.rs/arrow-array/57.2.0/src/arrow_array/array/struct_array.rs.html#337
It doesn't do this for list arrays: https://docs.rs/arrow-array/57.2.0/src/arrow_array/array/list_array.rs.html#396
...and possibly not for some other types like dictionaries, list views, or run-end-encoded. I think it's OK to punt on all of that but perhaps open an issue and link to it here in case this comes up.
I'd like to leave get_array_data_memory_size as is and log a ticket. #519
And also added an XFAIL test case for list array slices.
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
I have added the requested tests and also renamed the private function
paleolimbot left a comment:
Apologies for the nits on offset/length (I still have nightmares from implementing nested offsets in nanoarrow). I think this is still not right for some cases but feel free to file an issue and follow up later.
This patch improves the accuracy of memory usage estimation by implementing our own functions for estimating the in-memory sizes of record batches and Arrow arrays.

The rationale is similar to apache/datafusion#13377. If we don't roll our own memory usage estimation function but call `RecordBatch::get_array_memory_size` instead, we'll get insanely inaccurate numbers for spilled batches read using `arrow::ipc::reader::StreamReader`.

Future work: use the memory pool API of arrow-rs for more accurate memory usage accounting. See apache/arrow-rs#8137.