Skip to content

Validate ArrayData Buffer Alignment and Automatically Align IPC buffers (#4255)#4681

Merged
tustvold merged 6 commits intoapache:masterfrom
tustvold:handle-unaligned-ipc-buffers
Aug 16, 2023
Merged

Validate ArrayData Buffer Alignment and Automatically Align IPC buffers (#4255)#4681
tustvold merged 6 commits intoapache:masterfrom
tustvold:handle-unaligned-ipc-buffers

Conversation

@tustvold
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #4255

Rationale for this change

This will allow zero-copy IPC (#4254), whilst also acting as an escape valve for issues like apache/arrow#32276 where arrow-cpp produces unaligned buffers

What changes are included in this PR?

Are there any user-facing changes?

@tustvold tustvold added the api-change Changes to the arrow API label Aug 11, 2023
@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 11, 2023

/// Checks if given `Buffer` is properly aligned with `T`.
/// If not, copying the data and padded it for alignment.
fn get_aligned_buffer<T>(buffer: &Buffer, length: usize) -> Buffer {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally added because aarch64 has stronger alignment requirements for i128 than the 64-bit alignment commonly provided by flatbuffers, now this is handled systematically for all types

)));
}

if buffer.as_ptr().align_offset(*alignment) != 0 {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now actually verify this as part of ArrayData, instead of panicking when you try to create an array 🎉

buffers: vec![BufferSpec::FixedWidth { byte_width }],
buffers: vec![BufferSpec::FixedWidth {
byte_width: mem::size_of::<T>(),
alignment: mem::align_of::<T>(),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One consequence of this formulation is the alignment is architecture specific, I suspect this is fine, but it potentially worth highlighting.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well let's hope nobody mmaps arrow data between different architectures w/o carefully thinking about alignment.

On that note: I wonder what the cost would be to heavily overalign buffers in general to to make them valid on all supported architectures. Like why not just align everything to 256bits?

Copy link
Copy Markdown
Contributor Author

@tustvold tustvold Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like why not just align everything to 256bits

We do, Rust doesn't 😄 And at least until the allocator API is stabilised there is no way to over-align a Vec

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are aligning all buffers to native sizes -- I thought the point was that the arrow spec said we should align buffers to 64 bits -- if that is the case, shouldn't this code reflect the spec, not the native size?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because native alignment is what the arrow-rs implementation requires, for many types this is less than 64-bits. We don't, for example, care if the values buffer of a StringArray is aligned at all. On the flip side, some architectures, such as aarch64 require more than 64-bit alignment for i128, and so we require that also

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the documentation added in ff6af65 make the behavior in this PR much clearer.

Thank you

let array_data = unsafe {
ArrayData::builder(DataType::Int32)
.add_buffer(buf2)
.build_unchecked()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to use build_unchecked here, as ArrayData will now complain about the alignment 🎉

@tustvold tustvold changed the title Automatically align misaligned IPC buffers (#4255) Validate ArrayData Bufer Alignment and Automatically Align IPC buffers (#4255) Aug 11, 2023
@tustvold tustvold changed the title Validate ArrayData Bufer Alignment and Automatically Align IPC buffers (#4255) Validate ArrayData Buffer Alignment and Automatically Align IPC buffers (#4255) Aug 11, 2023

if buffer.as_ptr().align_offset(*alignment) != 0 {
return Err(ArrowError::InvalidArgumentError(format!(
"Misaligned buffers[{i}] in array of type {:?}, expected {alignment}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful for debugging to include the actual alignment in addition to the expected value.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Aug 11, 2023

It appears something similar was done in arrow2 recently: jorgecarleitao/arrow2#1535

for (buffer, spec) in self.buffers.iter_mut().zip(&layout.buffers) {
if let BufferSpec::FixedWidth { alignment, .. } = spec {
if buffer.as_ptr().align_offset(*alignment) != 0 {
*buffer = Buffer::from_slice_ref(buffer.as_ref())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we maybe log a debug message when this happens? Some users might find it confusing that the buffers are silently copied / re-aligned.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't currently have a logging setup #1495

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do we consider to use a feature for automatic alignment?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I read this PR, if a user is creating ArrayData directly, they have to explicitly call align_buffer in order to ensure the data is aligned and the documentation says it may copy the data.

If they are using the IPC reader then the reader may automatically align (by copying) the data, which seems better than erroring.

As a user, if I didn't want buffers realigned automatically, it would make probably want a setting on the IPC reader that said "error rather than silently fix" on unaligned data. That seems like something we could add as a follow on PR if someone wants that behavior

@tustvold
Copy link
Copy Markdown
Contributor Author

t appears something similar was done in arrow2 recently: jorgecarleitao/arrow2#1535

That is for FFI, which I don't intend to make automatic, as it would likely be very surprising for users. Instead the fact we provide an escape valve in the form of ArrayData::align_buffers I think is sufficient

#[test]
#[should_panic(expected = "integer overflow computing min buffer size")]
#[should_panic(
expected = "Need at least 18446744073709551615 bytes in buffers[0] in array of type Int64, but got 8"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an allocation cannot be larger than isize, we might as well just saturate at usize::MAX, we're going to fail anyway

///
/// This can be useful for when interacting with data sent over IPC or FFI, that may
/// not meet the minimum alignment requirements
pub fn align_buffers(&mut self) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so can we use this to address unaligned imported arrays issue like #4431?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. It seems that it might take longer for a fix at Java Arrow implementation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also please document clearly

  1. What minimum alignment requirements are meant (is it Rust? Is it Arrow?)
  2. What happens if we have "unaligned buffers"? (Rust errors / panics? Less performance?)

@tustvold tustvold requested a review from alamb August 15, 2023 07:49
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me. I am also going to run the tests on an ARM processor to verify it is ok. Given the choice between copying to realign and runtime error, I think copying to realign is a reasonable behavior change

As I mentioned previously, it may be useful to add a "error rather than copy mode" but we can do that if anyone else thinks that is valuable

I think the documentation could be more clear about what alignment we are referring to and what happens when data is not aligned.

But overall thank you @tustvold -- I think this is a significant step forward

for (buffer, spec) in self.buffers.iter_mut().zip(&layout.buffers) {
if let BufferSpec::FixedWidth { alignment, .. } = spec {
if buffer.as_ptr().align_offset(*alignment) != 0 {
*buffer = Buffer::from_slice_ref(buffer.as_ref())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I read this PR, if a user is creating ArrayData directly, they have to explicitly call align_buffer in order to ensure the data is aligned and the documentation says it may copy the data.

If they are using the IPC reader then the reader may automatically align (by copying) the data, which seems better than erroring.

As a user, if I didn't want buffers realigned automatically, it would make probably want a setting on the IPC reader that said "error rather than silently fix" on unaligned data. That seems like something we could add as a follow on PR if someone wants that behavior

pub enum BufferSpec {
/// each element has a fixed width
FixedWidth { byte_width: usize },
FixedWidth { byte_width: usize, alignment: usize },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should document here what alignment represents (e.g. required pointer alignment for buffers of this type)

buffers: vec![BufferSpec::FixedWidth { byte_width }],
buffers: vec![BufferSpec::FixedWidth {
byte_width: mem::size_of::<T>(),
alignment: mem::align_of::<T>(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are aligning all buffers to native sizes -- I thought the point was that the arrow spec said we should align buffers to 64 bits -- if that is the case, shouldn't this code reflect the spec, not the native size?

///
/// This can be useful for when interacting with data sent over IPC or FFI, that may
/// not meet the minimum alignment requirements
pub fn align_buffers(&mut self) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also please document clearly

  1. What minimum alignment requirements are meant (is it Rust? Is it Arrow?)
  2. What happens if we have "unaligned buffers"? (Rust errors / panics? Less performance?)

}

#[test]
fn test_unaligned() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this test without the changes in the PR and it fails like this:

---- reader::tests::test_unaligned stdout ----
thread 'reader::tests::test_unaligned' panicked at 'Memory pointer is not aligned with the specified scalar type', /Users/alamb/Software/arrow-rs2/arrow-buffer/src/buffer/scalar.rs:125:42
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also verified this PR and all tests pass on an ARM processor ("Ampere Altra" on GCP) 👍

@tustvold
Copy link
Copy Markdown
Contributor Author

I've added a number of additional doc comments to hopefully clarify what alignment means in this context

@tustvold tustvold merged commit 197c425 into apache:master Aug 16, 2023
lidavidm added a commit to apache/arrow that referenced this pull request May 19, 2025
…ven to IPC reader (#44279)

### Rationale for this change
Data retrieved via IPC is expected to provide memory-aligned arrays, but data retrieved via C++ Flight client is mis-aligned. Datafusion (Rust), which requires data type-specific alignment, cannot handle such data: #43552.
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

### What changes are included in this PR?
This adds option `arrow::ipc::IpcReadOptions.ensure_alignment` of type `arrow::ipc::Alignment` to configure how RecordBatch array buffers decoded by IPC are realigned. It supports no realignment (default), data type-specific alignment and 64-byte alignment.

Implementation mirrors that of [`align_buffers` in arrow-rs](https://github.com/apache/arrow-rs/blob/3293a8c2f9062fca93bee2210d540a1d25155bf5/arrow-data/src/data.rs#L698-L711) (apache/arrow-rs#4681).

### Are these changes tested?
Configuration flag tested in unit test. Integration test with Flight server.
Manually end-to-end tested that memory alignment fixes issue with reproduction code provided in #43552.

### Are there any user-facing changes?
Adds option `IpcReadOptions.ensure_alignment` and enum type `Alignment`.

* GitHub Issue: #32276

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: David Li <li.davidm96@gmail.com>
EnricoMi added a commit to EnricoMi/arrow that referenced this pull request May 23, 2025
…ers given to IPC reader (apache#44279)

### Rationale for this change
Data retrieved via IPC is expected to provide memory-aligned arrays, but data retrieved via C++ Flight client is mis-aligned. Datafusion (Rust), which requires data type-specific alignment, cannot handle such data: apache#43552.
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

### What changes are included in this PR?
This adds option `arrow::ipc::IpcReadOptions.ensure_alignment` of type `arrow::ipc::Alignment` to configure how RecordBatch array buffers decoded by IPC are realigned. It supports no realignment (default), data type-specific alignment and 64-byte alignment.

Implementation mirrors that of [`align_buffers` in arrow-rs](https://github.com/apache/arrow-rs/blob/3293a8c2f9062fca93bee2210d540a1d25155bf5/arrow-data/src/data.rs#L698-L711) (apache/arrow-rs#4681).

### Are these changes tested?
Configuration flag tested in unit test. Integration test with Flight server.
Manually end-to-end tested that memory alignment fixes issue with reproduction code provided in apache#43552.

### Are there any user-facing changes?
Adds option `IpcReadOptions.ensure_alignment` and enum type `Alignment`.

* GitHub Issue: apache#32276

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: David Li <li.davidm96@gmail.com>
EnricoMi added a commit to G-Research/arrow that referenced this pull request May 23, 2025
…ers given to IPC reader (apache#44279) (#6)

### Rationale for this change
Data retrieved via IPC is expected to provide memory-aligned arrays, but data retrieved via C++ Flight client is mis-aligned. Datafusion (Rust), which requires data type-specific alignment, cannot handle such data: apache#43552.
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

### What changes are included in this PR?
This adds option `arrow::ipc::IpcReadOptions.ensure_alignment` of type `arrow::ipc::Alignment` to configure how RecordBatch array buffers decoded by IPC are realigned. It supports no realignment (default), data type-specific alignment and 64-byte alignment.

Implementation mirrors that of [`align_buffers` in arrow-rs](https://github.com/apache/arrow-rs/blob/3293a8c2f9062fca93bee2210d540a1d25155bf5/arrow-data/src/data.rs#L698-L711) (apache/arrow-rs#4681).

### Are these changes tested?
Configuration flag tested in unit test. Integration test with Flight server.
Manually end-to-end tested that memory alignment fixes issue with reproduction code provided in apache#43552.

### Are there any user-facing changes?
Adds option `IpcReadOptions.ensure_alignment` and enum type `Alignment`.

* GitHub Issue: apache#32276

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>

Signed-off-by: David Li <li.davidm96@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handle Misaligned IPC Buffers

4 participants