Skip to content

GH-32276: [C++][FlightRPC] Add option to align RecordBatch buffers given to IPC reader#44279

Merged
lidavidm merged 22 commits intoapache:mainfrom
EnricoMi:flight-client-align-buffers
May 19, 2025
Merged

GH-32276: [C++][FlightRPC] Add option to align RecordBatch buffers given to IPC reader#44279
lidavidm merged 22 commits intoapache:mainfrom
EnricoMi:flight-client-align-buffers

Conversation

@EnricoMi
Copy link
Copy Markdown
Collaborator

@EnricoMi EnricoMi commented Oct 1, 2024

Rationale for this change

Data retrieved via IPC is expected to provide memory-aligned arrays, but data retrieved via C++ Flight client is mis-aligned. Datafusion (Rust), which requires data type-specific alignment, cannot handle such data: #43552.
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

What changes are included in this PR?

This adds option arrow::ipc::IpcReadOptions.ensure_alignment of type arrow::ipc::Alignment to configure how RecordBatch array buffers decoded by IPC are realigned. It supports no realignment (default), data type-specific alignment and 64-byte alignment.

Implementation mirrors that of align_buffers in arrow-rs (apache/arrow-rs#4681).

Are these changes tested?

Configuration flag tested in unit test. Integration test with Flight server.
Manually end-to-end tested that memory alignment fixes issue with reproduction code provided in #43552.

Are there any user-facing changes?

Adds option IpcReadOptions.ensure_alignment and enum type Alignment.

@EnricoMi EnricoMi changed the title GH-32276: [C++][Flight] Align RecordBatch buffers retrieved via IPC GH-32276: [C++][FlightRPC] Align RecordBatch buffers retrieved via IPC Oct 1, 2024
@github-actions github-actions bot added the awaiting review Awaiting review label Oct 1, 2024
@EnricoMi EnricoMi changed the title GH-32276: [C++][FlightRPC] Align RecordBatch buffers retrieved via IPC GH-32276: [C++][FlightRPC] Align RecordBatch buffers given to IPC Oct 1, 2024
@EnricoMi
Copy link
Copy Markdown
Collaborator Author

EnricoMi commented Oct 2, 2024

@pitrou do you think this fix is viable?

@rok rok requested a review from lidavidm October 2, 2024 10:23
Copy link
Copy Markdown
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up. This seems reasonable to me at a quick glance.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put this include with the rest of the Arrow includes (and use quotes to be consistent)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could iterate with for (auto& child : child_data) and avoid the explicit index?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this come from?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Oct 6, 2024
@EnricoMi EnricoMi force-pushed the flight-client-align-buffers branch from 77cc70a to a5d9e2d Compare October 7, 2024 10:57
@github-actions github-actions bot added Component: C++ Component: Python awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 7, 2024
@EnricoMi
Copy link
Copy Markdown
Collaborator Author

EnricoMi commented Oct 7, 2024

While attempting to write some unit tests I found there is util::EnsureAlignment:

Result<std::shared_ptr<ArrayData>> EnsureAlignment(std::shared_ptr<ArrayData> array_data,
int64_t alignment,
MemoryPool* memory_pool) {
if (!CheckAlignment(*array_data, alignment)) {
std::vector<std::shared_ptr<Buffer>> buffers = array_data->buffers;
Type::type type_id = GetTypeForBuffers(*array_data);
for (size_t i = 0; i < buffers.size(); ++i) {
if (buffers[i]) {
int64_t expected_alignment = alignment;
if (alignment == kValueAlignment) {
expected_alignment =
RequiredValueAlignmentForBuffer(type_id, static_cast<int>(i));
}
ARROW_ASSIGN_OR_RAISE(
buffers[i],
EnsureAlignment(std::move(buffers[i]), expected_alignment, memory_pool));
}
}
for (auto& it : array_data->child_data) {
ARROW_ASSIGN_OR_RAISE(it, EnsureAlignment(std::move(it), alignment, memory_pool));
}
if (array_data->type->id() == Type::DICTIONARY) {
ARROW_ASSIGN_OR_RAISE(
array_data->dictionary,
EnsureAlignment(std::move(array_data->dictionary), alignment, memory_pool));
}
auto new_array_data = ArrayData::Make(
array_data->type, array_data->length, std::move(buffers), array_data->child_data,
array_data->dictionary, array_data->GetNullCount(), array_data->offset);
return new_array_data;
} else {
return array_data;
}
}

I will try to reuse that method rather than re-implementing it. There is also test infrastructure for misaligned array data.

Copy link
Copy Markdown
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rebase? Tests appear to be failing

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the memory pool in context.options.memory_pool?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Ideally we should use the buffer's memory manager rather than the default CPU manager:

static std::unique_ptr<PoolBuffer> MakeUnique(MemoryPool* pool, int64_t alignment) {
std::shared_ptr<MemoryManager> mm;
if (pool == nullptr) {
pool = default_memory_pool();
mm = default_cpu_memory_manager();
} else {
mm = CPUDevice::memory_manager(pool);
}
return std::make_unique<PoolBuffer>(std::move(mm), pool, alignment);
}

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review awaiting committer review Awaiting committer review and removed awaiting change review Awaiting change review awaiting review Awaiting review awaiting changes Awaiting changes labels Oct 14, 2024
@EnricoMi EnricoMi force-pushed the flight-client-align-buffers branch 2 times, most recently from 960cb21 to 9909f13 Compare October 22, 2024 08:55
@EnricoMi
Copy link
Copy Markdown
Collaborator Author

EnricoMi commented Oct 22, 2024

Test arrow-ipc-read-write-test fails with a SIGSEGV in line 45 of

Type::type GetTypeForBuffers(const ArrayData& array) {
Type::type type_id = array.type->storage_id();
if (type_id == Type::DICTIONARY) {
return ::arrow::internal::checked_pointer_cast<DictionaryType>(array.type)
->index_type()
->id();
}
return type_id;
}

https://github.com/apache/arrow/actions/runs/11462607112/job/31894398411?pr=44279#step:13:1548

That test complains a lot about /home/enrico/Work/git/arrow/cpp/src/arrow/status.cc:152: Invalid: RequiredValueAlignmentForBuffer called with invalid type id 29 due to RequiredValueAlignmentForBuffer being called for DICTIONARY data type in line 62:

bool CheckSelfAlignment(const ArrayData& array, int64_t alignment) {
if (alignment == kValueAlignment) {
Type::type type_id = GetTypeForBuffers(array);
for (std::size_t i = 0; i < array.buffers.size(); i++) {
if (array.buffers[i]) {
int expected_alignment =
RequiredValueAlignmentForBuffer(type_id, static_cast<int>(i));
if (!CheckAlignment(*array.buffers[i], expected_alignment)) {
return false;
}
}
}
} else {
for (const auto& buffer : array.buffers) {
if (buffer) {
if (!CheckAlignment(*buffer, alignment)) return false;
}
}
}
return true;
}

Looks like util::EnsureAlignment and util::CheckAlignment is not quite ready for this use case.

@EnricoMi EnricoMi force-pushed the flight-client-align-buffers branch from f2dae5b to d1219d2 Compare October 22, 2024 15:50
@EnricoMi
Copy link
Copy Markdown
Collaborator Author

@westonpace @sanjibansg
Any pointers in which situations array.type can be null (causing GetDtypeForBuffers to segfault)?
Any rational on why RequiredValueAlignmentForBuffer does not support DICTIONARY data type?
#44279 (comment)

@rok rok requested review from felipecrv and lidavidm November 7, 2024 11:53
@EnricoMi EnricoMi force-pushed the flight-client-align-buffers branch from 63d7251 to bf1a7e9 Compare May 8, 2025 08:45
@pitrou
Copy link
Copy Markdown
Member

pitrou commented May 15, 2025

Thanks @EnricoMi . This looks good to me except for the nit above.

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@lidavidm lidavidm merged commit f45594d into apache:main May 19, 2025
35 of 36 checks passed
@lidavidm lidavidm removed the awaiting change review Awaiting change review label May 19, 2025
@lidavidm
Copy link
Copy Markdown
Member

Thanks @EnricoMi!

@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f45594d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 15 possible false positives for unstable benchmarks that are known to sometimes produce them.

EnricoMi added a commit to EnricoMi/arrow that referenced this pull request May 23, 2025
…ers given to IPC reader (apache#44279)

### Rationale for this change
Data retrieved via IPC is expected to provide memory-aligned arrays, but data retrieved via C++ Flight client is mis-aligned. Datafusion (Rust), which requires data type-specific alignment, cannot handle such data: apache#43552.
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

### What changes are included in this PR?
This adds option `arrow::ipc::IpcReadOptions.ensure_alignment` of type `arrow::ipc::Alignment` to configure how RecordBatch array buffers decoded by IPC are realigned. It supports no realignment (default), data type-specific alignment and 64-byte alignment.

Implementation mirrors that of [`align_buffers` in arrow-rs](https://github.com/apache/arrow-rs/blob/3293a8c2f9062fca93bee2210d540a1d25155bf5/arrow-data/src/data.rs#L698-L711) (apache/arrow-rs#4681).

### Are these changes tested?
Configuration flag tested in unit test. Integration test with Flight server.
Manually end-to-end tested that memory alignment fixes issue with reproduction code provided in apache#43552.

### Are there any user-facing changes?
Adds option `IpcReadOptions.ensure_alignment` and enum type `Alignment`.

* GitHub Issue: apache#32276

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: David Li <li.davidm96@gmail.com>
EnricoMi added a commit to G-Research/arrow that referenced this pull request May 23, 2025
…ers given to IPC reader (apache#44279) (#6)

### Rationale for this change
Data retrieved via IPC is expected to provide memory-aligned arrays, but data retrieved via C++ Flight client is mis-aligned. Datafusion (Rust), which requires data type-specific alignment, cannot handle such data: apache#43552.
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

### What changes are included in this PR?
This adds option `arrow::ipc::IpcReadOptions.ensure_alignment` of type `arrow::ipc::Alignment` to configure how RecordBatch array buffers decoded by IPC are realigned. It supports no realignment (default), data type-specific alignment and 64-byte alignment.

Implementation mirrors that of [`align_buffers` in arrow-rs](https://github.com/apache/arrow-rs/blob/3293a8c2f9062fca93bee2210d540a1d25155bf5/arrow-data/src/data.rs#L698-L711) (apache/arrow-rs#4681).

### Are these changes tested?
Configuration flag tested in unit test. Integration test with Flight server.
Manually end-to-end tested that memory alignment fixes issue with reproduction code provided in apache#43552.

### Are there any user-facing changes?
Adds option `IpcReadOptions.ensure_alignment` and enum type `Alignment`.

* GitHub Issue: apache#32276

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>

Signed-off-by: David Li <li.davidm96@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants