GH-39815: [C++] Document and micro-optimize ChunkResolver::Resolve() #39817

felipecrv · 2024-01-27T02:34:15Z

Rationale for this change

There has been interest in improving operations on chunked arrays. Even though ChunkResolver::Resolve() is not a big contributor in most kernels, it can be called from tight loops, which warrants careful attention to the branch prediction and memory effects of its implementation.

What changes are included in this PR?

Documentation of invariants and behavior of functions
Multiple optimizations justified by microbenchmarks
Addition of a variation of Resolve that takes a hint as parameter
Fix of an out-of-bounds memory access that doesn't affect correctness (it can only reduce effectiveness of cache in very rare situations, but is nevertheless an issue)
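
For readers unfamiliar with chunk resolution, here is a minimal sketch of the operation being optimized: mapping a logical index to a (chunk index, index within chunk) pair with a binary search over cumulative offsets. `SimpleResolver` and `Location` are hypothetical names used for illustration only. This is not Arrow's implementation, and it assumes non-negative indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a chunk resolver; not Arrow's code.
struct Location {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

class SimpleResolver {
 public:
  // chunk_lengths holds the length of each chunk, in order.
  // offsets_ gets one more entry than there are chunks: offsets_[0] == 0 and
  // offsets_[i + 1] is the logical index one past the end of chunk i.
  explicit SimpleResolver(const std::vector<int64_t>& chunk_lengths)
      : offsets_(chunk_lengths.size() + 1, 0) {
    for (std::size_t i = 0; i < chunk_lengths.size(); ++i) {
      offsets_[i + 1] = offsets_[i] + chunk_lengths[i];
    }
  }

  // Assumes index >= 0. An out-of-bounds index resolves to
  // chunk_index == number of chunks.
  Location Resolve(int64_t index) const {
    // Binary search for the last position whose offset is <= index.
    int64_t lo = 0;
    int64_t n = static_cast<int64_t>(offsets_.size());
    while (n > 1) {
      const int64_t half = n / 2;
      const int64_t mid = lo + half;
      if (index >= offsets_[mid]) {
        lo = mid;
        n -= half;
      } else {
        n = half;
      }
    }
    return {lo, index - offsets_[lo]};
  }

 private:
  std::vector<int64_t> offsets_;
};
```

With chunk lengths {3, 0, 5}, logical index 3 resolves to chunk 2 (the empty chunk 1 is skipped), and index 8, which equals the total length, resolves to chunk index 3, the number of chunks.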

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

The arrow::internal::ChunkResolver::Bisect() function was protected and is now private with a different signature

Closes: [C++] Document the guarantees of ChunkResolver::Resolve and minimize branching #39815

Reproduction:

    // on the first out-of-bounds query, chunks.size() is cached as
    // cached_chunk_.
    Resolve(chunked_array->length());
    // on the second out-of-bounds query, chunks.size() is loaded from
    // cached_chunk_ and...
    Resolve(chunked_array->length());

...even though offsets[cached_chunk] is a valid access because offsets.size() == chunks.size() + 1, offsets[cached_chunk + 1] is not: it reads index chunks.size() + 1, which is one past the last valid index of offsets.
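
The bounds problem can be illustrated with a small sketch of a cached-chunk check that guards the second load. `CachedChunkContains` is a hypothetical helper written for this illustration. It is one possible way to add the guard and is not claimed to be the exact fix in this PR. It assumes 0 <= cached_chunk <= number of chunks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not Arrow's API. Returns true when `index` falls inside
// the chunk identified by `cached_chunk`. `offsets` has num_chunks + 1 entries.
inline bool CachedChunkContains(const std::vector<int64_t>& offsets,
                                int64_t cached_chunk, int64_t index) {
  const int64_t num_offsets = static_cast<int64_t>(offsets.size());
  // offsets[cached_chunk] is always in bounds under the stated assumption.
  // The second load is guarded: when cached_chunk == num_chunks (the value
  // cached after an out-of-bounds query), cached_chunk + 1 == num_offsets,
  // which is one past the last valid position in `offsets`.
  return index >= offsets[cached_chunk] &&
         cached_chunk + 1 < num_offsets &&
         index < offsets[cached_chunk + 1];
}
```

Because `&&` short-circuits, `offsets[cached_chunk + 1]` is only read after the bounds comparison has succeeded.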

This allows callers to keep the cached chunk index hint in a local variable (register) instead of relying on the in-memory cached_chunk_ member variable of ChunkResolver.
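
As a rough sketch of the idea (not the actual ResolveWithChunkIndexHint() implementation), a hint-taking lookup can check the hinted chunk first and fall back to a binary search otherwise. `ResolveWithHint`, `Loc`, and `SumIndexInChunk` are hypothetical names. The sketch assumes offsets[0] == 0, non-negative indices, and a non-negative hint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Loc {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

// Hypothetical free function, for illustration only.
// `offsets` has num_chunks + 1 entries and offsets[0] == 0.
inline Loc ResolveWithHint(const std::vector<int64_t>& offsets, int64_t index,
                           int64_t hint) {
  const int64_t num_offsets = static_cast<int64_t>(offsets.size());
  // Fast path: the hinted chunk contains the index. The bounds comparison
  // comes first so that neither offsets load can go out of range.
  if (hint + 1 < num_offsets && index >= offsets[hint] &&
      index < offsets[hint + 1]) {
    return {hint, index - offsets[hint]};
  }
  // Slow path: the first offset greater than `index` ends the chunk we want.
  const auto it = std::upper_bound(offsets.begin(), offsets.end(), index);
  const int64_t chunk = static_cast<int64_t>(it - offsets.begin()) - 1;
  return {chunk, index - offsets[chunk]};
}

// The caller keeps the last location in a local variable, so the hint can
// stay in a register instead of being re-read from a member variable.
inline int64_t SumIndexInChunk(const std::vector<int64_t>& offsets,
                               const std::vector<int64_t>& indices) {
  Loc loc{0, 0};
  int64_t total = 0;
  for (const int64_t i : indices) {
    loc = ResolveWithHint(offsets, i, loc.chunk_index);
    total += loc.index_in_chunk;
  }
  return total;
}
```

Whether the hint actually stays in a register depends on inlining and register allocation, so the benefit has to be confirmed by benchmarks rather than assumed.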

felipecrv · 2024-01-27T02:42:16Z

Benchmarks of sort and rank on chunked arrays, which are heavy users of ChunkResolver. Three measurements were taken after roughly every change to give an idea of the noise level. The purple group (bounds-check-fix) is where I fixed the out-of-bounds access bug that exists on main (it was not introduced by my optimizations). The two groups after that bring throughput back to the level reached before the bounds check was added.

Ideas that were tried and didn't make a difference or made throughput worse:

Removing the use of std::atomic completely: relaxed atomic operations are enough (which is good, because removing the atomics could introduce bugs)
Starting the Bisect on different ranges depending on the results of the branches
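
For context on the first item, here is a minimal sketch of what relaxed atomic access to a cached chunk index can look like. `ChunkIndexCache` is a hypothetical wrapper written for illustration, not the ChunkResolver member itself. Relaxed ordering guarantees only that each load and store is atomic, which is sufficient when the cached value is just a hint that gets re-validated on use.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical wrapper, for illustration only.
class ChunkIndexCache {
 public:
  // Relaxed load: atomic, but imposes no ordering on surrounding memory
  // operations.
  int64_t Load() const { return cached_.load(std::memory_order_relaxed); }

  // Relaxed store: concurrent readers see either the old or the new value,
  // never a torn one.
  void Store(int64_t chunk_index) {
    cached_.store(chunk_index, std::memory_order_relaxed);
  }

 private:
  std::atomic<int64_t> cached_{0};
};
```

A stale value read by another thread only costs a cache miss followed by a fresh search, which is why stronger orderings are not needed for this kind of hint.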

[1] ninja arrow-compute-vector-sort-benchmark && ./**/arrow-compute-vector-sort-benchmark --benchmark_filter="ChunkedArray(Sort|Rank).*Int64.*65536/100(/tiebreaker:2|$)" --benchmark_out_format=csv

felipecrv · 2024-01-27T02:42:59Z

@amol- @pitrou @js8544

pitrou · 2024-01-28T09:54:18Z

+1 on the principle, looks like a nice improvement.

pitrou · 2024-01-28T09:54:31Z

@ursabot please benchmark

ursabot · 2024-01-28T09:54:36Z

Benchmark runs are scheduled for commit a392353. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou · 2024-01-28T09:56:00Z

cpp/src/arrow/compute/kernels/vector_sort.cc

-                 const auto left_loc = left_resolver_.Resolve(left);
-                 const auto right_loc = right_resolver_.Resolve(right);
+                 left_loc =
+                     left_resolver_.ResolveWithChunkIndexHint(left, left_loc.chunk_index);


The point of this is to avoid all atomic accesses, right?

Memory access really: keeping the ChunkLocation::chunk_index in registers throughout the loop -- of course it depends on the register allocation and inlining of ResolveWithChunkIndexHint. It improved the sort benchmarks slightly.

conbench-apache-arrow · 2024-01-28T12:23:00Z

Thanks for your patience. Conbench analyzed the 5 benchmarking runs that have been run so far on PR commit a392353.

There was 1 benchmark result indicating a performance regression:

Pull Request Run on ursa-i9-9960x at 2024-01-28 10:03:30Z
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-16, scale_factor=1

The full Conbench report has more details.

pitrou

There are nice chunked Sort and Rank performance improvements on the ARM64 benchmark machines. Thank you!

felipecrv · 2024-01-29T14:08:22Z

> There are nice chunked Sort and Rank performance improvements on the ARM64 benchmark machines. Thank you!

Nice numbers indeed! 🚀

https://conbench.ursa.dev/compare/runs/1d47651f96e0468283cec3367674f7a4...5c59b6b2d4fe4a5ea599d5685fb5102b/

conbench-apache-arrow · 2024-01-29T16:34:29Z

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 2fa095c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 132 possible false positives for unstable benchmarks that are known to sometimes produce them.

…lve() (apache#39817)

### Rationale for this change

There has been interest in improving operations on chunked-arrays and even though `ChunkResolver::Resolve()` is not a big contributor in most kernels, the fact that it can be used from tight loops warrants careful attention to branch prediction and memory effects of its implementation.

### What changes are included in this PR?

- Documentation of invariants and behavior of functions
- Multiple optimizations justified by microbenchmarks
- Addition of a variation of `Resolve` that takes a hint as parameter
- Fix of an out-of-bounds memory access that doesn't affect correctness (it can only reduce effectiveness of cache in very rare situations, but is nevertheless an issue)

### Are these changes tested?

Yes, by existing tests.

### Are there any user-facing changes?

- The `arrow::internal::ChunkResolver::Bisect()` function was `protected` and is now `private` with a different signature

* Closes: apache#39815

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>

felipecrv added 9 commits January 26, 2024 23:20

chunk_resolver.h: Document all invariants and required pre-conditions

3175612

Cached binary search result operations only need atomicity

df62c75

Prepare Bisect to take sub-range parameters

2e17fed

Dispatch load ASAP

655f8d5

Remove the now unnecessary num_offsets<=1 check

6a13c38

Remove first binary search branch

17e8b94

Add the ChunkResolver::ResolveWithChunkIndexHint() function

e5b2c1b

This allows callers to keep the cached chunk index hint in a local variable (register) instead of relying on the in-memory cached_chunk_ member variable of ChunkResolver.

sort: Use ResolveWithChunkIndexHint()

a392353

github-actions bot added the Component: C++ and awaiting review labels Jan 27, 2024

pitrou reviewed Jan 28, 2024

github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 28, 2024

pitrou approved these changes Jan 29, 2024

felipecrv merged commit 2fa095c into apache:main Jan 29, 2024

felipecrv removed the awaiting committer review label Jan 29, 2024

felipecrv deleted the chunk_resolver branch January 29, 2024 14:09
