-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-39815: [C++] Document and micro-optimize ChunkResolver::Resolve() #39817
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reproduction:
// on the first out-of-bounds query, chunks.size() is cached as
// cached_chunk_.
Resolve(chunked_array->length());
// on the second out-of-bounds query, chunks.size() is loaded from
// cached_chunk_ and...
Resolve(chunked_array->length());
...even though offsets[cached_chunk] is a valid access because
offsets.size() == chunks.size()+1, offsets[cached_chunk + 1] is not
because that's is equivalent to chunks.size() + 2 which is out of bounds
for offsets.
This allows callers to keep the cached chunk index hint in a local variable (register) instead of relying on the in-memory cached_chunk_ member variable of ChunkResolver.
|
Benchmarks of Ideas that were tried and didn't make a difference or made throughput worse:
[1] |
|
+1 on the principle, looks like a nice improvement. |
|
@ursabot please benchmark |
|
Benchmark runs are scheduled for commit a392353. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
| const auto left_loc = left_resolver_.Resolve(left); | ||
| const auto right_loc = right_resolver_.Resolve(right); | ||
| left_loc = | ||
| left_resolver_.ResolveWithChunkIndexHint(left, left_loc.chunk_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of this is to avoid all atomic accesses, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Memory access really: keeping the ChunkLocation::chunk_index in registers throughout the loop -- of course it depends on the register allocation and inlining of ResolveWithChunkIndexHint. It improved the sort benchmarks slightly.
|
Thanks for your patience. Conbench analyzed the 5 benchmarking runs that have been run so far on PR commit a392353. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are nice chunked Sort and Rank performance improvements on the ARM64 benchmark machines. Thank you!
|
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 2fa095c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 132 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…lve() (apache#39817) ### Rationale for this change There has been interest in improving operations on chunked-arrays and even though `ChunkResolver::Resolve()` is not a big contributor in most kernels, the fact that it can be used from tight loops warrants careful attention to branch prediction and memory effects of its implementation. ### What changes are included in this PR? - Documentation of invariants and behavior of functions - Multiple optimizations justified by microbenchmarks - Addition of a variation of `Resolve` that takes a hint as parameter - Fix of an out-of-bounds memory access that doesn't affect correctness (it can only reduce effectiveness of cache in very rare situations, but is nevertheless an issue) ### Are these changes tested? Yes, by existing tests. ### Are there any user-facing changes? - The `arrow::internal::ChunkResolver::Bisect()` function was `protected` and is now `private` with a different signature * Closes: apache#39815 Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>


Rationale for this change
There has been interest in improving operations on chunked-arrays and even though
ChunkResolver::Resolve()is not a big contributor in most kernels, the fact that it can be used from tight loops warrants careful attention to branch prediction and memory effects of its implementation.What changes are included in this PR?
Resolvethat takes a hint as parameterAre these changes tested?
Yes, by existing tests.
Are there any user-facing changes?
arrow::internal::ChunkResolver::Bisect()function wasprotectedand is nowprivatewith a different signature