Skip to content

Conversation

@felipecrv
Copy link
Contributor

@felipecrv felipecrv commented Jan 27, 2024

Rationale for this change

There has been interest in improving operations on chunked-arrays and even though ChunkResolver::Resolve() is not a big contributor in most kernels, the fact that it can be used from tight loops warrants careful attention to branch prediction and memory effects of its implementation.

What changes are included in this PR?

  • Documentation of invariants and behavior of functions
  • Multiple optimizations justified by microbenchmarks
  • Addition of a variation of Resolve that takes a hint as parameter
  • Fix of an out-of-bounds memory access that doesn't affect correctness (it can only reduce effectiveness of cache in very rare situations, but is nevertheless an issue)

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

  • The arrow::internal::ChunkResolver::Bisect() function was protected and is now private with a different signature

Reproduction:

      // on the first out-of-bounds query, chunks.size() is cached as
      // cached_chunk_.
      Resolve(chunked_array->length());
      // on the second out-of-bounds query, chunks.size() is loaded from
      // cached_chunk_ and...
      Resolve(chunked_array->length());

...even though offsets[cached_chunk] is a valid access because
offsets.size() == chunks.size()+1, offsets[cached_chunk + 1] is not
because that's is equivalent to chunks.size() + 2 which is out of bounds
for offsets.
This allows callers to keep the cached chunk index hint in a local
variable (register) instead of relying on the in-memory cached_chunk_
member variable of ChunkResolver.
@felipecrv
Copy link
Contributor Author

Benchmarks of sort and rank on chunked arrays -- heavy users of ChunkResolver. 3 measurements after roughly every change to give an idea of level of noise. The purple group (bounds-check-fix) is when I fixed the out-of-bounds access bug that exists on main (not introduced by me in the optimizations). After that, the other two groups bring improvements that bring the throughput back to what was achieved before the bounds check.

resolve

Ideas that were tried and didn't make a difference or made throughput worse:

  • Removing the use of std::atomic completely, relaxed atomic operations are enough (which is good because that could introduce bugs)
  • Starting the Bisect on different ranges depending on the results of the branches

[1] ninja arrow-compute-vector-sort-benchmark && ./**/arrow-compute-vector-sort-benchmark --benchmark_filter="ChunkedArray(Sort|Rank).*Int64.*65536/100(/tiebreaker:2|$)" --benchmark_out_format=csv

@felipecrv
Copy link
Contributor Author

@amol- @pitrou @js8544

@pitrou
Copy link
Member

pitrou commented Jan 28, 2024

+1 on the principle, looks like a nice improvement.

@pitrou
Copy link
Member

pitrou commented Jan 28, 2024

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Jan 28, 2024

Benchmark runs are scheduled for commit a392353. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

const auto left_loc = left_resolver_.Resolve(left);
const auto right_loc = right_resolver_.Resolve(right);
left_loc =
left_resolver_.ResolveWithChunkIndexHint(left, left_loc.chunk_index);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this is to avoid all atomic accesses, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory access really: keeping the ChunkLocation::chunk_index in registers throughout the loop -- of course it depends on the register allocation and inlining of ResolveWithChunkIndexHint. It improved the sort benchmarks slightly.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 28, 2024
@conbench-apache-arrow
Copy link

Thanks for your patience. Conbench analyzed the 5 benchmarking runs that have been run so far on PR commit a392353.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are nice chunked Sort and Rank performance improvements on the ARM64 benchmark machines. Thank you!

@felipecrv
Copy link
Contributor Author

There are nice chunked Sort and Rank performance improvements on the ARM64 benchmark machines. Thank you!

Nice numbers indeed! 🚀

Screenshot 2024-01-29 at 11 05 10

https://conbench.ursa.dev/compare/runs/1d47651f96e0468283cec3367674f7a4...5c59b6b2d4fe4a5ea599d5685fb5102b/

@felipecrv felipecrv merged commit 2fa095c into apache:main Jan 29, 2024
@felipecrv felipecrv removed the awaiting committer review Awaiting committer review label Jan 29, 2024
@felipecrv felipecrv deleted the chunk_resolver branch January 29, 2024 14:09
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 2fa095c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 132 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…lve() (apache#39817)

### Rationale for this change

There has been interest in improving operations on chunked-arrays and even though `ChunkResolver::Resolve()` is not a big contributor in most kernels, the fact that it can be used from tight loops warrants careful attention to branch prediction and memory effects of its implementation.

### What changes are included in this PR?

 - Documentation of invariants and behavior of functions
 - Multiple optimizations justified by microbenchmarks
 - Addition of a variation of `Resolve` that takes a hint as parameter
 - Fix of an out-of-bounds memory access that doesn't affect correctness (it can only reduce effectiveness of cache in very rare situations, but is nevertheless an issue)

### Are these changes tested?

Yes, by existing tests.

### Are there any user-facing changes?

 - The `arrow::internal::ChunkResolver::Bisect()` function was `protected` and is now `private` with a different signature
* Closes: apache#39815

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] Document the guarantees of ChunkResolver::Resolve and minimize branching

3 participants