Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Jul 17, 2022

This completes the porting to use ExecSpan everywhere. I also changed the ExecBatchIterator benchmarks to use ExecSpan to show the performance improvement in input splitting that we've talked about in the past:

Splitting inputs into small ExecSpan:

------------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
BM_ExecSpanIterator/1024      205671 ns       205667 ns         3395 items_per_second=4.86223k/s
BM_ExecSpanIterator/4096       54749 ns        54750 ns        13121 items_per_second=18.265k/s
BM_ExecSpanIterator/16384      15979 ns        15979 ns        42628 items_per_second=62.5824k/s
BM_ExecSpanIterator/65536       5597 ns         5597 ns       125099 items_per_second=178.668k/s

Splitting inputs into small ExecBatch:

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_ExecBatchIterator/1024    17163432 ns     17163171 ns           41 items_per_second=58.2643/s
BM_ExecBatchIterator/4096     4243467 ns      4243316 ns          163 items_per_second=235.665/s
BM_ExecBatchIterator/16384    1093680 ns      1093638 ns          620 items_per_second=914.38/s
BM_ExecBatchIterator/65536     272451 ns       272435 ns         2584 items_per_second=3.6706k/s

Because the input in this benchmark has 1M elements, this shows that splitting into 1024 chunks of size 1024 adds only 0.2ms of overhead with ExecSpanIterator versus 17.16ms of overhead with ExecBatchIterator (> 80x improvement).

This won't by itself do much to impact performance in Acero but things for the community to explore in the future are the following (this work that I've been doing has been a precondition to consider this):

  • A leaner ExecuteScalarExpression implementation that reuses temporary allocations (ARROW-16758)
  • Parallel expression evaluation
  • Better defining morsel (~1M elements) versus task (~1K elements) granularity in execution
  • Work stealing so that we don't "hog" the thread pools, and we keep the work pinned to a particular CPU core if there are other things going on at the same time

@github-actions
Copy link

@wesm
Copy link
Member Author

wesm commented Jul 17, 2022

The only CI failures are the linkage issues related to protobuf/otel

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, just a couple minor comments.

wesm added 4 commits July 19, 2022 08:17
Fix some more stuff

Remove unused scalar path

More refactoring

Get everything compiling again

Fix some more bugs

Fix another bug

Remove dead code

Fix ArraySpan add/set offset logic

Cleaner ArraySpan slicing logic

Revert some files
@wesm wesm force-pushed the execspan-other-kernels branch from b182b8c to df1a92e Compare July 19, 2022 13:40
@wesm
Copy link
Member Author

wesm commented Jul 19, 2022

Thanks for the review. Will merge when CI green

@wesm
Copy link
Member Author

wesm commented Jul 19, 2022

Merging. The CI issues look like assorted flakiness to me

@wesm wesm merged commit 4d931ff into apache:master Jul 19, 2022
@ursabot
Copy link

ursabot commented Jul 20, 2022

Benchmark runs are scheduled for baseline = c445243 and contender = 4d931ff. 4d931ff is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️2.19% ⬆️0.21%] test-mac-arm
[Failed] ursa-i9-9960x
[Finished ⬇️1.03% ⬆️0.43%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 4d931ff1 ec2-t3-xlarge-us-east-2
[Failed] 4d931ff1 test-mac-arm
[Failed] 4d931ff1 ursa-i9-9960x
[Finished] 4d931ff1 ursa-thinkcentre-m75q
[Failed] c445243a ec2-t3-xlarge-us-east-2
[Failed] c445243a test-mac-arm
[Failed] c445243a ursa-i9-9960x
[Finished] c445243a ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants