Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Jun 9, 2020

I was going to work on Filter in this PR but it got to be too much to review so I'll tackle that in a separate PR.

Summary of what's in this PR:

  • Uses BitBlockCounter to speed up processing of mostly-not-null indices vectors
  • Performance of primitive takes improved ~2-3x across the board. Builder classes are no longer used for primitive types.
  • List-based takes seem about 4x faster based on the (limited) benchmarks
  • Size of vector_take.cc.o reduced from 5.9 MB to 495 KB on -O3 build with clang-8. Compilation time is correspondingly reduced. The refactor for the new kernels framework I think had increased the code size in this module beyond what was in 0.17.x.
  • Kernels for primitive types of same size are reused
  • Signed/unsigned indices are processed using unsigned integer code paths after they've been boundschecked (so we know that the signed ints have no negative values)
  • Adds new kernel input type checking rules so that the number of registered take kernels has gone from over 300 to only 13. This means faster dispatching, too.
  • Does vectorized boundschecking
  • Take no longer uses kernels/vector_selection_internal.h, but it's still used by Filter. I plan to delete it after completing the new Filter implementation
  • Fixes to the prior benchmarking PR

Note: support for doing Take with unions has been temporarily disabled. Since there is so little code in the codebase that deals with unions at the moment I felt it would be better to address this in follow up work. Here are the currently available kernels:

[VectorKernel<(array[primitive], array[integer]) -> computed>,
 VectorKernel<(array[binary-like], array[integer]) -> computed>,
 VectorKernel<(array[large-binary-like], array[integer]) -> computed>,
 VectorKernel<(array[Type::FIXED_SIZE_BINARY], array[integer]) -> computed>,
 VectorKernel<(array[null], array[integer]) -> computed>,
 VectorKernel<(array[Type::DECIMAL], array[integer]) -> computed>,
 VectorKernel<(array[Type::DICTIONARY], array[integer]) -> computed>,
 VectorKernel<(array[Type::EXTENSION], array[integer]) -> computed>,
 VectorKernel<(array[Type::LIST], array[integer]) -> computed>,
 VectorKernel<(array[Type::LARGE_LIST], array[integer]) -> computed>,
 VectorKernel<(array[Type::FIXED_SIZE_LIST], array[integer]) -> computed>,
 VectorKernel<(array[Type::STRUCT], array[integer]) -> computed>,
 VectorKernel<(array[Type::MAP], array[integer]) -> computed>]

There's some code duplication that can be improved so I think things can be improved in follow up PRs using the benchmarks and code size metrics as a strict guideline for making changes.

@wesm wesm changed the title ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation Jun 9, 2020
@github-actions
Copy link

github-actions bot commented Jun 9, 2020

@wesm
Copy link
Member Author

wesm commented Jun 9, 2020

I'll fix the unit tests tomorrow. There's a lot of benchmarks but here are the take-int64 ones:

before

----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
TakeInt64RandomIndicesNoNulls/32768/1000          23908 ns      23908 ns      29764 null_percent=0.1 size=32.768k   1.27646GB/s
TakeInt64RandomIndicesNoNulls/32768/100           24202 ns      24202 ns      29127 null_percent=1 size=32.768k   1.26094GB/s
TakeInt64RandomIndicesNoNulls/32768/50            24959 ns      24959 ns      28736 null_percent=2 size=32.768k   1.22271GB/s
TakeInt64RandomIndicesNoNulls/32768/10            25730 ns      25730 ns      27319 null_percent=10 size=32.768k   1.18607GB/s
TakeInt64RandomIndicesNoNulls/32768/1             20299 ns      20299 ns      34665 null_percent=100 size=32.768k   1.50342GB/s
TakeInt64RandomIndicesNoNulls/32768/0             20151 ns      20152 ns      34526 null_percent=0 size=32.768k    1.5144GB/s
TakeInt64RandomIndicesNoNulls/1048576/1000       831642 ns     831646 ns        835 null_percent=0.1 size=1048.58k   1.17425GB/s
TakeInt64RandomIndicesNoNulls/1048576/100        822095 ns     822099 ns        859 null_percent=1 size=1048.58k   1.18789GB/s
TakeInt64RandomIndicesNoNulls/1048576/50         822710 ns     822720 ns        853 null_percent=2 size=1048.58k   1.18699GB/s
TakeInt64RandomIndicesNoNulls/1048576/10         897409 ns     897420 ns        785 null_percent=10 size=1048.58k   1114.31MB/s
TakeInt64RandomIndicesNoNulls/1048576/1          539675 ns     539682 ns       1311 null_percent=100 size=1048.58k   1.80951GB/s
TakeInt64RandomIndicesNoNulls/1048576/0          722993 ns     722999 ns        990 null_percent=0 size=1048.58k   1.35071GB/s
TakeInt64RandomIndicesWithNulls/32768/1000        23156 ns      23156 ns      30099 null_percent=0.1 size=32.768k    1.3179GB/s
TakeInt64RandomIndicesWithNulls/32768/100         23752 ns      23753 ns      29571 null_percent=1 size=32.768k   1.28481GB/s
TakeInt64RandomIndicesWithNulls/32768/50          23847 ns      23847 ns      29513 null_percent=2 size=32.768k    1.2797GB/s
TakeInt64RandomIndicesWithNulls/32768/10          24102 ns      24102 ns      28948 null_percent=10 size=32.768k    1.2662GB/s
TakeInt64RandomIndicesWithNulls/32768/1           17271 ns      17271 ns      40594 null_percent=100 size=32.768k   1.76697GB/s
TakeInt64RandomIndicesWithNulls/32768/0           19940 ns      19940 ns      35293 null_percent=0 size=32.768k   1.53043GB/s
TakeInt64RandomIndicesWithNulls/1048576/1000     864066 ns     864067 ns        813 null_percent=0.1 size=1048.58k   1.13019GB/s
TakeInt64RandomIndicesWithNulls/1048576/100      871374 ns     871377 ns        807 null_percent=1 size=1048.58k   1.12071GB/s
TakeInt64RandomIndicesWithNulls/1048576/50       878898 ns     878904 ns        776 null_percent=2 size=1048.58k   1.11111GB/s
TakeInt64RandomIndicesWithNulls/1048576/10       992572 ns     992581 ns        718 null_percent=10 size=1048.58k   1007.47MB/s
TakeInt64RandomIndicesWithNulls/1048576/1        451244 ns     451246 ns       1513 null_percent=100 size=1048.58k   2.16415GB/s
TakeInt64RandomIndicesWithNulls/1048576/0        713145 ns     713151 ns        983 null_percent=0 size=1048.58k   1.36936GB/s
TakeInt64MonotonicIndices/32768/1000              23928 ns      23928 ns      28710 null_percent=0.1 size=32.768k   1.27538GB/s
TakeInt64MonotonicIndices/32768/100               24378 ns      24379 ns      28777 null_percent=1 size=32.768k   1.25182GB/s
TakeInt64MonotonicIndices/32768/50                24459 ns      24459 ns      28711 null_percent=2 size=32.768k   1.24771GB/s
TakeInt64MonotonicIndices/32768/10                25479 ns      25479 ns      27342 null_percent=10 size=32.768k   1.19774GB/s
TakeInt64MonotonicIndices/32768/1                 19608 ns      19608 ns      35646 null_percent=100 size=32.768k   1.55641GB/s
TakeInt64MonotonicIndices/32768/0                 19782 ns      19782 ns      35704 null_percent=0 size=32.768k   1.54267GB/s
TakeInt64MonotonicIndices/1048576/1000           663079 ns     663081 ns       1056 null_percent=0.1 size=1048.58k   1.47276GB/s
TakeInt64MonotonicIndices/1048576/100            679547 ns     679553 ns       1044 null_percent=1 size=1048.58k   1.43707GB/s
TakeInt64MonotonicIndices/1048576/50             687064 ns     687063 ns       1026 null_percent=2 size=1048.58k   1.42136GB/s
TakeInt64MonotonicIndices/1048576/10             759665 ns     759665 ns        924 null_percent=10 size=1048.58k   1.28552GB/s
TakeInt64MonotonicIndices/1048576/1              535625 ns     535630 ns       1303 null_percent=100 size=1048.58k   1.82321GB/s
TakeInt64MonotonicIndices/1048576/0              537183 ns     537185 ns       1327 null_percent=0 size=1048.58k   1.81793GB/s

after

----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
TakeInt64RandomIndicesNoNulls/32768/1000          10195 ns      10195 ns      71377 null_percent=0.1 size=32.768k   2.99348GB/s
TakeInt64RandomIndicesNoNulls/32768/100           10456 ns      10456 ns      67748 null_percent=1 size=32.768k    2.9187GB/s
TakeInt64RandomIndicesNoNulls/32768/50            10479 ns      10479 ns      68004 null_percent=2 size=32.768k   2.91215GB/s
TakeInt64RandomIndicesNoNulls/32768/10            10730 ns      10730 ns      66188 null_percent=10 size=32.768k   2.84409GB/s
TakeInt64RandomIndicesNoNulls/32768/1              5632 ns       5632 ns     122772 null_percent=100 size=32.768k   5.41893GB/s
TakeInt64RandomIndicesNoNulls/32768/0              4726 ns       4726 ns     145938 null_percent=0 size=32.768k   6.45768GB/s
TakeInt64RandomIndicesNoNulls/1048576/1000       370236 ns     370236 ns       1893 null_percent=0.1 size=1048.58k   2.63768GB/s
TakeInt64RandomIndicesNoNulls/1048576/100        371821 ns     371818 ns       1868 null_percent=1 size=1048.58k   2.62645GB/s
TakeInt64RandomIndicesNoNulls/1048576/50         377290 ns     377286 ns       1868 null_percent=2 size=1048.58k   2.58839GB/s
TakeInt64RandomIndicesNoNulls/1048576/10         420583 ns     420581 ns       1668 null_percent=10 size=1048.58k   2.32194GB/s
TakeInt64RandomIndicesNoNulls/1048576/1          120646 ns     120646 ns       5825 null_percent=100 size=1048.58k   8.09444GB/s
TakeInt64RandomIndicesNoNulls/1048576/0          185643 ns     185642 ns       3736 null_percent=0 size=1048.58k   5.26045GB/s
TakeInt64RandomIndicesWithNulls/32768/1000        11042 ns      11042 ns      62278 null_percent=0.1 size=32.768k    2.7637GB/s
TakeInt64RandomIndicesWithNulls/32768/100         13464 ns      13464 ns      52471 null_percent=1 size=32.768k    2.2666GB/s
TakeInt64RandomIndicesWithNulls/32768/50          14991 ns      14991 ns      46472 null_percent=2 size=32.768k   2.03574GB/s
TakeInt64RandomIndicesWithNulls/32768/10          17244 ns      17244 ns      40158 null_percent=10 size=32.768k   1.76978GB/s
TakeInt64RandomIndicesWithNulls/32768/1            2366 ns       2366 ns     298987 null_percent=100 size=32.768k   12.8974GB/s
TakeInt64RandomIndicesWithNulls/32768/0            4564 ns       4564 ns     151272 null_percent=0 size=32.768k   6.68669GB/s
TakeInt64RandomIndicesWithNulls/1048576/1000     407881 ns     407877 ns       1722 null_percent=0.1 size=1048.58k   2.39426GB/s
TakeInt64RandomIndicesWithNulls/1048576/100      526618 ns     526614 ns       1326 null_percent=1 size=1048.58k   1.85442GB/s
TakeInt64RandomIndicesWithNulls/1048576/50       594920 ns     594919 ns       1191 null_percent=2 size=1048.58k    1.6415GB/s
TakeInt64RandomIndicesWithNulls/1048576/10       770036 ns     770034 ns        913 null_percent=10 size=1048.58k   1.26821GB/s
TakeInt64RandomIndicesWithNulls/1048576/1         17648 ns      17647 ns      39777 null_percent=100 size=1048.58k   55.3383GB/s
TakeInt64RandomIndicesWithNulls/1048576/0        183599 ns     183593 ns       3865 null_percent=0 size=1048.58k   5.31918GB/s
TakeInt64MonotonicIndices/32768/1000              10512 ns      10511 ns      65759 null_percent=0.1 size=32.768k   2.90332GB/s
TakeInt64MonotonicIndices/32768/100               10702 ns      10702 ns      64739 null_percent=1 size=32.768k   2.85162GB/s
TakeInt64MonotonicIndices/32768/50                10795 ns      10795 ns      64986 null_percent=2 size=32.768k   2.82712GB/s
TakeInt64MonotonicIndices/32768/10                10815 ns      10815 ns      64405 null_percent=10 size=32.768k   2.82182GB/s
TakeInt64MonotonicIndices/32768/1                  6550 ns       6549 ns     106737 null_percent=100 size=32.768k    4.6596GB/s
TakeInt64MonotonicIndices/32768/0                  5530 ns       5530 ns     129765 null_percent=0 size=32.768k   5.51821GB/s
TakeInt64MonotonicIndices/1048576/1000           269654 ns     269649 ns       2594 null_percent=0.1 size=1048.58k   3.62161GB/s
TakeInt64MonotonicIndices/1048576/100            280083 ns     280083 ns       2486 null_percent=1 size=1048.58k   3.48669GB/s
TakeInt64MonotonicIndices/1048576/50             291210 ns     291209 ns       2408 null_percent=2 size=1048.58k   3.35348GB/s
TakeInt64MonotonicIndices/1048576/10             364809 ns     364795 ns       1928 null_percent=10 size=1048.58k   2.67701GB/s
TakeInt64MonotonicIndices/1048576/1              151216 ns     151212 ns       4661 null_percent=100 size=1048.58k   6.45823GB/s
TakeInt64MonotonicIndices/1048576/0              112504 ns     112503 ns       6263 null_percent=0 size=1048.58k   8.68029GB/s

A branch to use for comparison benchmarks is https://github.com/wesm/arrow/tree/ARROW-5760-comparison since I had to make some benchmark related fixes

@wesm
Copy link
Member Author

wesm commented Jun 9, 2020

I'm also moving the index boundschecking to util/int_util.h, fixing bugs in it and adding unit tests

@wesm
Copy link
Member Author

wesm commented Jun 9, 2020

I've got a lot of changes locally so recommend holding off on reviewing for a bit until the CI is looking more green

@wesm
Copy link
Member Author

wesm commented Jun 9, 2020

OK, I have added more unit tests for the boundchecking code and added more rigorous unit testing for string/fixed-size-binary types in vector_take_test.cc. I'm going to get the test suite passing again but this can be reviewed

Latest benchmark run:

@wesm wesm requested a review from pitrou June 9, 2020 16:48
@wesm
Copy link
Member Author

wesm commented Jun 9, 2020

I added a bunch more unit tests, I think I'm done with this aside from CI fixes and will move on to working on Filter

@wesm
Copy link
Member Author

wesm commented Jun 9, 2020

Appveyor is failing because of b058cf0

I opened https://issues.apache.org/jira/browse/ARROW-9085

@emkornfield
Copy link
Contributor

noticed a few small nits. Somebody more familiar with Take should review this.

@wesm
Copy link
Member Author

wesm commented Jun 10, 2020

I noticed some bugs while reviewing, I'll push some changes here in a few minutes

@wesm wesm force-pushed the ARROW-5760 branch 2 times, most recently from 1bb6759 to 4de6b70 Compare June 10, 2020 14:46
@wesm
Copy link
Member Author

wesm commented Jun 10, 2020

Done. Please review

@wesm
Copy link
Member Author

wesm commented Jun 10, 2020

I am preparing another large patch that uses this branch as a base so this patch will need to be reviewed and merged before the next patch (providing a streamlined Filter implementation) can be reviewed. CI is passing -- the only failure is an out-of-space error on the 390x box

@pitrou
Copy link
Member

pitrou commented Jun 11, 2020

I can confirm a performance increase here (2x to 5x faster) as well as a decrease in code size (libarrow.so has 15% less code!).

@pitrou
Copy link
Member

pitrou commented Jun 11, 2020

I'm going to push changes on this PR, I think. Please hold on :-).

@wesm
Copy link
Member Author

wesm commented Jun 11, 2020

Please go ahead

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me. This is a very nice improvement.

@pitrou
Copy link
Member

pitrou commented Jun 11, 2020

@wesm
Copy link
Member Author

wesm commented Jun 11, 2020

@pitrou thanks for the fixes + improvements!

@wesm
Copy link
Member Author

wesm commented Jun 11, 2020

Here's an appveyor build https://ci.appveyor.com/project/wesm/arrow/builds/33463261. Will merge this shortly

@pitrou pitrou closed this in c07486c Jun 11, 2020
@wesm
Copy link
Member Author

wesm commented Jun 11, 2020

BTW I ran benchmarks for this with MSVC 2017 with and without mimalloc and mimalloc has a pretty big impact, we should definitely endeavor to ship mimalloc in all of our Windows binaries in the next release

https://gist.github.com/wesm/5b0841e583a27a8dc71f951b30475015/revisions?diff=split

@wesm wesm deleted the ARROW-5760 branch June 11, 2020 18:13
@nealrichardson
Copy link
Member

Does mimalloc have the same special build constraints like jemalloc such that we have to build it bundled? (Also, are we going to get that jemalloc static lib symbol gluing done for 1.0?)

@wesm
Copy link
Member Author

wesm commented Jun 11, 2020

Same issue as jemalloc I think. I really hope that the static lib issue will be addressed for 1.0. I can try to do it if no one else can work on it but I see that as a last resort as there are other high value things I'm trying to get done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants