ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation #7382

wesm · 2020-06-09T04:37:33Z

I was going to work on Filter in this PR but it got to be too much to review so I'll tackle that in a separate PR.

Summary of what's in this PR:

Uses BitBlockCounter to speed up processing of mostly-not-null indices vectors
Performance of primitive takes improved ~2-3x across the board. Builder classes are no longer used for primitive types.
List-based takes seem about 4x faster based on the (limited) benchmarks
Size of vector_take.cc.o reduced from 5.9 MB to 495 KB on -O3 build with clang-8. Compilation time is correspondingly reduced. The refactor for the new kernels framework I think had increased the code size in this module beyond what was in 0.17.x.
Kernels for primitive types of same size are reused
Signed/unsigned indices are processed using unsigned integer code paths after they've been boundschecked (so we know that the signed ints have no negative values)
Adds new kernel input type checking rules so that the number of registered take kernels has gone from over 300 to only 13. This means faster dispatching, too.
Does vectorized boundschecking
Take no longer uses kernels/vector_selection_internal.h, but it's still used by Filter. I plan to delete it after completing the new Filter implementation
Fixes to the prior benchmarking PR

Note: support for doing Take with unions has been temporarily disabled. Since there is so little code in the codebase that deals with unions at the moment I felt it would be better to address this in follow up work. Here are the currently available kernels:

[VectorKernel<(array[primitive], array[integer]) -> computed>,
 VectorKernel<(array[binary-like], array[integer]) -> computed>,
 VectorKernel<(array[large-binary-like], array[integer]) -> computed>,
 VectorKernel<(array[Type::FIXED_SIZE_BINARY], array[integer]) -> computed>,
 VectorKernel<(array[null], array[integer]) -> computed>,
 VectorKernel<(array[Type::DECIMAL], array[integer]) -> computed>,
 VectorKernel<(array[Type::DICTIONARY], array[integer]) -> computed>,
 VectorKernel<(array[Type::EXTENSION], array[integer]) -> computed>,
 VectorKernel<(array[Type::LIST], array[integer]) -> computed>,
 VectorKernel<(array[Type::LARGE_LIST], array[integer]) -> computed>,
 VectorKernel<(array[Type::FIXED_SIZE_LIST], array[integer]) -> computed>,
 VectorKernel<(array[Type::STRUCT], array[integer]) -> computed>,
 VectorKernel<(array[Type::MAP], array[integer]) -> computed>]

There's some code duplication that can be improved so I think things can be improved in follow up PRs using the benchmarks and code size metrics as a strict guideline for making changes.

cpp/src/arrow/compute/kernels/vector_take.cc

cpp/src/arrow/util/bit_block_counter.h

github-actions · 2020-06-09T04:46:49Z

https://issues.apache.org/jira/browse/ARROW-5760

wesm · 2020-06-09T04:55:31Z

I'll fix the unit tests tomorrow. There's a lot of benchmarks but here are the take-int64 ones:

before

----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
TakeInt64RandomIndicesNoNulls/32768/1000          23908 ns      23908 ns      29764 null_percent=0.1 size=32.768k   1.27646GB/s
TakeInt64RandomIndicesNoNulls/32768/100           24202 ns      24202 ns      29127 null_percent=1 size=32.768k   1.26094GB/s
TakeInt64RandomIndicesNoNulls/32768/50            24959 ns      24959 ns      28736 null_percent=2 size=32.768k   1.22271GB/s
TakeInt64RandomIndicesNoNulls/32768/10            25730 ns      25730 ns      27319 null_percent=10 size=32.768k   1.18607GB/s
TakeInt64RandomIndicesNoNulls/32768/1             20299 ns      20299 ns      34665 null_percent=100 size=32.768k   1.50342GB/s
TakeInt64RandomIndicesNoNulls/32768/0             20151 ns      20152 ns      34526 null_percent=0 size=32.768k    1.5144GB/s
TakeInt64RandomIndicesNoNulls/1048576/1000       831642 ns     831646 ns        835 null_percent=0.1 size=1048.58k   1.17425GB/s
TakeInt64RandomIndicesNoNulls/1048576/100        822095 ns     822099 ns        859 null_percent=1 size=1048.58k   1.18789GB/s
TakeInt64RandomIndicesNoNulls/1048576/50         822710 ns     822720 ns        853 null_percent=2 size=1048.58k   1.18699GB/s
TakeInt64RandomIndicesNoNulls/1048576/10         897409 ns     897420 ns        785 null_percent=10 size=1048.58k   1114.31MB/s
TakeInt64RandomIndicesNoNulls/1048576/1          539675 ns     539682 ns       1311 null_percent=100 size=1048.58k   1.80951GB/s
TakeInt64RandomIndicesNoNulls/1048576/0          722993 ns     722999 ns        990 null_percent=0 size=1048.58k   1.35071GB/s
TakeInt64RandomIndicesWithNulls/32768/1000        23156 ns      23156 ns      30099 null_percent=0.1 size=32.768k    1.3179GB/s
TakeInt64RandomIndicesWithNulls/32768/100         23752 ns      23753 ns      29571 null_percent=1 size=32.768k   1.28481GB/s
TakeInt64RandomIndicesWithNulls/32768/50          23847 ns      23847 ns      29513 null_percent=2 size=32.768k    1.2797GB/s
TakeInt64RandomIndicesWithNulls/32768/10          24102 ns      24102 ns      28948 null_percent=10 size=32.768k    1.2662GB/s
TakeInt64RandomIndicesWithNulls/32768/1           17271 ns      17271 ns      40594 null_percent=100 size=32.768k   1.76697GB/s
TakeInt64RandomIndicesWithNulls/32768/0           19940 ns      19940 ns      35293 null_percent=0 size=32.768k   1.53043GB/s
TakeInt64RandomIndicesWithNulls/1048576/1000     864066 ns     864067 ns        813 null_percent=0.1 size=1048.58k   1.13019GB/s
TakeInt64RandomIndicesWithNulls/1048576/100      871374 ns     871377 ns        807 null_percent=1 size=1048.58k   1.12071GB/s
TakeInt64RandomIndicesWithNulls/1048576/50       878898 ns     878904 ns        776 null_percent=2 size=1048.58k   1.11111GB/s
TakeInt64RandomIndicesWithNulls/1048576/10       992572 ns     992581 ns        718 null_percent=10 size=1048.58k   1007.47MB/s
TakeInt64RandomIndicesWithNulls/1048576/1        451244 ns     451246 ns       1513 null_percent=100 size=1048.58k   2.16415GB/s
TakeInt64RandomIndicesWithNulls/1048576/0        713145 ns     713151 ns        983 null_percent=0 size=1048.58k   1.36936GB/s
TakeInt64MonotonicIndices/32768/1000              23928 ns      23928 ns      28710 null_percent=0.1 size=32.768k   1.27538GB/s
TakeInt64MonotonicIndices/32768/100               24378 ns      24379 ns      28777 null_percent=1 size=32.768k   1.25182GB/s
TakeInt64MonotonicIndices/32768/50                24459 ns      24459 ns      28711 null_percent=2 size=32.768k   1.24771GB/s
TakeInt64MonotonicIndices/32768/10                25479 ns      25479 ns      27342 null_percent=10 size=32.768k   1.19774GB/s
TakeInt64MonotonicIndices/32768/1                 19608 ns      19608 ns      35646 null_percent=100 size=32.768k   1.55641GB/s
TakeInt64MonotonicIndices/32768/0                 19782 ns      19782 ns      35704 null_percent=0 size=32.768k   1.54267GB/s
TakeInt64MonotonicIndices/1048576/1000           663079 ns     663081 ns       1056 null_percent=0.1 size=1048.58k   1.47276GB/s
TakeInt64MonotonicIndices/1048576/100            679547 ns     679553 ns       1044 null_percent=1 size=1048.58k   1.43707GB/s
TakeInt64MonotonicIndices/1048576/50             687064 ns     687063 ns       1026 null_percent=2 size=1048.58k   1.42136GB/s
TakeInt64MonotonicIndices/1048576/10             759665 ns     759665 ns        924 null_percent=10 size=1048.58k   1.28552GB/s
TakeInt64MonotonicIndices/1048576/1              535625 ns     535630 ns       1303 null_percent=100 size=1048.58k   1.82321GB/s
TakeInt64MonotonicIndices/1048576/0              537183 ns     537185 ns       1327 null_percent=0 size=1048.58k   1.81793GB/s

after

----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
TakeInt64RandomIndicesNoNulls/32768/1000          10195 ns      10195 ns      71377 null_percent=0.1 size=32.768k   2.99348GB/s
TakeInt64RandomIndicesNoNulls/32768/100           10456 ns      10456 ns      67748 null_percent=1 size=32.768k    2.9187GB/s
TakeInt64RandomIndicesNoNulls/32768/50            10479 ns      10479 ns      68004 null_percent=2 size=32.768k   2.91215GB/s
TakeInt64RandomIndicesNoNulls/32768/10            10730 ns      10730 ns      66188 null_percent=10 size=32.768k   2.84409GB/s
TakeInt64RandomIndicesNoNulls/32768/1              5632 ns       5632 ns     122772 null_percent=100 size=32.768k   5.41893GB/s
TakeInt64RandomIndicesNoNulls/32768/0              4726 ns       4726 ns     145938 null_percent=0 size=32.768k   6.45768GB/s
TakeInt64RandomIndicesNoNulls/1048576/1000       370236 ns     370236 ns       1893 null_percent=0.1 size=1048.58k   2.63768GB/s
TakeInt64RandomIndicesNoNulls/1048576/100        371821 ns     371818 ns       1868 null_percent=1 size=1048.58k   2.62645GB/s
TakeInt64RandomIndicesNoNulls/1048576/50         377290 ns     377286 ns       1868 null_percent=2 size=1048.58k   2.58839GB/s
TakeInt64RandomIndicesNoNulls/1048576/10         420583 ns     420581 ns       1668 null_percent=10 size=1048.58k   2.32194GB/s
TakeInt64RandomIndicesNoNulls/1048576/1          120646 ns     120646 ns       5825 null_percent=100 size=1048.58k   8.09444GB/s
TakeInt64RandomIndicesNoNulls/1048576/0          185643 ns     185642 ns       3736 null_percent=0 size=1048.58k   5.26045GB/s
TakeInt64RandomIndicesWithNulls/32768/1000        11042 ns      11042 ns      62278 null_percent=0.1 size=32.768k    2.7637GB/s
TakeInt64RandomIndicesWithNulls/32768/100         13464 ns      13464 ns      52471 null_percent=1 size=32.768k    2.2666GB/s
TakeInt64RandomIndicesWithNulls/32768/50          14991 ns      14991 ns      46472 null_percent=2 size=32.768k   2.03574GB/s
TakeInt64RandomIndicesWithNulls/32768/10          17244 ns      17244 ns      40158 null_percent=10 size=32.768k   1.76978GB/s
TakeInt64RandomIndicesWithNulls/32768/1            2366 ns       2366 ns     298987 null_percent=100 size=32.768k   12.8974GB/s
TakeInt64RandomIndicesWithNulls/32768/0            4564 ns       4564 ns     151272 null_percent=0 size=32.768k   6.68669GB/s
TakeInt64RandomIndicesWithNulls/1048576/1000     407881 ns     407877 ns       1722 null_percent=0.1 size=1048.58k   2.39426GB/s
TakeInt64RandomIndicesWithNulls/1048576/100      526618 ns     526614 ns       1326 null_percent=1 size=1048.58k   1.85442GB/s
TakeInt64RandomIndicesWithNulls/1048576/50       594920 ns     594919 ns       1191 null_percent=2 size=1048.58k    1.6415GB/s
TakeInt64RandomIndicesWithNulls/1048576/10       770036 ns     770034 ns        913 null_percent=10 size=1048.58k   1.26821GB/s
TakeInt64RandomIndicesWithNulls/1048576/1         17648 ns      17647 ns      39777 null_percent=100 size=1048.58k   55.3383GB/s
TakeInt64RandomIndicesWithNulls/1048576/0        183599 ns     183593 ns       3865 null_percent=0 size=1048.58k   5.31918GB/s
TakeInt64MonotonicIndices/32768/1000              10512 ns      10511 ns      65759 null_percent=0.1 size=32.768k   2.90332GB/s
TakeInt64MonotonicIndices/32768/100               10702 ns      10702 ns      64739 null_percent=1 size=32.768k   2.85162GB/s
TakeInt64MonotonicIndices/32768/50                10795 ns      10795 ns      64986 null_percent=2 size=32.768k   2.82712GB/s
TakeInt64MonotonicIndices/32768/10                10815 ns      10815 ns      64405 null_percent=10 size=32.768k   2.82182GB/s
TakeInt64MonotonicIndices/32768/1                  6550 ns       6549 ns     106737 null_percent=100 size=32.768k    4.6596GB/s
TakeInt64MonotonicIndices/32768/0                  5530 ns       5530 ns     129765 null_percent=0 size=32.768k   5.51821GB/s
TakeInt64MonotonicIndices/1048576/1000           269654 ns     269649 ns       2594 null_percent=0.1 size=1048.58k   3.62161GB/s
TakeInt64MonotonicIndices/1048576/100            280083 ns     280083 ns       2486 null_percent=1 size=1048.58k   3.48669GB/s
TakeInt64MonotonicIndices/1048576/50             291210 ns     291209 ns       2408 null_percent=2 size=1048.58k   3.35348GB/s
TakeInt64MonotonicIndices/1048576/10             364809 ns     364795 ns       1928 null_percent=10 size=1048.58k   2.67701GB/s
TakeInt64MonotonicIndices/1048576/1              151216 ns     151212 ns       4661 null_percent=100 size=1048.58k   6.45823GB/s
TakeInt64MonotonicIndices/1048576/0              112504 ns     112503 ns       6263 null_percent=0 size=1048.58k   8.68029GB/s

A branch to use for comparison benchmarks is https://github.com/wesm/arrow/tree/ARROW-5760-comparison since I had to make some benchmark related fixes

cpp/src/arrow/compute/benchmark_util.h

wesm · 2020-06-09T14:18:47Z

I'm also moving the index boundschecking to util/int_util.h, fixing bugs in it and adding unit tests

wesm · 2020-06-09T15:54:59Z

I've got a lot of changes locally so recommend holding off on reviewing for a bit until the CI is looking more green

wesm · 2020-06-09T16:48:05Z

OK, I have added more unit tests for the boundchecking code and added more rigorous unit testing for string/fixed-size-binary types in vector_take_test.cc. I'm going to get the test suite passing again but this can be reviewed

Latest benchmark run:

Old code: https://gist.github.com/wesm/85ed9b59a24f2f63c9eb3b2bb12bf364
New code: https://gist.github.com/wesm/1a83bc6bb3cb8c1b87943cd49a0ac3ff

cpp/src/arrow/compute/kernels/vector_take.cc

wesm · 2020-06-09T22:21:10Z

I added a bunch more unit tests, I think I'm done with this aside from CI fixes and will move on to working on Filter

wesm · 2020-06-09T23:20:23Z

Appveyor is failing because of b058cf0

I opened https://issues.apache.org/jira/browse/ARROW-9085

cpp/src/arrow/util/int_util_test.cc

cpp/src/arrow/util/bit_block_counter_test.cc

emkornfield · 2020-06-10T04:56:52Z

noticed a few small nits. Somebody more familiar with Take should review this.

wesm · 2020-06-10T14:15:20Z

I noticed some bugs while reviewing, I'll push some changes here in a few minutes

wesm · 2020-06-10T14:46:12Z

Done. Please review

wesm · 2020-06-10T21:48:55Z

I am preparing another large patch that uses this branch as a base so this patch will need to be reviewed and merged before the next patch (providing a streamlined Filter implementation) can be reviewed. CI is passing -- the only failure is an out-of-space error on the 390x box

… fixed size binary, large string

…rks in items/s rather than bytes for comparability across data types

pitrou · 2020-06-11T10:36:44Z

I can confirm a performance increase here (2x to 5x faster) as well as a decrease in code size (libarrow.so has 15% less code!).

cpp/src/arrow/compute/kernels/vector_take_test.cc

cpp/src/arrow/util/bit_block_counter.h

pitrou · 2020-06-11T14:19:25Z

I'm going to push changes on this PR, I think. Please hold on :-).

wesm · 2020-06-11T14:21:13Z

Please go ahead

…chmark size

pitrou

+1 from me. This is a very nice improvement.

pitrou · 2020-06-11T15:28:26Z

Travis-CI: https://travis-ci.org/github/wesm/arrow/builds/697263135

wesm · 2020-06-11T15:36:04Z

@pitrou thanks for the fixes + improvements!

wesm · 2020-06-11T15:36:50Z

Here's an appveyor build https://ci.appveyor.com/project/wesm/arrow/builds/33463261. Will merge this shortly

wesm · 2020-06-11T18:13:29Z

BTW I ran benchmarks for this with MSVC 2017 with and without mimalloc and mimalloc has a pretty big impact, we should definitely endeavor to ship mimalloc in all of our Windows binaries in the next release

https://gist.github.com/wesm/5b0841e583a27a8dc71f951b30475015/revisions?diff=split

nealrichardson · 2020-06-11T18:21:22Z

Does mimalloc have the same special build constraints like jemalloc such that we have to build it bundled? (Also, are we going to get that jemalloc static lib symbol gluing done for 1.0?)

wesm · 2020-06-11T18:28:56Z

Same issue as jemalloc I think. I really hope that the static lib issue will be addressed for 1.0. I can try to do it if no one else can work on it but I see that as a last resort as there are other high value things I'm trying to get done.

cpp/src/arrow/compute/kernels/vector_take.cc

wesm changed the title ~~ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size~~ ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation Jun 9, 2020

wesm commented Jun 9, 2020

View reviewed changes

cpp/src/arrow/compute/kernels/vector_take.cc Outdated Show resolved Hide resolved

wesm commented Jun 9, 2020

View reviewed changes

cpp/src/arrow/util/bit_block_counter.h Outdated Show resolved Hide resolved

frankdjx reviewed Jun 9, 2020

View reviewed changes

cpp/src/arrow/compute/benchmark_util.h Outdated Show resolved Hide resolved

wesm force-pushed the ARROW-5760 branch from d01d00f to f017ba3 Compare June 9, 2020 16:42

wesm requested a review from pitrou June 9, 2020 16:48

wesm commented Jun 9, 2020

View reviewed changes

cpp/src/arrow/compute/kernels/vector_take.cc Show resolved Hide resolved

wesm force-pushed the ARROW-5760 branch from 72a7d99 to ac7072d Compare June 9, 2020 17:06

wesm mentioned this pull request Jun 9, 2020

ARROW-8866: [C++] Split UNION into SPARSE_UNION and DENSE_UNION #7378

Closed

wesm force-pushed the ARROW-5760 branch from ac7072d to 088031c Compare June 9, 2020 22:20

emkornfield reviewed Jun 10, 2020

View reviewed changes

cpp/src/arrow/util/int_util_test.cc Outdated Show resolved Hide resolved

emkornfield reviewed Jun 10, 2020

View reviewed changes

cpp/src/arrow/util/bit_block_counter_test.cc Show resolved Hide resolved

wesm force-pushed the ARROW-5760 branch 2 times, most recently from 1bb6759 to 4de6b70 Compare June 10, 2020 14:46

wesm added 5 commits June 11, 2020 12:32

New Take implementation and support infrastructure

cbd492d

Add 50% / no nulls cases to defaults

8282d38

Add more comprehensive random data tests, including for boolean type,…

fee3671

… fixed size binary, large string

Some bug fixes and cleaning, address initial feedback. Report benchma…

a1207c3

…rks in items/s rather than bytes for comparability across data types

Add additional test where indices do not have nulls for better coverage

8af5d3e

Less benchmarks

e54f193

pitrou force-pushed the ARROW-5760 branch from 2ee7f38 to e54f193 Compare June 11, 2020 10:36

pitrou reviewed Jun 11, 2020

View reviewed changes

cpp/src/arrow/compute/kernels/vector_take_test.cc Outdated Show resolved Hide resolved

cpp/src/arrow/compute/kernels/vector_take_test.cc Show resolved Hide resolved

cpp/src/arrow/util/bit_block_counter.h Outdated Show resolved Hide resolved

pitrou added 2 commits June 11, 2020 16:38

Some fixes

b5b3eea

Presize binary builder with a heuristic, and use more interesting ben…

eaf6e1f

…chmark size

pitrou approved these changes Jun 11, 2020

View reviewed changes

pitrou closed this in c07486c Jun 11, 2020

wesm deleted the ARROW-5760 branch June 11, 2020 18:13

bkietz reviewed Jun 11, 2020

View reviewed changes

cpp/src/arrow/compute/kernels/vector_take.cc Show resolved Hide resolved

asfimport mentioned this pull request Jun 11, 2020

[C++] Optimize Take implementation #22185

Closed

ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation #7382

ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation #7382

Uh oh!

Conversation

wesm commented Jun 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Jun 9, 2020

Uh oh!

wesm commented Jun 9, 2020

Uh oh!

Uh oh!

wesm commented Jun 9, 2020

Uh oh!

wesm commented Jun 9, 2020

Uh oh!

wesm commented Jun 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

wesm commented Jun 9, 2020

Uh oh!

wesm commented Jun 9, 2020

Uh oh!

Uh oh!

Uh oh!

emkornfield commented Jun 10, 2020

Uh oh!

wesm commented Jun 10, 2020

Uh oh!

wesm commented Jun 10, 2020

Uh oh!

wesm commented Jun 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pitrou commented Jun 11, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pitrou commented Jun 11, 2020

Uh oh!

wesm commented Jun 11, 2020

Uh oh!

pitrou left a comment

Choose a reason for hiding this comment

Uh oh!

pitrou commented Jun 11, 2020

Uh oh!

wesm commented Jun 11, 2020

Uh oh!

wesm commented Jun 11, 2020

Uh oh!

wesm commented Jun 11, 2020

Uh oh!

nealrichardson commented Jun 11, 2020

Uh oh!

wesm commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

wesm commented Jun 9, 2020 •

edited

Loading

wesm commented Jun 9, 2020 •

edited

Loading

wesm commented Jun 10, 2020 •

edited

Loading

wesm commented Jun 11, 2020 •

edited

Loading