
Conversation

@wesm wesm commented Jan 21, 2021

These are only preliminary benchmarks, but they may help in examining the micro-performance overhead related to ExecBatch and its implementation (as a vector<Datum>).

It may be desirable to devise an "array reference" data structure with few or no heap-allocated data structures and no shared_ptr interactions required to obtain memory addresses and other array information.

On my test machine (macOS, i9-9880H @ 2.3 GHz), I see about 472 CPU cycles of overhead per field for each ExecBatch produced. These benchmarks take a record batch with 1M rows and 10 columns/fields and iterate through the rows in smaller ExecBatches of the indicated sizes:

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_ExecBatchIterator/256      8207877 ns      8204914 ns           81 items_per_second=121.878/s
BM_ExecBatchIterator/512      4421049 ns      4419958 ns          166 items_per_second=226.247/s
BM_ExecBatchIterator/1024     2056636 ns      2055369 ns          333 items_per_second=486.531/s
BM_ExecBatchIterator/2048     1056415 ns      1056264 ns          682 items_per_second=946.733/s
BM_ExecBatchIterator/4096      514276 ns       514136 ns         1246 items_per_second=1.94501k/s
BM_ExecBatchIterator/8192      262539 ns       262391 ns         2736 items_per_second=3.81111k/s
BM_ExecBatchIterator/16384     128995 ns       128974 ns         5398 items_per_second=7.75351k/s
BM_ExecBatchIterator/32768      64987 ns        64970 ns        10811 items_per_second=15.3917k/s

So for the /1024 case, it takes 2,055,369 ns (~2.06 ms) to iterate through all 1024 batches. That seems expensive to me; I suspect we can do better, while also improving compilation times and reducing generated code size, by using simpler data structures in our compute internals.
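For context, the measured loop is roughly of this shape. This is a sketch, not the actual benchmark source: MakeTestBatch is a hypothetical helper, and the ExecBatchIterator interface shown here (Make/Next) may differ in detail from the real code.

#include <benchmark/benchmark.h>
#include "arrow/compute/exec.h"

using arrow::Datum;
using arrow::compute::ExecBatch;

static void BM_ExecBatchIterator(benchmark::State& state) {
  // Hypothetical helper: build a batch with 1M rows and 10 primitive fields.
  ExecBatch batch = MakeTestBatch(/*length=*/1 << 20, /*num_fields=*/10);
  for (auto _ : state) {
    auto it = ExecBatchIterator::Make(batch.values, /*max_chunksize=*/state.range(0));
    ExecBatch piece;
    while (it->Next(&piece)) {
      for (const Datum& value : piece.values) {
        // Touch each array's data pointer; obtaining this address is
        // part of the per-field overhead being measured.
        benchmark::DoNotOptimize(value.array()->buffers[1]->data());
      }
    }
  }
  // items_per_second then reports benchmark iterations per second, i.e.
  // how many times per second the full 1M-row batch can be split up.
  state.SetItemsProcessed(state.iterations());
}
BENCHMARK(BM_ExecBatchIterator)->RangeMultiplier(2)->Range(256, 32768);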

Member

"items" is a bit ambiguous in this benchmark, but I would expect something else than the number of iterations. Perhaps the number of arrays yielded in the inner loop above?

Member Author

I added a comment to explain that iterations-per-second gives an easier interpretation of the input-splitting overhead: 300 iterations/second means 1/300 s ≈ 3.33 ms of input-splitting overhead for each use.

@wesm wesm force-pushed the cpp-compute-microbenchmarks branch from 12aacf4 to 8c890fb on July 28, 2021 at 00:29
@wesm wesm commented Jul 28, 2021

Some updated performance numbers (gcc 9.3, locally on x86):

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_ExecBatchIterator/256     11314787 ns     11313272 ns           62 items_per_second=88.3918/s
BM_ExecBatchIterator/512      5670423 ns      5669598 ns          123 items_per_second=176.379/s
BM_ExecBatchIterator/1024     2903937 ns      2903272 ns          242 items_per_second=344.439/s
BM_ExecBatchIterator/2048     1461982 ns      1461711 ns          481 items_per_second=684.13/s
BM_ExecBatchIterator/4096      739382 ns       739235 ns          951 items_per_second=1.35275k/s
BM_ExecBatchIterator/8192      370264 ns       370207 ns         1892 items_per_second=2.70119k/s
BM_ExecBatchIterator/16384     186622 ns       186573 ns         3755 items_per_second=5.35983k/s
BM_ExecBatchIterator/32768      93581 ns        93567 ns         7437 items_per_second=10.6876k/s

The way to read this is that breaking an ExecBatch with 32 primitive array fields into smaller ExecBatches (and then accessing a data pointer in each batch) has an overhead of approximately:

  • 2800 nanoseconds per batch
  • 88.6 nanoseconds per batch per field
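
Reading these off the /1024 row above:

  2,903,937 ns / 1024 batches ≈ 2836 ns per batch
  2836 ns / 32 fields         ≈ 88.6 ns per batch per field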

So if you wanted to break a batch with 1M elements into batches of size 1024 for finer-grained parallel processing, you would pay 2900 microseconds to do so. On this same machine, I have:

In [1]: import numpy as np

In [2]: arr = np.random.randn(1 << 20)

In [3]: timeit arr * 2                                                                                                                                                                         
395 µs ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This seems problematic if we wish to enable array expression evaluation on smaller batch sizes to keep more data in CPU caches: the splitting overhead alone is roughly 7x the cost of the full-array multiply above. I'll bring this up on the mailing list to see what people think.

@wesm wesm commented Jul 29, 2021

@cyb70289 @pitrou I think I've addressed your comments; could you take another look? Then we can canonize this benchmark to help with the ExecBatch performance revamp.

@cyb70289 cyb70289 (Contributor) left a comment

LGTM, +1

@pitrou pitrou (Member) left a comment

+1, thank you @wesm
