
Conversation

@wesm wesm commented Jan 21, 2021

These are only preliminary benchmarks, but they may help in examining the micro-performance overhead related to ExecBatch and its implementation (as a vector<Datum>).

It may be desirable to devise an "array reference" data structure with few or no heap-allocated data structures and no shared_ptr interactions required to obtain memory addresses and other array information.

On my test machine (macOS, i9-9880H @ 2.3 GHz), I see about 472 CPU cycles of overhead per field for each ExecBatch produced. These benchmarks take a record batch with 1M rows and 10 columns/fields and iterate through the rows in smaller ExecBatches of the indicated sizes:

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_ExecBatchIterator/256      8207877 ns      8204914 ns           81 items_per_second=121.878/s
BM_ExecBatchIterator/512      4421049 ns      4419958 ns          166 items_per_second=226.247/s
BM_ExecBatchIterator/1024     2056636 ns      2055369 ns          333 items_per_second=486.531/s
BM_ExecBatchIterator/2048     1056415 ns      1056264 ns          682 items_per_second=946.733/s
BM_ExecBatchIterator/4096      514276 ns       514136 ns         1246 items_per_second=1.94501k/s
BM_ExecBatchIterator/8192      262539 ns       262391 ns         2736 items_per_second=3.81111k/s
BM_ExecBatchIterator/16384     128995 ns       128974 ns         5398 items_per_second=7.75351k/s
BM_ExecBatchIterator/32768      64987 ns        64970 ns        10811 items_per_second=15.3917k/s

So for the /1024 case, it takes 2,055,369 ns (~2.06 ms) to iterate through all 1024 batches. That seems expensive to me; I suspect we can do better, while also improving compilation times and reducing generated code size, by using simpler data structures in our compute internals.
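For context, the measured loop is roughly of this shape. This is a sketch, not the actual benchmark source: MakeTestBatch is a hypothetical helper, and the ExecBatchIterator interface shown here (Make/Next) may differ in detail from the real code.

#include <benchmark/benchmark.h>
#include "arrow/compute/exec.h"

using arrow::Datum;
using arrow::compute::ExecBatch;

static void BM_ExecBatchIterator(benchmark::State& state) {
  // Hypothetical helper: build a batch with 1M rows and 10 primitive fields.
  ExecBatch batch = MakeTestBatch(/*length=*/1 << 20, /*num_fields=*/10);
  for (auto _ : state) {
    auto it = ExecBatchIterator::Make(batch.values, /*max_chunksize=*/state.range(0));
    ExecBatch piece;
    while (it->Next(&piece)) {
      for (const Datum& value : piece.values) {
        // Touch each array's data pointer; obtaining this address is
        // part of the per-field overhead being measured.
        benchmark::DoNotOptimize(value.array()->buffers[1]->data());
      }
    }
  }
  // items_per_second then reports benchmark iterations per second, i.e.
  // how many times per second the full 1M-row batch can be split up.
  state.SetItemsProcessed(state.iterations());
}
BENCHMARK(BM_ExecBatchIterator)->RangeMultiplier(2)->Range(256, 32768);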

Member

"items" is a bit ambiguous in this benchmark, but I would expect something else than the number of iterations. Perhaps the number of arrays yielded in the inner loop above?

Member Author

I added a comment to explain that iterations-per-second gives an easier interpretation of the input-splitting overhead: 300 iterations/second means 1/300 s ≈ 3.33 ms of input-splitting overhead for each use.

@wesm wesm force-pushed the cpp-compute-microbenchmarks branch from 12aacf4 to 8c890fb on July 28, 2021 at 00:29
@wesm wesm commented Jul 28, 2021

Some updated performance numbers (gcc 9.3, locally on x86):

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_ExecBatchIterator/256     11314787 ns     11313272 ns           62 items_per_second=88.3918/s
BM_ExecBatchIterator/512      5670423 ns      5669598 ns          123 items_per_second=176.379/s
BM_ExecBatchIterator/1024     2903937 ns      2903272 ns          242 items_per_second=344.439/s
BM_ExecBatchIterator/2048     1461982 ns      1461711 ns          481 items_per_second=684.13/s
BM_ExecBatchIterator/4096      739382 ns       739235 ns          951 items_per_second=1.35275k/s
BM_ExecBatchIterator/8192      370264 ns       370207 ns         1892 items_per_second=2.70119k/s
BM_ExecBatchIterator/16384     186622 ns       186573 ns         3755 items_per_second=5.35983k/s
BM_ExecBatchIterator/32768      93581 ns        93567 ns         7437 items_per_second=10.6876k/s

The way to read this is that breaking an ExecBatch with 32 primitive array fields into smaller ExecBatches (and then accessing a data pointer in each batch) has an overhead of approximately:

  • 2800 nanoseconds per batch
  • 88.6 nanoseconds per batch per field
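
Reading these off the /1024 row above:

  2,903,937 ns / 1024 batches ≈ 2836 ns per batch
  2836 ns / 32 fields         ≈ 88.6 ns per batch per field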

So if you wanted to break a batch with 1M elements into batches of size 1024 for finer-grained parallel processing, you would pay 2900 microseconds to do so. On this same machine, I have:

In [1]: import numpy as np

In [2]: arr = np.random.randn(1 << 20)

In [3]: timeit arr * 2                                                                                                                                                                         
395 µs ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This seems problematic if we wish to enable array expression evaluation on smaller batch sizes to keep more data in CPU caches: the splitting overhead alone is roughly 7x the cost of the full-array multiply above. I'll bring this up on the mailing list to see what people think.

@wesm wesm commented Jul 29, 2021

@cyb70289 @pitrou I think I've addressed your comments; could you take another look? Then we can canonize this benchmark to help with the ExecBatch performance revamp.

@cyb70289 cyb70289 (Contributor) left a comment

LGTM, +1

@pitrou pitrou (Member) left a comment

+1, thank you @wesm
