ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead #9280
Conversation
"items" is a bit ambiguous in this benchmark; I would expect something other than the number of iterations. Perhaps the number of arrays yielded in the inner loop above?
I added a comment explaining that iterations-per-second gives an easier interpretation of the input-splitting overhead (so 300 iterations/second would mean 3.33 ms of input-splitting overhead for each use).
Some updated performance numbers (gcc 9.3 locally on x86). The way to read this: breaking a batch with 1M elements into batches of size 1024 for finer-grained parallel processing costs about 2900 microseconds. On this same machine, I have:

This seems problematic if we wish to enable array expression evaluation on smaller batch sizes to keep more data in CPU caches. I'll bring this up on the mailing list to see what people think.
cyb70289
left a comment
LGTM, +1
pitrou
left a comment
+1, thank you @wesm
These are only preliminary benchmarks, but they may help in examining microperformance overhead related to ExecBatch and its implementation (as a vector<Datum>). It may be desirable to devise an "array reference" data structure with few or no heap-allocated data structures and no shared_ptr interactions required to obtain memory addresses and other array information.

On my test machine (macOS i9-9880H 2.3 GHz), I see about 472 CPU cycles of overhead per field for each ExecBatch produced. These benchmarks take a record batch with 1M rows and 10 columns/fields and iterate through the rows in smaller ExecBatches of the indicated sizes.
So for the 1024 case, it takes 2,055,369 ns to iterate through all 1024 batches. That seems a bit expensive to me (?) — I suspect we can do better while also improving compilation times and reducing generated code size by using simpler data structures in our compute internals.