ARROW-14479: [C++] Hash Join Microbenchmarks #11876
Conversation
westonpace left a comment:
This is a great benchmark. I think you've captured several interesting aspects of the hash/join performance.
That being said, this benchmark is too slow. It takes close to an hour to run. Maybe compile flags can be used to distinguish between "benchmark" (a small space of parameters to help protect against regressions) and "experiment" (a wider set of parameters to validate the hash/join implementation and useful for people looking to improve the hash/join code)?
Regarding the benchmark being too slow: after making the number of selectivity test cases smaller and removing a lot of the slow utf8 runs, it now runs in 16.64 minutes on my machine, which is more reasonable?
It would certainly be an outlier. Right now a full run of 52 benchmarks (including compilation) takes ~100 minutes, so we are averaging ~2 minutes per benchmark (some are quite a bit faster and some are quite a bit slower); I don't have the distribution for all of them. Benchmarking tools like conbench run the full suite of C++ benchmarks against every Arrow commit, so at the very least I think there would need to be a compelling case for why we can't get away with something similar here (again, it may be that we want a reduced "for-automation" set which is the default and a more complete "for-investigation" set).
I agree that 16 minutes sounds too long. I assume this is because many combinations are being tested. I'm also surprised that the benchmarks are using OpenMP; is that actually required?
For the record, our entire sorting benchmarks take 2 minutes here.
Do you have to test with all these tuples of int64 types? Or is it sufficient for performance analysis to test only {int64} and {int64,int64,int64,int64} for example?
The main use case is that we want to see whether there's any huge performance hit to having multiple fields versus one field (e.g. 2 x int64 vs fixed_size_binary(16)).
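For illustration, here is a minimal sketch (not the PR's code; the function names and registration style are invented) of how key-schema variants such as {int64}, {int64,int64}, and {fixed_size_binary(16)} could each be registered as a separate Google Benchmark case so their relative cost can be compared directly:

#include <benchmark/benchmark.h>

#include <memory>
#include <vector>

#include <arrow/api.h>

// The real body (building and probing the hash join with the given key schema)
// is omitted here; this only demonstrates registering one case per key schema.
static void BM_HashJoinKeys(benchmark::State& state,
                            std::vector<std::shared_ptr<arrow::DataType>> key_types) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(key_types.size());
  }
}

int main(int argc, char** argv) {
  using TypeVector = std::vector<std::shared_ptr<arrow::DataType>>;
  benchmark::RegisterBenchmark("HashJoin/keys={int64}", BM_HashJoinKeys,
                               TypeVector{arrow::int64()});
  benchmark::RegisterBenchmark("HashJoin/keys={int64,int64}", BM_HashJoinKeys,
                               TypeVector{arrow::int64(), arrow::int64()});
  benchmark::RegisterBenchmark("HashJoin/keys={fixed_size_binary(16)}", BM_HashJoinKeys,
                               TypeVector{arrow::fixed_size_binary(16)});
  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
  return 0;
}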
I'll admit my ignorance here, but is it useful to benchmark all join types, or are some just going to yield the same performance as another (because of symmetry)?
Yes, it is useful to benchmark them all: the build side is always on the right and the probe side is always on the left, so it's not actually symmetrical.
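To make that asymmetry concrete, here is a hypothetical list of the join types such a benchmark could sweep over (the enum is arrow::compute::JoinType; the header path and the idea of sweeping it as one benchmark axis are assumptions, not taken from the PR):

#include <vector>

#include <arrow/compute/exec/options.h>  // arrow::compute::JoinType

// Because the build side is fixed to the right input and the probe side to the
// left, the LEFT_* and RIGHT_* variants exercise different code paths, so each
// of them is worth benchmarking separately.
static const std::vector<arrow::compute::JoinType> kJoinTypes = {
    arrow::compute::JoinType::INNER,       arrow::compute::JoinType::LEFT_OUTER,
    arrow::compute::JoinType::RIGHT_OUTER, arrow::compute::JoinType::FULL_OUTER,
    arrow::compute::JoinType::LEFT_SEMI,   arrow::compute::JoinType::RIGHT_SEMI,
    arrow::compute::JoinType::LEFT_ANTI,   arrow::compute::JoinType::RIGHT_ANTI};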
Agreed that we should think more about what we're expecting from this. Does the fine-grained selection of benchmark parameters really help dive into performance issues? Or is the coverage just excessive? We could take pretty much the same fine-grained approach for many other benchmarks (I gave the sorting example above, which also invites a combinatorial explosion in benchmark numbers), but it would multiply the total time for running benchmarks by a non-trivial factor. Besides continuous benchmarking, I'll point out that interactive work with benchmarks is less pleasant and more tedious when individual benchmark suites are too long (again, that's my experience with the current sorting benchmarks, yet they're 8x faster than this).
We've run into this a few times in arrowbench (and actually already have this implemented in arrowbench for exactly this reason). I don't know of an example of this in the C++ benchmarks; it might be as simple as making it so that they don't run when …
What would be the best way to provide a light version of the benchmark? Should I create a …
@ElenaHenderson might have some ideas about benchmark filtering and how hard/easy that would be to implement. Currently, the automation runs whatever is run when we do …
I'm not opposed to adding flags, but I think it would be more widely beneficial to try and reduce the matrix of parameter values.
Yes, I didn't mean to imply that the flag / gating should be the only thing we do — we should do both (or only the parameter reduction if that gets the space small enough to just run them all) |
westonpace left a comment:
Thanks again for doing this. The new flag works great. Just a few more notes. I think you're just about there but there is a glitch in the X axis for UTF8 now.
@jonkeane we will need to make sure to set ARROW_BUILD_DETAILED_BENCHMARKS=OFF on any kind of regularly run benchmarks (conbench?).
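Assuming the ARROW_BUILD_DETAILED_BENCHMARKS option is surfaced to the benchmark source as a preprocessor definition (an assumption about the mechanism; the selectivity values below are invented for illustration), the split between a small "for-automation" sweep and a wide "for-investigation" sweep could look roughly like this:

#include <vector>

#ifdef ARROW_BUILD_DETAILED_BENCHMARKS
// Wide sweep for people investigating hash join performance in depth.
static const std::vector<double> kSelectivities = {0.01, 0.1, 0.25, 0.5, 0.75, 1.0};
#else
// Small sweep kept cheap enough for regularly run, automated benchmarking.
static const std::vector<double> kSelectivities = {0.1, 1.0};
#endif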
Suggested change:
-  DCHECK_OK(join_->InputReceived(tid, 1 /* side */, *it));
+  DCHECK_OK(join_->InputReceived(tid, /*side=*/1, *it));
For consistency, foo(/*param_name=*/0) is more common in the code base than foo(0 /* param_name */).
Why wrap this with a struct?
I originally had a few more stats, and figured it would be clearer to have them in the stats_ struct. When I switched to having only one stat, I decided to keep the struct in case we add more again in the future (e.g. if we add cache miss counting and such).
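A hypothetical illustration of that design choice (the field names are made up): a struct that holds a single counter today but leaves room for more:

#include <cstdint>

// stats_ as a struct: only one statistic for now, but further counters (e.g.
// cache misses) can be added later without changing how callers pass stats around.
struct JoinBenchmarkStats {
  int64_t num_probe_rows = 0;
  // int64_t cache_misses = 0;  // possible future addition mentioned above
};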
No particular concern from me.
Is it possible to make the default/unset option be that they don't run? Then we won't need to make a PR against conbench (and folks running the benchmarks locally won't need to remember to set it if they don't want these).
kou left a comment:
I have no concern.
@jonkeane @westonpace I did set ARROW_BUILD_DETAILED_BENCHMARKS to OFF by default.
westonpace left a comment:
@save-buffer is right, my mistake, it looks like the default is OFF. Thanks for the updates. Once CI passes I will merge.
Benchmark runs are scheduled for baseline = 2ec4e99 and contender = ab86daf. ab86daf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.


Added microbenchmarks for Hash Join (using OpenMP for parallelism for now). Also added a Python script that analyzes the benchmark output file and makes a lot of pretty graphs!
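For readers unfamiliar with the OpenMP angle mentioned above, here is a minimal self-contained sketch of driving per-batch work from an OpenMP parallel loop; ConsumeBatch is a stand-in for the kind of per-thread InputReceived call quoted in the review, not the PR's actual code:

#include <omp.h>

#include <cstdio>

// Stand-in for feeding one probe-side batch to the join from a given thread.
static void ConsumeBatch(int thread_id, int batch_index) {
  std::printf("thread %d consumed batch %d\n", thread_id, batch_index);
}

int main() {
  const int num_batches = 8;
  // OpenMP splits the batch indices across threads (compile with -fopenmp).
  #pragma omp parallel for
  for (int i = 0; i < num_batches; ++i) {
    ConsumeBatch(omp_get_thread_num(), i);
  }
  return 0;
}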