
Conversation

@save-buffer (Contributor)

Added microbenchmarks for Hash Join (using OpenMP for parallelism for now). Also added a Python script that analyzes the benchmark file and makes a lot of pretty graphs!

westonpace self-requested a review on December 7, 2021 21:25
save-buffer force-pushed the sasha_benchmark branch 2 times, most recently from bdaae1c to 224a047 on December 9, 2021 06:04
@westonpace (Member) left a comment

This is a great benchmark. I think you've captured several interesting aspects of the hash/join performance.

That being said, this benchmark is too slow; it takes close to an hour to run. Maybe compile flags could be used to distinguish between "benchmark" (a small space of parameters to help protect against regressions) and "experiment" (a wider set of parameters that validates the hash/join implementation and is useful to people looking to improve the hash/join code)?
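
For illustration, a minimal sketch of such a compile-time gate (the macro, function, and parameter names are hypothetical, not the PR's actual code):

```cpp
// Hypothetical sketch: compile the wide "experiment" sweep only when a
// dedicated flag is defined; otherwise register a small regression-guard set.
#include <benchmark/benchmark.h>

static void BM_HashJoin(benchmark::State& state) {
  const auto selectivity_pct = state.range(0);
  for (auto _ : state) {
    benchmark::DoNotOptimize(selectivity_pct);  // stand-in for one join run
  }
}

#ifdef HASH_JOIN_EXPERIMENT  // hypothetical flag, e.g. set via the build system
// "experiment": wide parameter sweep for people investigating the join code
BENCHMARK(BM_HashJoin)->DenseRange(1, 99, 7);
#else
// "benchmark": a couple of points that protect against regressions
BENCHMARK(BM_HashJoin)->Arg(10)->Arg(90);
#endif

BENCHMARK_MAIN();
```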

@save-buffer (Contributor, Author)

Regarding the benchmark being too slow: after making the number of selectivity test cases smaller and removing a lot of the slow utf8 runs, it now runs in 16.64 minutes on my machine. Is that more reasonable?

@westonpace (Member)

> Regarding the benchmark being too slow: after making the number of selectivity test cases smaller and removing a lot of the slow utf8 runs, it now runs in 16.64 minutes on my machine. Is that more reasonable?

It would certainly be an outlier. Right now a full run of 52 benchmarks (including compilation) takes ~100 minutes, so we are averaging ~2 minutes per benchmark (some are quite a bit faster and some are quite a bit slower). I don't have the distribution for all of them.

Benchmarking tools like conbench run the full suite of C++ benchmarks against every Arrow commit. So, at the very least, I think there would need to be a compelling case for why we can't get away with something similar here (again, it may be we want a reduced "for-automation" set which is the default and a more complete "for-investigation" set).

CC @jonkeane @pitrou for a second opinion

@pitrou (Member)

pitrou commented Dec 15, 2021

I agree that 16 minutes sounds too long. I assume this is because many combinations are being tested. I'm also surprised that the benchmarks are using OpenMP; is that actually required?
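
For reference, this is the kind of OpenMP construct in question (a hypothetical shape, not the PR's actual driver code): input batches are fanned out across threads with a parallel-for.

```cpp
// Hypothetical sketch of an OpenMP-driven benchmark loop (compile with -fopenmp).
#include <omp.h>

#include <cstdint>
#include <vector>

void PushBatchesInParallel(int64_t num_batches) {
  std::vector<int64_t> batches_per_thread(omp_get_max_threads(), 0);
#pragma omp parallel for
  for (int64_t i = 0; i < num_batches; ++i) {
    // ... hand batch i to the join node from thread omp_get_thread_num() ...
    batches_per_thread[omp_get_thread_num()]++;  // each thread writes its own slot
  }
}
```

The same fan-out could presumably be expressed with Arrow's own thread pool instead, which is what the question is driving at.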

@pitrou (Member)

pitrou commented Dec 15, 2021

For the record, our entire set of sorting benchmarks takes 2 minutes here.

(Member)

Do you have to test with all these tuples of int64 types? Or is it sufficient for performance analysis to test only {int64} and {int64,int64,int64,int64} for example?

(Contributor, Author)

The main use case is that we want to see whether there's any huge performance hit to having multiple fields versus one field (i.e. 2 x int64 vs fixed_size_binary(16)).
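
To make the comparison concrete, a hedged Arrow C++ sketch of the two key layouts (both occupy 16 bytes per row, so any gap between them isolates the cost of multi-column key handling):

```cpp
// Two schemas with the same key width: two int64 columns vs. one 16-byte
// fixed-size binary column.
#include <arrow/api.h>

auto two_int64_key = arrow::schema(
    {arrow::field("k0", arrow::int64()), arrow::field("k1", arrow::int64())});
auto one_binary16_key =
    arrow::schema({arrow::field("k", arrow::fixed_size_binary(16))});
```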

(Member)

I'll admit my ignorance here, but is it useful to benchmark all join types, or are some just going to yield the same performance as another (because of symmetry)?

(Contributor, Author)

Yes, because the build side is always on the right and the probe side is always on the left, so it's not actually symmetrical.
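
For readers unfamiliar with the asymmetry, here is an illustrative toy hash join (not Arrow's implementation): the right side is hashed into a table and the left side is streamed past it, so swapping the inputs changes which relation pays the build cost.

```cpp
// Toy inner hash join: build on the right input, probe with the left.
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<int64_t, int64_t>> ToyInnerHashJoin(
    const std::vector<int64_t>& probe, const std::vector<int64_t>& build) {
  // Build phase: hash every row of the right (build) side.
  std::unordered_multimap<int64_t, int64_t> table;
  for (int64_t i = 0; i < static_cast<int64_t>(build.size()); ++i) {
    table.emplace(build[i], i);
  }
  // Probe phase: stream the left (probe) side through the table.
  std::vector<std::pair<int64_t, int64_t>> matches;  // (probe row, build row)
  for (int64_t i = 0; i < static_cast<int64_t>(probe.size()); ++i) {
    auto range = table.equal_range(probe[i]);
    for (auto it = range.first; it != range.second; ++it) {
      matches.emplace_back(i, it->second);
    }
  }
  return matches;
}
```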

@pitrou (Member)

pitrou commented Dec 15, 2021

> it may be we want a reduced "for-automation" set which is the default and a more complete "for-investigation" set

Agreed that we should think more about what we're expecting from this. Does the fine-grained selection of benchmark parameters really help dive into performance issues? Or is the coverage just excessive?

We could have pretty much the same fine-grained approach for many other benchmarks (I gave the sorting example above, which definitely encourages a combinatorial explosion in benchmark numbers as well), but it would multiply the total time for running benchmarks by a non-trivial factor.

Besides continuous benchmarking, I'll point out that interactive work with benchmarks is less pleasant and more tedious when individual benchmark suites are too long (again that's my experience with the current sorting benchmarks, yet they're 8x faster than this).

@jonkeane (Member)

> again, it may be we want a reduced "for-automation" set which is the default and a more complete "for-investigation" set

We've run into this a few times in arrowbench (and actually already have this implemented in arrowbench for exactly this reason). I don't know of an example of this in the C++ benchmarks; it might be as simple as making it so that they don't run when archery benchmark run is called, and instead we need an additional --full flag or something to run the full set?

@save-buffer (Contributor, Author)

What would be the best way to provide a light version of the benchmark? Should I create a hash_join_benchmark_lite.cc, or should this be done with compile-time arguments?
An alternative would be to have the automation tooling specify which benchmarks to run using a benchmark_filter regex (a sketch of that approach is below).
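
A hedged sketch of that filter-based route, with hypothetical benchmark names: give the exhaustive cases a recognizable prefix so the tooling can select or exclude them with Google Benchmark's --benchmark_filter regex flag (recent versions also accept a leading '-' to negate the match).

```cpp
// Hypothetical naming scheme: the long parameter sweeps share a "Detailed"
// prefix so a regex filter can skip them without any compile-time flag.
#include <benchmark/benchmark.h>

static void HashJoinBasic(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.iterations());  // stand-in for one join run
  }
}
BENCHMARK(HashJoinBasic);

static void DetailedHashJoinSelectivity(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.range(0));  // stand-in for a parameter sweep
  }
}
BENCHMARK(DetailedHashJoinSelectivity)->DenseRange(1, 99, 7);

BENCHMARK_MAIN();
```

Automation would then invoke the binary with something like --benchmark_filter=-Detailed.* to run only the quick set.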

@jonkeane (Member)

@ElenaHenderson might have some ideas about benchmark filtering and how hard/easy that would be to implement. Currently, the automation runs whatever is run when we do archery benchmark run, so it would be sufficient to make the extraneous benchmarks not run without some other flag (or a separate command, or...).

@pitrou (Member)

pitrou commented Dec 16, 2021

I'm not opposed to adding flags, but I think it would be more widely beneficial to try and reduce the matrix of parameter values.

@jonkeane (Member)

Yes, I didn't mean to imply that the flag / gating should be the only thing we do; we should do both (or only the parameter reduction, if that gets the space small enough to just run them all).

save-buffer force-pushed the sasha_benchmark branch 2 times, most recently from 5a8cdc2 to 0250ce0 on January 7, 2022 22:13
westonpace self-requested a review on January 7, 2022 22:18
@westonpace (Member) left a comment

Thanks again for doing this. The new flag works great. Just a few more notes. I think you're just about there but there is a glitch in the X axis for UTF8 now.

@jonkeane we will need to make sure to set ARROW_BUILD_DETAILED_BENCHMARKS=OFF on any kind of regularly run benchmarks (conbench?).

@pitrou / @kou Any last thoughts?

(Member)

Suggested change:
- DCHECK_OK(join_->InputReceived(tid, 1 /* side */, *it));
+ DCHECK_OK(join_->InputReceived(tid, /*side=*/1, *it));

For consistency: foo(/*param_name=*/0) is more common in the code base than foo(0 /* param_name */).

(Member)

Why wrap this with a struct?

(Contributor, Author)

I originally had a few more stats and figured it would be clearer to have them in the stats_ struct. When I switched to having only one stat, I decided to keep the struct in case we add more again in the future (e.g. if we add cache miss counting and such).

(Member)

Since these are read as string values, the ordering gets messed up and the UTF8 plot ends up with its own unique X-axis values. On the other hand, if you read them in numerically, you'll need to configure the X axis to be logarithmic or else the smaller values will get washed out by the higher ones.

[Screenshots attached: badxaxis, goodxaxis]

@pitrou (Member)

pitrou commented Jan 11, 2022

No particular concern from me.

@jonkeane (Member)

> @jonkeane we will need to make sure to set ARROW_BUILD_DETAILED_BENCHMARKS=OFF on any kind of regularly run benchmarks (conbench?).

Is it possible to make the default/unset option be that they don't run? Then we won't need to make a PR against conbench (and folks running the benchmarks locally won't need to remember to set it if they don't want these).

@kou (Member) left a comment

I have no concern.

@save-buffer (Contributor, Author)

@jonkeane @westonpace I did set ARROW_BUILD_DETAILED_BENCHMARKS to OFF in CMakePresets.json, so I think it should default to OFF?

@westonpace (Member) left a comment

@save-buffer is right, my mistake, it looks like the default is OFF. Thanks for the updates. Once CI passes I will merge.

@ursabot

ursabot commented Jan 13, 2022

Benchmark runs are scheduled for baseline = 2ec4e99 and contender = ab86daf. ab86daf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
- [Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
- [Failed ⬇️1.79% ⬆️0.0%] ursa-i9-9960x
- [Finished ⬇️0.22% ⬆️0.04%] ursa-thinkcentre-m75q

Supported benchmarks:
- ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
- ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
- ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
