-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Coalesce batches inside hash join, reuse indices buffer #18972
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@alamb let's fire some benchmarks? |
|
🤖 |
I started some. It would be really nice to find some way for others to run their own -- I don't think it is great I am often the bottleneck. Maybe I'll see if I can get an LLM to write me a script to do that... |
|
run benchmarks |
|
🤖 |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark tpch_mem |
|
🤖 |
|
🤖: Benchmark completed Details
|
You GH repo is public. I copied your scripts locally and slightly modified them for my env but I'm sure anyone should be able to do it. |
Indeed! I also indulged in some additional scripting to launch things from PRs -- #18115 (comment) I'll maybe add some additional well known contributors too |
|
show benchmark queue |
|
🤖 Hi @alamb, you asked to view the benchmark queue (#18972 (comment)).
|
Yes the benchmarks can be run (and I usually do as part of PRs), but it is always nice to run on different environments to verify / compare. Nice to automate this stuff 🚀 |
|
@milenkovicm perhaps you have time for review? |
|
will try over weekend if not too late |
milenkovicm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dandandan this change makes sense to me, having said that, I'm not an expert in HJ :) I have put few minor comments, hopefully they make sense
|
Thanks for the review @milenkovicm |
|
Love it! |
| CoalesceBatchesExec: target_batch_size=4096 | ||
| RepartitionExec: partitioning=Hash([c2@0], 9000), input_partitions=1 | ||
| DataSourceExec: file_groups={1 group: [[ARROW_TEST_DATA/csv/aggregate_test_100.csv]]}, projection=[c1@0 as c2], file_type=csv, has_header=true | ||
| HashJoinExec: mode=Partitioned, join_type=Inner, on=[(c1@0, c2@0)], projection=[c1@0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.

Which issue does this PR close?
BatchCoalescerintoHashJoinExecand remove fromCoalesceBatchesoptimization rule #18781Rationale for this change
Improves the plan / plan readability for queries with joins.
Seems to improve perf as well a bit for more challenging joins: (TPC-H SF=10, in-memory)
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?