-
Notifications
You must be signed in to change notification settings - Fork 1.9k
simplify array_has UDF to InList expr when haystack is constant
#15354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| // TODO: support LargeList / FixedSizeList? | ||
| // (not supported by `convert_array_to_scalar_vec`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Input wanted on whether this should be resolved (maybe by handling those data types in convert_aray_to_scalar_vec).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest doing the work in a follow on ticket / PR. So that would mean:
- Filing a ticket with a reproducer (perhaps modified from
array_hasUDF performance is slow for smaller number of needles #14533) - leaving a TODO with a link to that ticket here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed as #15389
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @davidhewitt -- this looks great to me other than needing some end to end .slt tests
Ideally the tests would
- Run the queries in #14533 (or point to where they are already run)
- Do an
EXPLAIN ...that shows the rewrite happening
The rationale is to make sure that this code is hooked up correctly when running queries
It looks to me list InList also supports some kind of Set optimization.
Yes the inlist uses a hash table for larger element needles
| // TODO: support LargeList / FixedSizeList? | ||
| // (not supported by `convert_array_to_scalar_vec`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest doing the work in a follow on ticket / PR. So that would mean:
- Filing a ticket with a reproducer (perhaps modified from
array_hasUDF performance is slow for smaller number of needles #14533) - leaving a TODO with a link to that ticket here
| # TODO: this should probably be possible to completely remove the filter as always true? | ||
| query TT | ||
| explain with test AS (SELECT substr(md5(i)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i)) | ||
| select count(*) from test WHERE array_has([needle], needle); | ||
| ---- | ||
| logical_plan | ||
| 01)Projection: count(Int64(1)) AS count(*) | ||
| 02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] | ||
| 03)----SubqueryAlias: test | ||
| 04)------SubqueryAlias: t | ||
| 05)--------Projection: | ||
| 06)----------Filter: __common_expr_3 = __common_expr_3 | ||
| 07)------------Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) AS __common_expr_3 | ||
| 08)--------------TableScan: tmp_table projection=[value] | ||
| physical_plan | ||
| 01)ProjectionExec: expr=[count(Int64(1))@0 as count(*)] | ||
| 02)--AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))] | ||
| 03)----CoalescePartitionsExec | ||
| 04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))] | ||
| 05)--------ProjectionExec: expr=[] | ||
| 06)----------CoalesceBatchesExec: target_batch_size=8192 | ||
| 07)------------FilterExec: __common_expr_3@0 = __common_expr_3@0 | ||
| 08)--------------ProjectionExec: expr=[substr(md5(CAST(value@0 AS Utf8)), 1, 32) as __common_expr_3] | ||
| 09)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 | ||
| 10)------------------LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=1, end=100000, batch_size=8192] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is probably #15387
|
Thanks, added |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @davidhewitt -- this looks good to me
| }))); | ||
| } | ||
| } else if let Expr::ScalarFunction(ScalarFunction { func, args }) = haystack { | ||
| // make_array has a static set of arguments, so we can pull the arguments out from it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that during constant evaluation make_array would be turned into a literal so this case would be unecessary
However, you wouldn't observe that simplification happening in unit tests (only in the slt tests when everything was put together)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested removing this case and the slt tests failed like this
Completed 113 test files in 3 seconds External error: query result mismatch:
[SQL] explain with test AS (SELECT substr(md5(i)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i))
select count(*) from test WHERE array_has([needle], needle);
[Diff] (-expected|+actual)
logical_plan
01)Projection: count(Int64(1)) AS count(*)
02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
03)----SubqueryAlias: test
04)------SubqueryAlias: t
- 05)--------Projection:
- 06)----------Filter: __common_expr_3 = __common_expr_3
+ 05)--------Projection:
+ 06)----------Filter: array_has(make_array(__common_expr_3), __common_expr_3)
07)------------Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) AS __common_expr_3
08)--------------TableScan: tmp_table projection=[value]
physical_plan
01)ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
02)--AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
05)--------ProjectionExec: expr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
- 07)------------FilterExec: __common_expr_3@0 = __common_expr_3@0
+ 07)------------FilterExec: array_has(make_array(__common_expr_3@0), __common_expr_3@0)
08)--------------ProjectionExec: expr=[substr(md5(CAST(value@0 AS Utf8)), 1, 32) as __common_expr_3]
09)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
10)------------------LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=1, end=100000, batch_size=8192]
at test_files/array.slt:6120I found that unexpected but don't have time to look into it more now
| #true false true false false false true true false false true false true | ||
|
|
||
| # rewrite various array_has operations to InList where the haystack is a literal list | ||
| # NB that `col in (a, b, c)` is simplified to OR if there are <= 3 elements, so we make 4-element haystack lists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| 03)----SubqueryAlias: test | ||
| 04)------SubqueryAlias: t | ||
| 05)--------Projection: | ||
| 06)----------Filter: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) IN ([Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), Utf8View("a"), Utf8View("b"), Utf8View("c")]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is cool to see
|
Thanks again @davidhewitt |
…pache#15354) * simplify `array_has` UDF to `InList` expr when haystack is constant * add `.slt` tests, also simplify with `make_array` * tweak comment * add test for `make_array` arg simplification
…pache#15354) * simplify `array_has` UDF to `InList` expr when haystack is constant * add `.slt` tests, also simplify with `make_array` * tweak comment * add test for `make_array` arg simplification
Which issue does this PR close?
array_hasUDF performance is slow for smaller number of needles #14533Rationale for this change
This PR indirectly addresses #14533 not by actually changing the
array_hasevaluation but instead by simplifying it to the equivalentInListexpression where the haystack is not varying per-row.The
array_hasudf has to operate row-by-row because it may have a varying haystack. TheInListexpression, on the other hand, can operate in a columnar fashion by evaluating each of the N haystack items for equality against the needle and OR the results. It looks to me listInListalso supports some kind ofSetoptimization.What changes are included in this PR?
Add a
simplifyimplementation toarray_hasUDF which will produce anInListexpr when the haystack is a literal list.Are these changes tested?
Yes, see test additions in the diff.
I also reran the original example from #14533 and we see that now the last two statements are now on equivalent performance as the others.
I would be happy to contribute a benchmark, but because this involves first simplifying the UDF expression this looked somewhat nontrivial and I'd welcome advice on where to place it.
Are there any user-facing changes?
Simplification results will change.