simplify `array_has` UDF to `InList` expr when haystack is constant #15354

davidhewitt · 2025-03-21T21:48:34Z

Which issue does this PR close?

Closes array_has UDF performance is slow for smaller number of needles #14533

Rationale for this change

This PR indirectly addresses #14533 not by actually changing the array_has evaluation but instead by simplifying it to the equivalent InList expression where the haystack is not varying per-row.

The array_has udf has to operate row-by-row because it may have a varying haystack. The InList expression, on the other hand, can operate in a columnar fashion by evaluating each of the N haystack items for equality against the needle and OR the results. It looks to me list InList also supports some kind of Set optimization.

What changes are included in this PR?

Add a simplify implementation to array_has UDF which will produce an InList expr when the haystack is a literal list.

Are these changes tested?

Yes, see test additions in the diff.

I also reran the original example from #14533 and we see that now the last two statements are now on equivalent performance as the others.

> CREATE TABLE test AS (SELECT substr(md5(i)::text, 1, 32) as haystack FROM generate_series(1, 100000) t(i));
0 row(s) fetched. 
Elapsed 0.027 seconds.

> SELECT * FROM test limit 1;
+----------------------------------+
| haystack                         |
+----------------------------------+
| 7f4b18de3cfeb9b4ac78c381ee2ad278 |
+----------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> SELECT count(*) FROM test WHERE haystack = '7f4b18de3cfeb9b4ac78c381ee2ad278';
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

> SELECT count(*) FROM test WHERE haystack IN ('7f4b18de3cfeb9b4ac78c381ee2ad278');
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

> SELECT count(*) FROM test WHERE haystack = ANY(['7f4b18de3cfeb9b4ac78c381ee2ad278']);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> SELECT count(*) FROM test WHERE array_has(['7f4b18de3cfeb9b4ac78c381ee2ad278'], haystack);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

I would be happy to contribute a benchmark, but because this involves first simplifying the UDF expression this looked somewhat nontrivial and I'd welcome advice on where to place it.

Are there any user-facing changes?

Simplification results will change.

davidhewitt · 2025-03-21T21:56:05Z

datafusion/functions-nested/src/array_has.rs

+        // TODO: support LargeList / FixedSizeList?
+        // (not supported by `convert_array_to_scalar_vec`)


Input wanted on whether this should be resolved (maybe by handling those data types in convert_aray_to_scalar_vec).

I suggest doing the work in a follow on ticket / PR. So that would mean:

Filing a ticket with a reproducer (perhaps modified from array_has UDF performance is slow for smaller number of needles #14533)

leaving a TODO with a link to that ticket here

Filed as #15389

alamb

Thanks @davidhewitt -- this looks great to me other than needing some end to end .slt tests

Ideally the tests would

Run the queries in #14533 (or point to where they are already run)
Do an EXPLAIN ... that shows the rewrite happening

The rationale is to make sure that this code is hooked up correctly when running queries

It looks to me list InList also supports some kind of Set optimization.

Yes the inlist uses a hash table for larger element needles

alamb · 2025-03-22T14:37:56Z

datafusion/functions-nested/src/array_has.rs

+        // TODO: support LargeList / FixedSizeList?
+        // (not supported by `convert_array_to_scalar_vec`)


I suggest doing the work in a follow on ticket / PR. So that would mean:

Filing a ticket with a reproducer (perhaps modified from array_has UDF performance is slow for smaller number of needles #14533)

leaving a TODO with a link to that ticket here

davidhewitt · 2025-03-24T11:53:11Z

datafusion/sqllogictest/test_files/array.slt

+# TODO: this should probably be possible to completely remove the filter as always true?
+query TT
+explain with test AS (SELECT substr(md5(i)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i))
+select count(*) from test WHERE array_has([needle], needle);
+----
+logical_plan
+01)Projection: count(Int64(1)) AS count(*)
+02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
+03)----SubqueryAlias: test
+04)------SubqueryAlias: t
+05)--------Projection:
+06)----------Filter: __common_expr_3 = __common_expr_3
+07)------------Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) AS __common_expr_3
+08)--------------TableScan: tmp_table projection=[value]
+physical_plan
+01)ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
+02)--AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
+03)----CoalescePartitionsExec
+04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
+05)--------ProjectionExec: expr=[]
+06)----------CoalesceBatchesExec: target_batch_size=8192
+07)------------FilterExec: __common_expr_3@0 = __common_expr_3@0
+08)--------------ProjectionExec: expr=[substr(md5(CAST(value@0 AS Utf8)), 1, 32) as __common_expr_3]
+09)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+10)------------------LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=1, end=100000, batch_size=8192]


I think this is probably #15387

davidhewitt · 2025-03-24T11:54:04Z

Thanks, added .slt files and at the same time supported simplifying of make_array UDF by the same reasoning.

alamb

Thank you @davidhewitt -- this looks good to me

alamb · 2025-03-24T19:04:48Z

datafusion/functions-nested/src/array_has.rs

+                })));
+            }
+        } else if let Expr::ScalarFunction(ScalarFunction { func, args }) = haystack {
+            // make_array has a static set of arguments, so we can pull the arguments out from it


I would expect that during constant evaluation make_array would be turned into a literal so this case would be unecessary

However, you wouldn't observe that simplification happening in unit tests (only in the slt tests when everything was put together)

I tested removing this case and the slt tests failed like this

Completed 113 test files in 3 seconds External error: query result mismatch: [SQL] explain with test AS (SELECT substr(md5(i)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i)) select count(*) from test WHERE array_has([needle], needle); [Diff] (-expected|+actual) logical_plan 01)Projection: count(Int64(1)) AS count(*) 02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] 03)----SubqueryAlias: test 04)------SubqueryAlias: t - 05)--------Projection: - 06)----------Filter: __common_expr_3 = __common_expr_3 + 05)--------Projection: + 06)----------Filter: array_has(make_array(__common_expr_3), __common_expr_3) 07)------------Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) AS __common_expr_3 08)--------------TableScan: tmp_table projection=[value] physical_plan 01)ProjectionExec: expr=[count(Int64(1))@0 as count(*)] 02)--AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))] 03)----CoalescePartitionsExec 04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))] 05)--------ProjectionExec: expr=[] 06)----------CoalesceBatchesExec: target_batch_size=8192 - 07)------------FilterExec: __common_expr_3@0 = __common_expr_3@0 + 07)------------FilterExec: array_has(make_array(__common_expr_3@0), __common_expr_3@0) 08)--------------ProjectionExec: expr=[substr(md5(CAST(value@0 AS Utf8)), 1, 32) as __common_expr_3] 09)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 10)------------------LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=1, end=100000, batch_size=8192] at test_files/array.slt:6120

I found that unexpected but don't have time to look into it more now

alamb · 2025-03-24T19:06:19Z

datafusion/sqllogictest/test_files/array.slt

 #true false true false false false true true false false true false true

+# rewrite various array_has operations to InList where the haystack is a literal list
+# NB that `col in (a, b, c)` is simplified to OR if there are <= 3 elements, so we make 4-element haystack lists


alamb · 2025-03-24T19:07:09Z

datafusion/sqllogictest/test_files/array.slt

+03)----SubqueryAlias: test
+04)------SubqueryAlias: t
+05)--------Projection:
+06)----------Filter: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) IN ([Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), Utf8View("a"), Utf8View("b"), Utf8View("c")])


that is cool to see

alamb · 2025-03-25T20:42:52Z

Thanks again @davidhewitt

…pache#15354) * simplify `array_has` UDF to `InList` expr when haystack is constant * add `.slt` tests, also simplify with `make_array` * tweak comment * add test for `make_array` arg simplification

simplify array_has UDF to InList expr when haystack is constant

57bc86c

davidhewitt commented Mar 21, 2025

View reviewed changes

alamb reviewed Mar 22, 2025

View reviewed changes

add .slt tests, also simplify with make_array

3a2e170

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 24, 2025

tweak comment

aed4795

davidhewitt mentioned this pull request Mar 24, 2025

Perf: simplify LargeList constant haystack in array_has to InList expr #15389

Closed

davidhewitt commented Mar 24, 2025

View reviewed changes

add test for make_array arg simplification

72b311b

alamb approved these changes Mar 24, 2025

View reviewed changes

alamb merged commit 0a2e422 into apache:main Mar 25, 2025
27 checks passed

		// TODO: support LargeList / FixedSizeList?
		// (not supported by `convert_array_to_scalar_vec`)

simplify array_has UDF to InList expr when haystack is constant #15354

simplify array_has UDF to InList expr when haystack is constant #15354

Uh oh!

Conversation

davidhewitt commented Mar 21, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidhewitt commented Mar 24, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alamb commented Mar 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

simplify `array_has` UDF to `InList` expr when haystack is constant #15354

simplify `array_has` UDF to `InList` expr when haystack is constant #15354