Skip to content

Conversation

@davidhewitt
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This PR indirectly addresses #14533 not by actually changing the array_has evaluation but instead by simplifying it to the equivalent InList expression where the haystack is not varying per-row.

The array_has udf has to operate row-by-row because it may have a varying haystack. The InList expression, on the other hand, can operate in a columnar fashion by evaluating each of the N haystack items for equality against the needle and OR the results. It looks to me list InList also supports some kind of Set optimization.

What changes are included in this PR?

Add a simplify implementation to array_has UDF which will produce an InList expr when the haystack is a literal list.

Are these changes tested?

Yes, see test additions in the diff.

I also reran the original example from #14533 and we see that now the last two statements are now on equivalent performance as the others.

> CREATE TABLE test AS (SELECT substr(md5(i)::text, 1, 32) as haystack FROM generate_series(1, 100000) t(i));
0 row(s) fetched. 
Elapsed 0.027 seconds.

> SELECT * FROM test limit 1;
+----------------------------------+
| haystack                         |
+----------------------------------+
| 7f4b18de3cfeb9b4ac78c381ee2ad278 |
+----------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> SELECT count(*) FROM test WHERE haystack = '7f4b18de3cfeb9b4ac78c381ee2ad278';
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

> SELECT count(*) FROM test WHERE haystack IN ('7f4b18de3cfeb9b4ac78c381ee2ad278');
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

> SELECT count(*) FROM test WHERE haystack = ANY(['7f4b18de3cfeb9b4ac78c381ee2ad278']);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> SELECT count(*) FROM test WHERE array_has(['7f4b18de3cfeb9b4ac78c381ee2ad278'], haystack);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

I would be happy to contribute a benchmark, but because this involves first simplifying the UDF expression this looked somewhat nontrivial and I'd welcome advice on where to place it.

Are there any user-facing changes?

Simplification results will change.

Comment on lines 157 to 158
// TODO: support LargeList / FixedSizeList?
// (not supported by `convert_array_to_scalar_vec`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Input wanted on whether this should be resolved (maybe by handling those data types in convert_aray_to_scalar_vec).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest doing the work in a follow on ticket / PR. So that would mean:

  1. Filing a ticket with a reproducer (perhaps modified from array_has UDF performance is slow for smaller number of needles #14533)
  2. leaving a TODO with a link to that ticket here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed as #15389

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @davidhewitt -- this looks great to me other than needing some end to end .slt tests

Ideally the tests would

  1. Run the queries in #14533 (or point to where they are already run)
  2. Do an EXPLAIN ... that shows the rewrite happening

The rationale is to make sure that this code is hooked up correctly when running queries

It looks to me list InList also supports some kind of Set optimization.

Yes the inlist uses a hash table for larger element needles

Comment on lines 157 to 158
// TODO: support LargeList / FixedSizeList?
// (not supported by `convert_array_to_scalar_vec`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest doing the work in a follow on ticket / PR. So that would mean:

  1. Filing a ticket with a reproducer (perhaps modified from array_has UDF performance is slow for smaller number of needles #14533)
  2. leaving a TODO with a link to that ticket here

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 24, 2025
Comment on lines +6119 to +6143
# TODO: this should probably be possible to completely remove the filter as always true?
query TT
explain with test AS (SELECT substr(md5(i)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i))
select count(*) from test WHERE array_has([needle], needle);
----
logical_plan
01)Projection: count(Int64(1)) AS count(*)
02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
03)----SubqueryAlias: test
04)------SubqueryAlias: t
05)--------Projection:
06)----------Filter: __common_expr_3 = __common_expr_3
07)------------Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) AS __common_expr_3
08)--------------TableScan: tmp_table projection=[value]
physical_plan
01)ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
02)--AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
05)--------ProjectionExec: expr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------FilterExec: __common_expr_3@0 = __common_expr_3@0
08)--------------ProjectionExec: expr=[substr(md5(CAST(value@0 AS Utf8)), 1, 32) as __common_expr_3]
09)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
10)------------------LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=1, end=100000, batch_size=8192]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably #15387

@davidhewitt
Copy link
Contributor Author

Thanks, added .slt files and at the same time supported simplifying of make_array UDF by the same reasoning.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @davidhewitt -- this looks good to me

})));
}
} else if let Expr::ScalarFunction(ScalarFunction { func, args }) = haystack {
// make_array has a static set of arguments, so we can pull the arguments out from it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that during constant evaluation make_array would be turned into a literal so this case would be unecessary

However, you wouldn't observe that simplification happening in unit tests (only in the slt tests when everything was put together)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested removing this case and the slt tests failed like this

Completed 113 test files in 3 seconds                                                                                                                                 External error: query result mismatch:
[SQL] explain with test AS (SELECT substr(md5(i)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i))
select count(*) from test WHERE array_has([needle], needle);
[Diff] (-expected|+actual)
    logical_plan
    01)Projection: count(Int64(1)) AS count(*)
    02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
    03)----SubqueryAlias: test
    04)------SubqueryAlias: t
-   05)--------Projection:
-   06)----------Filter: __common_expr_3 = __common_expr_3
+   05)--------Projection:
+   06)----------Filter: array_has(make_array(__common_expr_3), __common_expr_3)
    07)------------Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) AS __common_expr_3
    08)--------------TableScan: tmp_table projection=[value]
    physical_plan
    01)ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
    02)--AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
    03)----CoalescePartitionsExec
    04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
    05)--------ProjectionExec: expr=[]
    06)----------CoalesceBatchesExec: target_batch_size=8192
-   07)------------FilterExec: __common_expr_3@0 = __common_expr_3@0
+   07)------------FilterExec: array_has(make_array(__common_expr_3@0), __common_expr_3@0)
    08)--------------ProjectionExec: expr=[substr(md5(CAST(value@0 AS Utf8)), 1, 32) as __common_expr_3]
    09)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
    10)------------------LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=1, end=100000, batch_size=8192]
at test_files/array.slt:6120

I found that unexpected but don't have time to look into it more now

#true false true false false false true true false false true false true

# rewrite various array_has operations to InList where the haystack is a literal list
# NB that `col in (a, b, c)` is simplified to OR if there are <= 3 elements, so we make 4-element haystack lists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

03)----SubqueryAlias: test
04)------SubqueryAlias: t
05)--------Projection:
06)----------Filter: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) IN ([Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), Utf8View("a"), Utf8View("b"), Utf8View("c")])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is cool to see

@alamb alamb merged commit 0a2e422 into apache:main Mar 25, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 25, 2025

Thanks again @davidhewitt

qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 27, 2025
…pache#15354)

* simplify `array_has` UDF to `InList` expr when haystack is constant

* add `.slt` tests, also simplify with `make_array`

* tweak comment

* add test for `make_array` arg simplification
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…pache#15354)

* simplify `array_has` UDF to `InList` expr when haystack is constant

* add `.slt` tests, also simplify with `make_array`

* tweak comment

* add test for `make_array` arg simplification
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

array_has UDF performance is slow for smaller number of needles

2 participants