sql: introduce array_cat_agg aggregate builtin by yuzefovich · Pull Request #97826 · cockroachdb/cockroach

yuzefovich · 2023-03-01T00:43:44Z

This commit introduces a new array_cat_agg aggregate builtin function
that takes in an array type as its input, and then unnests each array
and appends all its elements into a single result array. In other
words, it behaves similar to array_agg(unnest(array_column)). This
function doesn't have an analogue in Postgres. However, some of our SQL
observability tools need this functionality, and the current workaround
of using a LATERAL JOIN often results in slow apply joins, so this new
builtin should speed things up significantly. In particular,
crdb_internal.statement_statistics view is now refactored to use the
new builtin which removes an apply join from it. The choice of this
particular name comes from the fact that we have the array_cat builtin
which concatenates two arrays.

Fixes: #97502.

Release note (sql change): New aggregate builtin function
array_cat_agg is introduced. It behaves similar to how
array_agg(unnest(array_column)) would - namely, it takes arrays as its
input, unnests them into the array elements which are then aggregated
into a single result array (i.e. it's similar to concatenating all input
arrays into a single one).

cockroach-teamcity · 2023-03-01T00:43:52Z

This change is

yuzefovich

cc @maryliag want to make sure that this is the behavior that SQL observability wants, taking an example from #84616:

create table t (a int, b int, c STRING[]);
insert into t values (1, 1, ARRAY['a', 'b']), (1, 2, ARRAY['c', 'd']), (2, 2, ARRAY['c']), (1, 1, ARRAY['e', 'f']);

root@127.0.0.1:26257/defaultdb> select a, max(b), array_unnest_agg(c) from t group by a;                                                              
  a | max | array_unnest_agg
----+-----+-------------------
  1 |   2 | {a,b,c,d,e,f}
  2 |   2 | {c}
(2 rows)

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)

-- commits line 16 at r1:
I have a slight preference for naming this function array_unnest_agg and not array_concat_agg as the issue suggests since the former seems more descriptive / intuitive to me, but I can change it if others prefer the latter.

maryliag · 2023-03-01T01:56:31Z

Your example is similar to what I want, so that works, thank you!
Curious, the function combines everything even if we have duplicates? Also wanna make sure it accepts an array being null.

I can share the actual query we want to use this for. On this case we're combining the values of the index_recommendations column (you can use a simplified version of these query to test just that aggregation, but wanted to share the full thing in case you want to compare performance):

CREATE VIEW crdb_internal.statement_statistics AS
SELECT
 aggregated_ts,
 fingerprint_id,
 transaction_fingerprint_id,
 plan_hash,
 app_name,
 max(metadata) as metadata,
 crdb_internal.merge_statement_stats(array_agg(DISTINCT statistics)),
 max(sampled_plan),
 aggregation_interval,
 array_remove(array_agg(index_rec), NULL) AS index_recommendations
FROM (
 SELECT
     aggregated_ts,
     fingerprint_id,
     transaction_fingerprint_id,
     plan_hash,
     app_name,
     metadata,
     statistics,
     sampled_plan,
     aggregation_interval,
     index_recommendations
 FROM
     crdb_internal.cluster_statement_statistics
 UNION ALL
     SELECT
         aggregated_ts,
         fingerprint_id,
         transaction_fingerprint_id,
         plan_hash,
         app_name,
         metadata,
         statistics,
         plan,
         agg_interval,
         index_recommendations
     FROM
         system.statement_statistics
)
LEFT JOIN LATERAL unnest(index_recommendations) AS index_rec ON true
GROUP BY
 aggregated_ts,
 fingerprint_id,
 transaction_fingerprint_id,
 plan_hash,
 app_name,
 aggregation_interval

rytaft · 2023-03-01T01:56:59Z

-- commits line 16 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I have a slight preference for naming this function array_unnest_agg and not array_concat_agg as the issue suggests since the former seems more descriptive / intuitive to me, but I can change it if others prefer the latter.

The reason I suggested array_concat_agg is because there is already a scalar function called array_concat that concatenates two arrays (and I've seen other people on the internet use the name array_concat_agg for a similar purpose). But I'm open to array_unnest_agg if you think that's better.

yuzefovich

Curious, the function combines everything even if we have duplicates? Also wanna make sure it accepts an array being null.

Yes, the duplicates are kept as well as all NULLs.

Thanks for pointing this view out. I made the change to the view definition to use the new aggregate builtin (I believe it should just be a drop in replacement).

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)

-- commits line 16 at r1:

Previously, rytaft (Rebecca Taft) wrote…

The reason I suggested array_concat_agg is because there is already a scalar function called array_concat that concatenates two arrays (and I've seen other people on the internet use the name array_concat_agg for a similar purpose). But I'm open to array_unnest_agg if you think that's better.

Ack, let's use array_concat_agg then.

maryliag

That's awesome!

I'll let someone from SQL Queries to review the implementation itself, but from a usage perspective on the view

Reviewed 17 of 17 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)

rytaft · 2023-03-01T19:34:58Z

-- commits line 16 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Ack, let's use array_concat_agg then.

Ugh I'm totally wrong -- it's array_cat, so array_cat_agg is probably what we want. Sorry for the hassle.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)

-- commits line 16 at r1:

Previously, rytaft (Rebecca Taft) wrote…

Ugh I'm totally wrong -- it's array_cat, so array_cat_agg is probably what we want. Sorry for the hassle.

Where does this array_cat come from? On a quick search I can see that both array_concat_agg (e.g. BigQuery) and array_cat_agg (e.g. citus) are used. To me array_concat_agg seems more intuitive, and since PG doesn't support it anyway, I'd personally choose that version.

rytaft · 2023-03-01T22:05:19Z

-- commits line 16 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Where does this array_cat come from? On a quick search I can see that both array_concat_agg (e.g. BigQuery) and array_cat_agg (e.g. citus) are used. To me array_concat_agg seems more intuitive, and since PG doesn't support it anyway, I'd personally choose that version.

array_cat is a Postgres function that we also support (which concatenates two arrays). But since this aggregate function doesn't exist in Postgres I guess it's up to us. I don't feel too strongly either way...

DrewKimball

Reviewed 16 of 17 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft and @yuzefovich)

pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

func (a *arrayConcatAggregate) Add(ctx context.Context, datum tree.Datum, _ ...tree.Datum) error {
	if datum == tree.DNull {
		return a.arr.Append(datum)

DNull has a nonzero size; should we account for it here?

Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.

rytaft

Awesome! Thanks for doing this!

Reviewed 16 of 17 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)

pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

DNull has a nonzero size; should we account for it here?

Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.

+1, would be good to add a test that this aggregate produces the same output as using array_agg with LATERAL ... unnest

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)

pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

DNull has a nonzero size; should we account for it here?

Indeed, it's non-zero size, but we have a global singleton for it, so it is effectively free. (Side note: I wonder whether we should just change dNull.Size to return 0?)

Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.

I followed the behavior or array_agg which does include NULLs into the result array. In my mind this new aggregate builtin is an aggregate function firstmost, and the closest existing aggregate we have is array_agg, so it should be as close to that one as possible. This difference makes it so that array_concat_agg is different from array_agg with LATERAL ... unnest, but it's up to us how we define it. In particular, in the single view where it is now used we are removing all NULLs from the result array, so the end result will be the same.

That said, I can also see us not including NULLs into the result. Then we'd remove the call to array_remove in that view. I'd also probably then rename the aggregate to array_unnest_agg.

Let's all agree on the semantics and naming here.

rytaft · 2023-03-07T03:23:20Z

pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

DNull has a nonzero size; should we account for it here?

Indeed, it's non-zero size, but we have a global singleton for it, so it is effectively free. (Side note: I wonder whether we should just change dNull.Size to return 0?)

Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.

I followed the behavior or array_agg which does include NULLs into the result array. In my mind this new aggregate builtin is an aggregate function firstmost, and the closest existing aggregate we have is array_agg, so it should be as close to that one as possible. This difference makes it so that array_concat_agg is different from array_agg with LATERAL ... unnest, but it's up to us how we define it. In particular, in the single view where it is now used we are removing all NULLs from the result array, so the end result will be the same.

That said, I can also see us not including NULLs into the result. Then we'd remove the call to array_remove in that view. I'd also probably then rename the aggregate to array_unnest_agg.

Let's all agree on the semantics and naming here.

I agree that if any of the arrays we are aggregating contains a NULL, then we should include that in the output array (this would match array_agg). But if the input array itself is NULL, it seems odd to me to treat it the same way as if an element were NULL.

e.g., given the following rows:

[1, 2, 3]
[NULL, 4]
NULL
[5]

I would expect the following output (i.e., with one NULL, not two):

[1, 2, 3, NULL, 4, 5]

But feel free to overrule me if you're convinced otherwise...

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)

pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I agree that if any of the arrays we are aggregating contains a NULL, then we should include that in the output array (this would match array_agg). But if the input array itself is NULL, it seems odd to me to treat it the same way as if an element were NULL.

e.g., given the following rows:
[1, 2, 3]
[NULL, 4]
NULL
[5]
I would expect the following output (i.e., with one NULL, not two):
[1, 2, 3, NULL, 4, 5]
But feel free to overrule me if you're convinced otherwise...

I was wrong, this sounds very reasonable, just made this change.

Let's agree on the naming - I have slight preference for array_concat_agg (or array_unnest_agg), Becca seems to prefer array_cat_agg, @DrewKimball I guess your preference is the deciding factor 😃

DrewKimball · 2023-03-07T03:45:04Z

Let's agree on the naming - I have slight preference for array_concat_agg (or array_unnest_agg), Becca seems to prefer array_cat_agg, @DrewKimball I guess your preference is the deciding factor 😃

Wow, so much pressure... I guess I'm in favor of array_cat_agg for consistency, but I wish array_cat was just array_concat instead.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)

pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I was wrong, this sounds very reasonable, just made this change.

Let's agree on the naming - I have slight preference for array_concat_agg (or array_unnest_agg), Becca seems to prefer array_cat_agg, @DrewKimball I guess your preference is the deciding factor 😃

Ok, renamed to array_cat_agg. Does one of you still think that some test to compare with LATERAL ... unnest is needed? (It seems to me like an overkill given the very simple behavior of this aggregate.)

DrewKimball · 2023-03-07T06:30:55Z

Ok, renamed to array_cat_agg. Does one of you still think that some test to compare with LATERAL ... unnest is needed? (It seems to me like an overkill given the very simple behavior of this aggregate.)

I'm happy. Thanks for doing this!

rytaft

thanks for all the iterations on this

Reviewed 3 of 3 files at r4, 2 of 2 files at r5, 17 of 17 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @yuzefovich)

pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Ok, renamed to array_cat_agg. Does one of you still think that some test to compare with LATERAL ... unnest is needed? (It seems to me like an overkill given the very simple behavior of this aggregate.)

I think the current tests are enough (although I added one small suggestion). Thanks!

pkg/sql/logictest/testdata/logic_test/aggregate line 194 at r6 (raw file):

(6, 2, 3, 'b', '1ms', '{6, 2, 3}'),
(7, 2, 2, 'b', '4 days', '{7, 2, 2}'),
(8, 4, 2, 'A', '3 years', '{8, 4, 2}')

nit: consider adding a NULL element inside one of these arrays.

This commit introduces a new `array_cat_agg` aggregate builtin function that takes in an array type as its input, and then unnests each array and appends all its elements into a single result array. In other words, it behaves similar to `array_agg(unnest(array_column))`. This function doesn't have an analogue in Postgres. However, some of our SQL observability tools need this functionality, and the current workaround of using a LATERAL JOIN often results in slow apply joins, so this new builtin should speed things up significantly. In particular, `crdb_internal.statement_statistics` view is now refactored to use the new builtin which removes an apply join from it. The choice of this particular name comes from the fact that we have the `array_cat` builtin which concatenates two arrays. Release note (sql change): New aggregate builtin function `array_cat_agg` is introduced. It behaves similar to how `array_agg(unnest(array_column))` would - namely, it takes arrays as its input, unnests them into the array elements which are then aggregated into a single result array (i.e. it's similar to concatenating all input arrays into a single one).

yuzefovich

TFTRs!

bors r+

Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rytaft)

craig · 2023-03-07T18:41:05Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-03-07T19:51:18Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2023-03-07T19:51:25Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from cb426a6 to blathers/backport-release-22.2-97826: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

yuzefovich force-pushed the array-unnest-agg branch 2 times, most recently from d0271d2 to 62f623a Compare March 1, 2023 01:39

yuzefovich added backport-22.1.x labels Mar 1, 2023

yuzefovich marked this pull request as ready for review March 1, 2023 01:39

yuzefovich requested a review from a team March 1, 2023 01:39

yuzefovich requested a review from a team as a code owner March 1, 2023 01:39

yuzefovich requested review from DrewKimball and rytaft March 1, 2023 01:39

yuzefovich commented Mar 1, 2023

View reviewed changes

yuzefovich force-pushed the array-unnest-agg branch from 62f623a to 912d8cb Compare March 1, 2023 03:15

yuzefovich changed the title ~~sql: introduce array_unnest_agg aggregate builtin~~ sql: introduce array_concat_agg aggregate builtin Mar 1, 2023

yuzefovich removed the backport-22.1.x label Mar 1, 2023

yuzefovich commented Mar 1, 2023

View reviewed changes

yuzefovich force-pushed the array-unnest-agg branch 2 times, most recently from 58c94e2 to 31338f1 Compare March 1, 2023 06:07

maryliag approved these changes Mar 1, 2023

View reviewed changes

yuzefovich force-pushed the array-unnest-agg branch from 31338f1 to 3ac36ff Compare March 1, 2023 16:48

yuzefovich commented Mar 1, 2023

View reviewed changes

DrewKimball reviewed Mar 2, 2023

View reviewed changes

rytaft reviewed Mar 2, 2023

View reviewed changes

yuzefovich force-pushed the array-unnest-agg branch from 3ac36ff to ed3f25e Compare March 5, 2023 21:31

yuzefovich commented Mar 5, 2023

View reviewed changes

yuzefovich force-pushed the array-unnest-agg branch from ed3f25e to 4b68284 Compare March 7, 2023 03:32

yuzefovich commented Mar 7, 2023

View reviewed changes

yuzefovich force-pushed the array-unnest-agg branch from 4b68284 to 8f47b74 Compare March 7, 2023 04:51

yuzefovich changed the title ~~sql: introduce array_concat_agg aggregate builtin~~ sql: introduce array_cat_agg aggregate builtin Mar 7, 2023

yuzefovich commented Mar 7, 2023

View reviewed changes

rytaft approved these changes Mar 7, 2023

View reviewed changes

yuzefovich force-pushed the array-unnest-agg branch from 8f47b74 to cb426a6 Compare March 7, 2023 16:56

yuzefovich commented Mar 7, 2023

View reviewed changes

craig bot merged commit cb36e3b into cockroachdb:master Mar 7, 2023

yuzefovich deleted the array-unnest-agg branch March 7, 2023 20:22

yuzefovich mentioned this pull request Mar 7, 2023

release-22.2: sql: introduce array_cat_agg aggregate builtin #98171

Merged

cockroach-teamcity mentioned this pull request Mar 8, 2023

PR #97826 - sql: introduce array_cat_agg aggregate builtin cockroachdb/docs#16438

Closed

Conversation

yuzefovich commented Mar 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Mar 1, 2023

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

maryliag commented Mar 1, 2023

Uh oh!

rytaft commented Mar 1, 2023

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

maryliag left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft commented Mar 1, 2023

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft commented Mar 1, 2023

Uh oh!

DrewKimball left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft commented Mar 7, 2023

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

DrewKimball commented Mar 7, 2023

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

DrewKimball commented Mar 7, 2023

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

craig bot commented Mar 7, 2023

Uh oh!

craig bot commented Mar 7, 2023

Uh oh!

blathers-crl bot commented Mar 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

yuzefovich commented Mar 1, 2023 •

edited

Loading