Skip to content

sql: introduce array_cat_agg aggregate builtin#97826

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
yuzefovich:array-unnest-agg
Mar 7, 2023
Merged

sql: introduce array_cat_agg aggregate builtin#97826
craig[bot] merged 1 commit intocockroachdb:masterfrom
yuzefovich:array-unnest-agg

Conversation

@yuzefovich
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich commented Mar 1, 2023

This commit introduces a new array_cat_agg aggregate builtin function
that takes in an array type as its input, and then unnests each array
and appends all its elements into a single result array. In other
words, it behaves similar to array_agg(unnest(array_column)). This
function doesn't have an analogue in Postgres. However, some of our SQL
observability tools need this functionality, and the current workaround
of using a LATERAL JOIN often results in slow apply joins, so this new
builtin should speed things up significantly. In particular,
crdb_internal.statement_statistics view is now refactored to use the
new builtin which removes an apply join from it. The choice of this
particular name comes from the fact that we have the array_cat builtin
which concatenates two arrays.

Fixes: #97502.

Release note (sql change): New aggregate builtin function
array_cat_agg is introduced. It behaves similar to how
array_agg(unnest(array_column)) would - namely, it takes arrays as its
input, unnests them into the array elements which are then aggregated
into a single result array (i.e. it's similar to concatenating all input
arrays into a single one).

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the array-unnest-agg branch 2 times, most recently from d0271d2 to 62f623a Compare March 1, 2023 01:39
@yuzefovich yuzefovich marked this pull request as ready for review March 1, 2023 01:39
@yuzefovich yuzefovich requested a review from a team March 1, 2023 01:39
@yuzefovich yuzefovich requested a review from a team as a code owner March 1, 2023 01:39
@yuzefovich yuzefovich requested review from DrewKimball and rytaft March 1, 2023 01:39
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @maryliag want to make sure that this is the behavior that SQL observability wants, taking an example from #84616:

create table t (a int, b int, c STRING[]);
insert into t values (1, 1, ARRAY['a', 'b']), (1, 2, ARRAY['c', 'd']), (2, 2, ARRAY['c']), (1, 1, ARRAY['e', 'f']);
root@127.0.0.1:26257/defaultdb> select a, max(b), array_unnest_agg(c) from t group by a;                                                              
  a | max | array_unnest_agg
----+-----+-------------------
  1 |   2 | {a,b,c,d,e,f}
  2 |   2 | {c}
(2 rows)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)


-- commits line 16 at r1:
I have a slight preference for naming this function array_unnest_agg and not array_concat_agg as the issue suggests since the former seems more descriptive / intuitive to me, but I can change it if others prefer the latter.

@maryliag
Copy link
Copy Markdown
Contributor

maryliag commented Mar 1, 2023

Your example is similar to what I want, so that works, thank you!
Curious, the function combines everything even if we have duplicates? Also wanna make sure it accepts an array being null.

I can share the actual query we want to use this for. On this case we're combining the values of the index_recommendations column (you can use a simplified version of these query to test just that aggregation, but wanted to share the full thing in case you want to compare performance):

CREATE VIEW crdb_internal.statement_statistics AS
SELECT
 aggregated_ts,
 fingerprint_id,
 transaction_fingerprint_id,
 plan_hash,
 app_name,
 max(metadata) as metadata,
 crdb_internal.merge_statement_stats(array_agg(DISTINCT statistics)),
 max(sampled_plan),
 aggregation_interval,
 array_remove(array_agg(index_rec), NULL) AS index_recommendations
FROM (
 SELECT
     aggregated_ts,
     fingerprint_id,
     transaction_fingerprint_id,
     plan_hash,
     app_name,
     metadata,
     statistics,
     sampled_plan,
     aggregation_interval,
     index_recommendations
 FROM
     crdb_internal.cluster_statement_statistics
 UNION ALL
     SELECT
         aggregated_ts,
         fingerprint_id,
         transaction_fingerprint_id,
         plan_hash,
         app_name,
         metadata,
         statistics,
         plan,
         agg_interval,
         index_recommendations
     FROM
         system.statement_statistics
)
LEFT JOIN LATERAL unnest(index_recommendations) AS index_rec ON true
GROUP BY
 aggregated_ts,
 fingerprint_id,
 transaction_fingerprint_id,
 plan_hash,
 app_name,
 aggregation_interval

@rytaft
Copy link
Copy Markdown
Collaborator

rytaft commented Mar 1, 2023

-- commits line 16 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I have a slight preference for naming this function array_unnest_agg and not array_concat_agg as the issue suggests since the former seems more descriptive / intuitive to me, but I can change it if others prefer the latter.

The reason I suggested array_concat_agg is because there is already a scalar function called array_concat that concatenates two arrays (and I've seen other people on the internet use the name array_concat_agg for a similar purpose). But I'm open to array_unnest_agg if you think that's better.

@yuzefovich yuzefovich changed the title sql: introduce array_unnest_agg aggregate builtin sql: introduce array_concat_agg aggregate builtin Mar 1, 2023
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, the function combines everything even if we have duplicates? Also wanna make sure it accepts an array being null.

Yes, the duplicates are kept as well as all NULLs.

Thanks for pointing this view out. I made the change to the view definition to use the new aggregate builtin (I believe it should just be a drop in replacement).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)


-- commits line 16 at r1:

Previously, rytaft (Rebecca Taft) wrote…

The reason I suggested array_concat_agg is because there is already a scalar function called array_concat that concatenates two arrays (and I've seen other people on the internet use the name array_concat_agg for a similar purpose). But I'm open to array_unnest_agg if you think that's better.

Ack, let's use array_concat_agg then.

@yuzefovich yuzefovich force-pushed the array-unnest-agg branch 2 times, most recently from 58c94e2 to 31338f1 Compare March 1, 2023 06:07
Copy link
Copy Markdown
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's awesome!

I'll let someone from SQL Queries to review the implementation itself, but from a usage perspective on the view
:lgtm:

Reviewed 17 of 17 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)

@rytaft
Copy link
Copy Markdown
Collaborator

rytaft commented Mar 1, 2023

-- commits line 16 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Ack, let's use array_concat_agg then.

Ugh I'm totally wrong -- it's array_cat, so array_cat_agg is probably what we want. Sorry for the hassle.

Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)


-- commits line 16 at r1:

Previously, rytaft (Rebecca Taft) wrote…

Ugh I'm totally wrong -- it's array_cat, so array_cat_agg is probably what we want. Sorry for the hassle.

Where does this array_cat come from? On a quick search I can see that both array_concat_agg (e.g. BigQuery) and array_cat_agg (e.g. citus) are used. To me array_concat_agg seems more intuitive, and since PG doesn't support it anyway, I'd personally choose that version.

@rytaft
Copy link
Copy Markdown
Collaborator

rytaft commented Mar 1, 2023

-- commits line 16 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Where does this array_cat come from? On a quick search I can see that both array_concat_agg (e.g. BigQuery) and array_cat_agg (e.g. citus) are used. To me array_concat_agg seems more intuitive, and since PG doesn't support it anyway, I'd personally choose that version.

array_cat is a Postgres function that we also support (which concatenates two arrays). But since this aggregate function doesn't exist in Postgres I guess it's up to us. I don't feel too strongly either way...

Copy link
Copy Markdown
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 16 of 17 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft and @yuzefovich)


pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

func (a *arrayConcatAggregate) Add(ctx context.Context, datum tree.Datum, _ ...tree.Datum) error {
	if datum == tree.DNull {
		return a.arr.Append(datum)

DNull has a nonzero size; should we account for it here?

Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks for doing this!

Reviewed 16 of 17 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)


pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

DNull has a nonzero size; should we account for it here?

Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.

+1, would be good to add a test that this aggregate produces the same output as using array_agg with LATERAL ... unnest

Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)


pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

DNull has a nonzero size; should we account for it here?

Indeed, it's non-zero size, but we have a global singleton for it, so it is effectively free. (Side note: I wonder whether we should just change dNull.Size to return 0?)

Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.

I followed the behavior or array_agg which does include NULLs into the result array. In my mind this new aggregate builtin is an aggregate function firstmost, and the closest existing aggregate we have is array_agg, so it should be as close to that one as possible. This difference makes it so that array_concat_agg is different from array_agg with LATERAL ... unnest, but it's up to us how we define it. In particular, in the single view where it is now used we are removing all NULLs from the result array, so the end result will be the same.

That said, I can also see us not including NULLs into the result. Then we'd remove the call to array_remove in that view. I'd also probably then rename the aggregate to array_unnest_agg.

Let's all agree on the semantics and naming here.

@rytaft
Copy link
Copy Markdown
Collaborator

rytaft commented Mar 7, 2023

pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

DNull has a nonzero size; should we account for it here?

Indeed, it's non-zero size, but we have a global singleton for it, so it is effectively free. (Side note: I wonder whether we should just change dNull.Size to return 0?)

Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.

I followed the behavior or array_agg which does include NULLs into the result array. In my mind this new aggregate builtin is an aggregate function firstmost, and the closest existing aggregate we have is array_agg, so it should be as close to that one as possible. This difference makes it so that array_concat_agg is different from array_agg with LATERAL ... unnest, but it's up to us how we define it. In particular, in the single view where it is now used we are removing all NULLs from the result array, so the end result will be the same.

That said, I can also see us not including NULLs into the result. Then we'd remove the call to array_remove in that view. I'd also probably then rename the aggregate to array_unnest_agg.

Let's all agree on the semantics and naming here.

I agree that if any of the arrays we are aggregating contains a NULL, then we should include that in the output array (this would match array_agg). But if the input array itself is NULL, it seems odd to me to treat it the same way as if an element were NULL.

e.g., given the following rows:

[1, 2, 3]
[NULL, 4]
NULL
[5]

I would expect the following output (i.e., with one NULL, not two):

[1, 2, 3, NULL, 4, 5]

But feel free to overrule me if you're convinced otherwise...

Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)


pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I agree that if any of the arrays we are aggregating contains a NULL, then we should include that in the output array (this would match array_agg). But if the input array itself is NULL, it seems odd to me to treat it the same way as if an element were NULL.

e.g., given the following rows:

[1, 2, 3]
[NULL, 4]
NULL
[5]

I would expect the following output (i.e., with one NULL, not two):

[1, 2, 3, NULL, 4, 5]

But feel free to overrule me if you're convinced otherwise...

I was wrong, this sounds very reasonable, just made this change.

Let's agree on the naming - I have slight preference for array_concat_agg (or array_unnest_agg), Becca seems to prefer array_cat_agg, @DrewKimball I guess your preference is the deciding factor 😃

@DrewKimball
Copy link
Copy Markdown
Collaborator

Let's agree on the naming - I have slight preference for array_concat_agg (or array_unnest_agg), Becca seems to prefer array_cat_agg, @DrewKimball I guess your preference is the deciding factor 😃

Wow, so much pressure... I guess I'm in favor of array_cat_agg for consistency, but I wish array_cat was just array_concat instead.

@yuzefovich yuzefovich changed the title sql: introduce array_concat_agg aggregate builtin sql: introduce array_cat_agg aggregate builtin Mar 7, 2023
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)


pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I was wrong, this sounds very reasonable, just made this change.

Let's agree on the naming - I have slight preference for array_concat_agg (or array_unnest_agg), Becca seems to prefer array_cat_agg, @DrewKimball I guess your preference is the deciding factor 😃

Ok, renamed to array_cat_agg. Does one of you still think that some test to compare with LATERAL ... unnest is needed? (It seems to me like an overkill given the very simple behavior of this aggregate.)

@DrewKimball
Copy link
Copy Markdown
Collaborator

Ok, renamed to array_cat_agg. Does one of you still think that some test to compare with LATERAL ... unnest is needed? (It seems to me like an overkill given the very simple behavior of this aggregate.)

I'm happy. Thanks for doing this!

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: thanks for all the iterations on this

Reviewed 3 of 3 files at r4, 2 of 2 files at r5, 17 of 17 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @yuzefovich)


pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Ok, renamed to array_cat_agg. Does one of you still think that some test to compare with LATERAL ... unnest is needed? (It seems to me like an overkill given the very simple behavior of this aggregate.)

I think the current tests are enough (although I added one small suggestion). Thanks!


pkg/sql/logictest/testdata/logic_test/aggregate line 194 at r6 (raw file):

(6, 2, 3, 'b', '1ms', '{6, 2, 3}'),
(7, 2, 2, 'b', '4 days', '{7, 2, 2}'),
(8, 4, 2, 'A', '3 years', '{8, 4, 2}')

nit: consider adding a NULL element inside one of these arrays.

This commit introduces a new `array_cat_agg` aggregate builtin function
that takes in an array type as its input, and then unnests each array
and appends all its elements into a single result array. In other
words, it behaves similar to `array_agg(unnest(array_column))`. This
function doesn't have an analogue in Postgres. However, some of our SQL
observability tools need this functionality, and the current workaround
of using a LATERAL JOIN often results in slow apply joins, so this new
builtin should speed things up significantly. In particular,
`crdb_internal.statement_statistics` view is now refactored to use the
new builtin which removes an apply join from it. The choice of this
particular name comes from the fact that we have the `array_cat` builtin
which concatenates two arrays.

Release note (sql change): New aggregate builtin function
`array_cat_agg` is introduced. It behaves similar to how
`array_agg(unnest(array_column))` would - namely, it takes arrays as its
input, unnests them into the array elements which are then aggregated
into a single result array (i.e. it's similar to concatenating all input
arrays into a single one).
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rytaft)

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 7, 2023

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 7, 2023

Build succeeded:

@craig craig bot merged commit cb36e3b into cockroachdb:master Mar 7, 2023
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Mar 7, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from cb426a6 to blathers/backport-release-22.2-97826: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: create new aggregate function array_concat_agg

5 participants