sql: introduce array_cat_agg aggregate builtin#97826
sql: introduce array_cat_agg aggregate builtin#97826craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
d0271d2 to
62f623a
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
cc @maryliag want to make sure that this is the behavior that SQL observability wants, taking an example from #84616:
create table t (a int, b int, c STRING[]);
insert into t values (1, 1, ARRAY['a', 'b']), (1, 2, ARRAY['c', 'd']), (2, 2, ARRAY['c']), (1, 1, ARRAY['e', 'f']);root@127.0.0.1:26257/defaultdb> select a, max(b), array_unnest_agg(c) from t group by a;
a | max | array_unnest_agg
----+-----+-------------------
1 | 2 | {a,b,c,d,e,f}
2 | 2 | {c}
(2 rows)
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)
-- commits line 16 at r1:
I have a slight preference for naming this function array_unnest_agg and not array_concat_agg as the issue suggests since the former seems more descriptive / intuitive to me, but I can change it if others prefer the latter.
|
Your example is similar to what I want, so that works, thank you! I can share the actual query we want to use this for. On this case we're combining the values of the index_recommendations column (you can use a simplified version of these query to test just that aggregation, but wanted to share the full thing in case you want to compare performance): |
Previously, yuzefovich (Yahor Yuzefovich) wrote…
The reason I suggested |
62f623a to
912d8cb
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Curious, the function combines everything even if we have duplicates? Also wanna make sure it accepts an array being null.
Yes, the duplicates are kept as well as all NULLs.
Thanks for pointing this view out. I made the change to the view definition to use the new aggregate builtin (I believe it should just be a drop in replacement).
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)
Previously, rytaft (Rebecca Taft) wrote…
The reason I suggested
array_concat_aggis because there is already a scalar function calledarray_concatthat concatenates two arrays (and I've seen other people on the internet use the namearray_concat_aggfor a similar purpose). But I'm open toarray_unnest_aggif you think that's better.
Ack, let's use array_concat_agg then.
58c94e2 to
31338f1
Compare
maryliag
left a comment
There was a problem hiding this comment.
That's awesome!
I'll let someone from SQL Queries to review the implementation itself, but from a usage perspective on the view
Reviewed 17 of 17 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)
31338f1 to
3ac36ff
Compare
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Ugh I'm totally wrong -- it's array_cat, so |
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)
Previously, rytaft (Rebecca Taft) wrote…
Ugh I'm totally wrong -- it's array_cat, so
array_cat_aggis probably what we want. Sorry for the hassle.
Where does this array_cat come from? On a quick search I can see that both array_concat_agg (e.g. BigQuery) and array_cat_agg (e.g. citus) are used. To me array_concat_agg seems more intuitive, and since PG doesn't support it anyway, I'd personally choose that version.
Previously, yuzefovich (Yahor Yuzefovich) wrote…
|
DrewKimball
left a comment
There was a problem hiding this comment.
Reviewed 16 of 17 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft and @yuzefovich)
pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):
func (a *arrayConcatAggregate) Add(ctx context.Context, datum tree.Datum, _ ...tree.Datum) error { if datum == tree.DNull { return a.arr.Append(datum)
DNull has a nonzero size; should we account for it here?
Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context, SELECT unnest(NULL::int[]) returns no rows.
rytaft
left a comment
There was a problem hiding this comment.
Awesome! Thanks for doing this!
Reviewed 16 of 17 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)
pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
DNull has a nonzero size; should we account for it here?
Also, it seems odd to me that we'd add a
NULLelement when passed aNULLargument. For context,SELECT unnest(NULL::int[])returns no rows.
+1, would be good to add a test that this aggregate produces the same output as using array_agg with LATERAL ... unnest
3ac36ff to
ed3f25e
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)
pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):
DNull has a nonzero size; should we account for it here?
Indeed, it's non-zero size, but we have a global singleton for it, so it is effectively free. (Side note: I wonder whether we should just change dNull.Size to return 0?)
Also, it seems odd to me that we'd add a NULL element when passed a NULL argument. For context,
SELECT unnest(NULL::int[])returns no rows.
I followed the behavior or array_agg which does include NULLs into the result array. In my mind this new aggregate builtin is an aggregate function firstmost, and the closest existing aggregate we have is array_agg, so it should be as close to that one as possible. This difference makes it so that array_concat_agg is different from array_agg with LATERAL ... unnest, but it's up to us how we define it. In particular, in the single view where it is now used we are removing all NULLs from the result array, so the end result will be the same.
That said, I can also see us not including NULLs into the result. Then we'd remove the call to array_remove in that view. I'd also probably then rename the aggregate to array_unnest_agg.
Let's all agree on the semantics and naming here.
|
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I agree that if any of the arrays we are aggregating contains a NULL, then we should include that in the output array (this would match e.g., given the following rows: I would expect the following output (i.e., with one NULL, not two): But feel free to overrule me if you're convinced otherwise... |
ed3f25e to
4b68284
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)
pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I agree that if any of the arrays we are aggregating contains a NULL, then we should include that in the output array (this would match
array_agg). But if the input array itself is NULL, it seems odd to me to treat it the same way as if an element were NULL.e.g., given the following rows:
[1, 2, 3] [NULL, 4] NULL [5]I would expect the following output (i.e., with one NULL, not two):
[1, 2, 3, NULL, 4, 5]But feel free to overrule me if you're convinced otherwise...
I was wrong, this sounds very reasonable, just made this change.
Let's agree on the naming - I have slight preference for array_concat_agg (or array_unnest_agg), Becca seems to prefer array_cat_agg, @DrewKimball I guess your preference is the deciding factor 😃
Wow, so much pressure... I guess I'm in favor of |
4b68284 to
8f47b74
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @maryliag, and @rytaft)
pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I was wrong, this sounds very reasonable, just made this change.
Let's agree on the naming - I have slight preference for
array_concat_agg(orarray_unnest_agg), Becca seems to preferarray_cat_agg, @DrewKimball I guess your preference is the deciding factor 😃
Ok, renamed to array_cat_agg. Does one of you still think that some test to compare with LATERAL ... unnest is needed? (It seems to me like an overkill given the very simple behavior of this aggregate.)
I'm happy. Thanks for doing this! |
rytaft
left a comment
There was a problem hiding this comment.
thanks for all the iterations on this
Reviewed 3 of 3 files at r4, 2 of 2 files at r5, 17 of 17 files at r6, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @yuzefovich)
pkg/sql/sem/builtins/aggregate_builtins.go line 1517 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Ok, renamed to
array_cat_agg. Does one of you still think that some test to compare withLATERAL ... unnestis needed? (It seems to me like an overkill given the very simple behavior of this aggregate.)
I think the current tests are enough (although I added one small suggestion). Thanks!
pkg/sql/logictest/testdata/logic_test/aggregate line 194 at r6 (raw file):
(6, 2, 3, 'b', '1ms', '{6, 2, 3}'), (7, 2, 2, 'b', '4 days', '{7, 2, 2}'), (8, 4, 2, 'A', '3 years', '{8, 4, 2}')
nit: consider adding a NULL element inside one of these arrays.
This commit introduces a new `array_cat_agg` aggregate builtin function that takes in an array type as its input, and then unnests each array and appends all its elements into a single result array. In other words, it behaves similar to `array_agg(unnest(array_column))`. This function doesn't have an analogue in Postgres. However, some of our SQL observability tools need this functionality, and the current workaround of using a LATERAL JOIN often results in slow apply joins, so this new builtin should speed things up significantly. In particular, `crdb_internal.statement_statistics` view is now refactored to use the new builtin which removes an apply join from it. The choice of this particular name comes from the fact that we have the `array_cat` builtin which concatenates two arrays. Release note (sql change): New aggregate builtin function `array_cat_agg` is introduced. It behaves similar to how `array_agg(unnest(array_column))` would - namely, it takes arrays as its input, unnests them into the array elements which are then aggregated into a single result array (i.e. it's similar to concatenating all input arrays into a single one).
8f47b74 to
cb426a6
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
TFTRs!
bors r+
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rytaft)
|
Build failed (retrying...): |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from cb426a6 to blathers/backport-release-22.2-97826: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This commit introduces a new
array_cat_aggaggregate builtin functionthat takes in an array type as its input, and then unnests each array
and appends all its elements into a single result array. In other
words, it behaves similar to
array_agg(unnest(array_column)). Thisfunction doesn't have an analogue in Postgres. However, some of our SQL
observability tools need this functionality, and the current workaround
of using a LATERAL JOIN often results in slow apply joins, so this new
builtin should speed things up significantly. In particular,
crdb_internal.statement_statisticsview is now refactored to use thenew builtin which removes an apply join from it. The choice of this
particular name comes from the fact that we have the
array_catbuiltinwhich concatenates two arrays.
Fixes: #97502.
Release note (sql change): New aggregate builtin function
array_cat_aggis introduced. It behaves similar to howarray_agg(unnest(array_column))would - namely, it takes arrays as itsinput, unnests them into the array elements which are then aggregated
into a single result array (i.e. it's similar to concatenating all input
arrays into a single one).