Skip to content

opt: ensure validation of unique constraints is efficient#65355

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
rytaft:efficient-validate
May 19, 2021
Merged

opt: ensure validation of unique constraints is efficient#65355
craig[bot] merged 2 commits intocockroachdb:masterfrom
rytaft:efficient-validate

Conversation

@rytaft
Copy link
Copy Markdown
Collaborator

@rytaft rytaft commented May 17, 2021

opt: ensure validation of unique constraints is efficient

This commit adds two new exploration rules: SplitGroupByScanIntoUnionScans
and SplitGroupByFilteredScanIntoUnionScans.

SplitGroupByScanIntoUnionScans splits a non-inverted scan under a GroupBy,
DistinctOn, or EnsureUpsertDistinctOn into a union-all of scans, where each
scan is ordered on the grouping columns. This ordering is then maintained by
the union-all operation and passed on to the grouping operation. Ordering on
the grouping columns is important since it enables the grouping operation to
execute in a streaming fashion, which is more efficient. Example:

   CREATE TABLE tab (
     region STRING NOT NULL CHECK (region IN ('ASIA', 'EUROPE')),
     data INT NOT NULL,
     INDEX (region, data)
   );

   SELECT DISTINCT data
   FROM tab;

   =>

   SELECT DISTINCT data
   FROM (SELECT * FROM tab WHERE region='ASIA')
   UNION ALL (SELECT * FROM tab WHERE region='EUROPE');

This rule does not actually build the streaming grouping operation, but it
allows another rule, GenerateStreamingGroupBy, to fire and use the new
interesting orderings provided by the UnionAll of scans to build a streaming
operation.

SplitGroupByFilteredScanIntoUnionScans is like SplitGroupByScanIntoUnionScans,
but the scan is wrapped in a Select.

These transformations are important for ensuring that validation of the
unique constraint in an implicitly-partitioned unique index is efficient. The
validation query to verify that (a, b) is UNIQUE on table tbl looks like this:

   SELECT a, b
   FROM tbl
   WHERE a IS NOT NULL AND b IS NOT NULL
   GROUP BY a, b
   HAVING count(*) > 1
   LIMIT 1;

Without SplitGroupByFilteredScanIntoUnionScans, this query would require an
inefficient and memory-intensive hash group by operation. Note that the
previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would
apply in cases where a and b are not nullable.

Fixes #56201

Release note (performance improvement): Validation of a new UNIQUE index in a
REGIONAL BY ROW table no longer requires an inefficient and memory-intensive
hash aggregation query. The optimizer can now plan the validation query so
that it uses all streaming operations, which are much more efficient.

opt: move splitScanIntoUnionScans to general_funcs.go

This commit moves splitScanIntoUnionScans and all helper functions
into general_funcs.go since it's now used in both limit_funcs.go and
groupby_funcs.go. This is a separate commit for ease of reviewing.

Release note: None

@rytaft rytaft requested review from a team, RaduBerinde and mgartner May 17, 2021 21:26
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@rytaft
Copy link
Copy Markdown
Collaborator Author

rytaft commented May 17, 2021

Please ignore the first commit; it's from #65350.

Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Nicely done!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)


pkg/sql/opt/xform/limit_funcs.go, line 215 at r2 (raw file):

// SplitScanIntoUnionScans returns a UnionAll of Scan operators (with an
// optional hard limit) that each scan over a single key from the original
// Scan's constraints. If no such UnionAll of Scans can be found, ok=false is

[nit] returns a UnionAll [..] if the scans can provide the given ordering, and if the statistics suggest that splitting the scan could be beneficial.


pkg/sql/opt/xform/limit_funcs.go, line 263 at r2 (raw file):

	if scan.Relational().Stats.Available &&
		float64(scanCount*limit*threshold) >= scan.Relational().Stats.RowCount {

[nit] I would put all the code that is limit specific inside an if limit > 0. It's more or less luck that it all works out with 0, better to not have to think about it.


pkg/sql/opt/xform/limit_funcs.go, line 272 at r2 (raw file):

	// The index ordering must have a prefix of columns of length keyLength
	// followed by the ordering columns either in order or in reverse order.
	hasLimitOrderingSeq, reverse := indexHasOrderingSequence(

[nit] hasOrderingSeq


pkg/sql/opt/xform/rules/groupby.opt, line 163 at r2 (raw file):

)
=>
((OpName) $unionScans $aggs $private)

After this rule runs, we are relying on the existing exploration rules to generate a streaming GroupBy, correct?

@rytaft rytaft force-pushed the efficient-validate branch from a1b3ca6 to 7b9ad1f Compare May 18, 2021 15:18
Copy link
Copy Markdown
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner)


pkg/sql/opt/xform/limit_funcs.go, line 215 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] returns a UnionAll [..] if the scans can provide the given ordering, and if the statistics suggest that splitting the scan could be beneficial.

Done.


pkg/sql/opt/xform/limit_funcs.go, line 263 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I would put all the code that is limit specific inside an if limit > 0. It's more or less luck that it all works out with 0, better to not have to think about it.

Done.


pkg/sql/opt/xform/limit_funcs.go, line 272 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] hasOrderingSeq

Done.


pkg/sql/opt/xform/rules/groupby.opt, line 163 at r2 (raw file):

Previously, RaduBerinde wrote…

After this rule runs, we are relying on the existing exploration rules to generate a streaming GroupBy, correct?

Yep! Added a comment.

@rytaft rytaft requested a review from DrewKimball May 18, 2021 15:39
Copy link
Copy Markdown
Contributor

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] the PR description is missing the second commit

Looks great! :lgtm:

Reviewed 5 of 7 files at r2, 2 of 2 files at r3, 3 of 3 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rytaft)


pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

	scan memo.RelExpr, sp *memo.ScanPrivate, private *memo.GroupingPrivate,
) (_ memo.RelExpr, ok bool) {
	cons, ok := c.getKnownScanConstraint(sp)

[nit] move the definition of getKnownScanConstraint from limit_funcs.go to general_funcs.go (probably in a separate commit)


pkg/sql/opt/xform/rules/groupby.opt, line 185 at r3 (raw file):

# inefficient and memory-intensive hash group by operation. Note that the
# previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would
# apply in cases where a and b are not nullable.

[nit] It might be nice to expand this sentence a bit more—"are not nullable" confused me a bit when I read it because I thought it should read "are nullable". Something like "it would apply in cases where a and b are not nullable (the IS NOT NULL filters would be eliminated during normalization)". (Or is it that the IS NOT NULL filters are not part of the query to begin with?)

Copy link
Copy Markdown
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool!

It might be nice to have a comment explaining why zero is passed as the limit to SplitScanIntoUnionScans.

rytaft added 2 commits May 19, 2021 08:42
This commit adds two new exploration rules: SplitGroupByScanIntoUnionScans
and SplitGroupByFilteredScanIntoUnionScans.

SplitGroupByScanIntoUnionScans splits a non-inverted scan under a GroupBy,
DistinctOn, or EnsureUpsertDistinctOn into a union-all of scans, where each
scan is ordered on the grouping columns. This ordering is then maintained by
the union-all operation and passed on to the grouping operation. Ordering on
the grouping columns is important since it enables the grouping operation to
execute in a streaming fashion, which is more efficient. Example:

   CREATE TABLE tab (
     region STRING NOT NULL CHECK (region IN ('ASIA', 'EUROPE')),
     data INT NOT NULL,
     INDEX (region, data)
   );

   SELECT DISTINCT data
   FROM tab;

   =>

   SELECT DISTINCT data
   FROM (SELECT * FROM tab WHERE region='ASIA')
   UNION ALL (SELECT * FROM tab WHERE region='EUROPE');

This rule does not actually build the streaming grouping operation, but it
allows another rule, GenerateStreamingGroupBy, to fire and use the new
interesting orderings provided by the UnionAll of scans to build a streaming
operation.

SplitGroupByFilteredScanIntoUnionScans is like SplitGroupByScanIntoUnionScans,
but the scan is wrapped in a Select.

These transformations are important for ensuring that validation of the
unique constraint in an implicitly-partitioned unique index is efficient. The
validation query to verify that (a, b) is UNIQUE on table tbl looks like this:

   SELECT a, b
   FROM tbl
   WHERE a IS NOT NULL AND b IS NOT NULL
   GROUP BY a, b
   HAVING count(*) > 1
   LIMIT 1;

Without SplitGroupByFilteredScanIntoUnionScans, this query would require an
inefficient and memory-intensive hash group by operation. Note that the
previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would
apply in cases where a and b are not nullable.

Fixes cockroachdb#56201

Release note (performance improvement): Validation of a new UNIQUE index in a
REGIONAL BY ROW table no longer requires an inefficient and memory-intensive
hash aggregation query. The optimizer can now plan the validation query so
that it uses all streaming operations, which are much more efficient.
This commit moves splitScanIntoUnionScans and all helper functions
into general_funcs.go since it's now used in both limit_funcs.go and
groupby_funcs.go. This is a separate commit for ease of reviewing.

Release note: None
@rytaft rytaft force-pushed the efficient-validate branch from 7b9ad1f to 35b1ec6 Compare May 19, 2021 13:43
Copy link
Copy Markdown
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

[nit] the PR description is missing the second commit

Done.

It might be nice to have a comment explaining why zero is passed as the limit to SplitScanIntoUnionScans.

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner)


pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] move the definition of getKnownScanConstraint from limit_funcs.go to general_funcs.go (probably in a separate commit)

It was moved in the second commit.


pkg/sql/opt/xform/rules/groupby.opt, line 185 at r3 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] It might be nice to expand this sentence a bit more—"are not nullable" confused me a bit when I read it because I thought it should read "are nullable". Something like "it would apply in cases where a and b are not nullable (the IS NOT NULL filters would be eliminated during normalization)". (Or is it that the IS NOT NULL filters are not part of the query to begin with?)

Done.

Copy link
Copy Markdown
Contributor

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r5, 3 of 3 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)


pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It was moved in the second commit.

Ahh I missed that...

Copy link
Copy Markdown
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 19, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql, opt: validating the unique constraint must be efficient for partitioned unique indexes

5 participants