Skip to content

memo: better selectivity estimates on single-table ORed predicates#89358

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
msirek:75090
Nov 30, 2022
Merged

memo: better selectivity estimates on single-table ORed predicates#89358
craig[bot] merged 1 commit intocockroachdb:masterfrom
msirek:75090

Conversation

@msirek
Copy link
Copy Markdown
Contributor

@msirek msirek commented Oct 5, 2022

Fixes #75090

Currently, the optimizer only does a good job of estimating the
selectivity of single-table ORed predicates when all disjuncts are tight
constraints. Otherwise we give up and go with the default selectivity of
1/3 for the entire filter. If any of the disjuncts has a selectivity
above 1/3, then the total selectivity should be at least as high as the
selectivity of that disjunct. Any failure to find a tight constraint for
a given disjunct should only cause a selectivity of 1/3 to be used for
that disjunct, not the entire filter.

This is addressed by using the method of combining ORed selectivities
introduced by #74303:

   Given predicates A and B and probability function P:

      P(A or B) = P(A) + P(B) - P(A and B)

The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.

Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.

@msirek msirek requested a review from a team as a code owner October 5, 2022 01:24
@msirek msirek marked this pull request as draft October 5, 2022 01:24
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@msirek msirek changed the title memo: better selectivity estimates on single-table ORed predicates [DNM] memo: better selectivity estimates on single-table ORed predicates Oct 5, 2022
@msirek msirek added the do-not-merge bors won't merge a PR with this label. label Oct 5, 2022
@rytaft
Copy link
Copy Markdown
Collaborator

rytaft commented Oct 5, 2022

I see you added the DNM label -- is this ready for review or should I hold off?

@msirek
Copy link
Copy Markdown
Contributor Author

msirek commented Oct 5, 2022

I see you added the DNM label -- is this ready for review or should I hold off?

Not ready yet. I wanted to get started, but I need to add tests and see if there are cases where selectivity should be additive, for example, disjuncts on the same column: a=1 or a=2 or a=3. Though maybe these would be combined prior to getting to this spot in the code.

@msirek msirek force-pushed the 75090 branch 3 times, most recently from 1f0ea03 to 37b6429 Compare November 6, 2022 23:40
@msirek msirek changed the title [DNM] memo: better selectivity estimates on single-table ORed predicates memo: better selectivity estimates on single-table ORed predicates Nov 6, 2022
@msirek msirek removed the do-not-merge bors won't merge a PR with this label. label Nov 6, 2022
@msirek msirek marked this pull request as ready for review November 6, 2022 23:41
Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@msirek msirek requested review from michae2 and rytaft November 6, 2022 23:42
@msirek msirek force-pushed the 75090 branch 2 times, most recently from e405479 to f42ee26 Compare November 8, 2022 14:55
Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):

	scalarProps := filter.ScalarProps()
	constrainedCols.UnionWith(scalarProps.OuterCols)
	if scalarProps.Constraints != nil {

(This isn't really a comment on this PR, just something that is confusing me. Maybe this is a question for @rytaft or @mgartner.) I find this condition very confusing. There is something implicit connecting the existence of scalarProps.Constraints with not needing to handle these disjunctions, but I don't understand what it is. Can't constraints also describe disjunctions? And why does the first branch of this condition modify histograms and distinct counts, while the second branch modifies row count and selectivity? Shouldn't both branches modify the same parts of the statistics?


pkg/sql/opt/memo/statistics_builder.go line 4308 at r2 (raw file):

// The independence assumption may not be correct in all cases.
// Would using the full formula lead to more errors since the independence
// assumption is used in more terms in that formula?

Maybe I'm misunderstanding this comment or doing the math wrong, but isn't this iterative approach exactly the same as the textbook solution when using the independence assumption? In other words, assuming P(A and B) = P(A) * P(B) then both P((A or B) and C) from the iterative approach and the equivalent terms P(A and C) + P(B and C) - P(A and B and C) from the textbook solution become P(A)P(C) + P(B)P(C) - P(A)P(B)P(C). I think they only differ when not making the independence assumption. (Maybe that's exactly what this comment is saying?)

Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, and the idea makes sense. It's just taking me some time to understand the code.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @rytaft)

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice clean solution!

Reviewed 13 of 13 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @msirek)


pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):

Can't constraints also describe disjunctions?

Yes they can, but I think the idea is that if scalarProps.Constraints is not nil, it means the constraints library has already done the hard work of figuring out how the disjunction (or whatever the predicate is) maps to a constraint. We already know how to use a constraint to update the stats, so we fall back to that approach if it's available.

And why does the first branch of this condition modify histograms and distinct counts

It's clear how constraints modify the histograms and distinct counts, whereas I don't think it's so obvious for generic disjunctions. For example, I don't think you can say anything about the distinct counts / histograms of either x or y given the predicate x = 1 OR y = 1. I think we apply the resulting selectivity to the distinct count / histogram later on, which just reduces all buckets according to the selectivity.

Does this help?


pkg/sql/opt/memo/statistics_builder.go line 3203 at r2 (raw file):

		var selectivities []props.Selectivity
		selectivities = make([]props.Selectivity, 0, len(constraintUnion)+numUnappliedDisjuncts)

nit: no need to declare the var above, can just do selectivities := make(...


pkg/sql/opt/memo/statistics_builder.go line 3207 at r2 (raw file):

		// temporary stats struct, and combine the disjunct selectivities.
		// union the selectivity and row counts.
		sb.constrainExpr(e, constraintUnion[0], relProps, &unionStats)

you don't need this line anymore, and you can get rid of unionStats


pkg/sql/opt/memo/statistics_builder.go line 3221 at r2 (raw file):

		// constraintUnion.
		disjunctionSelectivity := combineOredSelectivities(selectivities)
		s.RowCount = disjunctionSelectivity.AsFloat() * s.RowCount

super nit: could use *=


pkg/sql/opt/memo/testdata/stats/select line 3857 at r2 (raw file):

           └── (((((c0:1 = 1) OR (c1:2 = 1)) OR (c2:3 = 1)) OR (c3:4 = 1)) OR (c4:5 = 1)) OR (c5:6 = 1) [type=bool, outer=(1-6)]

# Expected row count 0.19

It's 0.16 -- is that ok?

Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @msirek)


pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Can't constraints also describe disjunctions?

Yes they can, but I think the idea is that if scalarProps.Constraints is not nil, it means the constraints library has already done the hard work of figuring out how the disjunction (or whatever the predicate is) maps to a constraint. We already know how to use a constraint to update the stats, so we fall back to that approach if it's available.

And why does the first branch of this condition modify histograms and distinct counts

It's clear how constraints modify the histograms and distinct counts, whereas I don't think it's so obvious for generic disjunctions. For example, I don't think you can say anything about the distinct counts / histograms of either x or y given the predicate x = 1 OR y = 1. I think we apply the resulting selectivity to the distinct count / histogram later on, which just reduces all buckets according to the selectivity.

Does this help?

Yes, it does. Thank you!

For posterity: constraint.Set can only describe disjunctions within one column, not disjunctions across columns. And histogram and distinct counts are column-level stats, whereas selectivity and row count are table-level stats. So if our disjunction is entirely within one column, we'll have a constraint set and we can apply it to column-level stats. If our disjunction is across columns we won't have a constraint set ((*constraint.Set).Union returns an unconstrained set, which turns into nil in (*memo.logicalPropsBuilder).buildFiltersItemProps) and can only change table-level stats.

Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Cool!

Reviewed 10 of 13 files at r2.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mgartner and @msirek)


pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):

 ├── scan classrequest
 │    ├── columns: studentid:6(int!null) firstchoiceclassid:7(int) secondchoiceclassid:8(int) thirdchoiceclassid:9(int)
 │    ├── stats: [rows=1000, distinct(8)=100, null(8)=10]

This is interesting, I wonder why this has the distinct and null counts now?


pkg/sql/opt/memo/testdata/stats/select line 3609 at r2 (raw file):

            "num_eq": 1,
            "num_range": 10,
            "upper_bound": "10"

num_range and distinct_range should be 9 here and in all the histograms below (assuming these histograms represent the numbers {1 .. 10}).


pkg/sql/opt/memo/testdata/stats/select line 3613 at r2 (raw file):

    ],
    "histo_col_type": "INT8",
    "avg_size": 3

nit: not strictly necessary, but these objects should each also have "histo_version": 2, "null_count": 0

Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 3203 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: no need to declare the var above, can just do selectivities := make(...

Fixed

Code quote:

selectivities = make([]props.Selectivity, 0, len(constraintUnion)+numUnappliedDisjuncts)

pkg/sql/opt/memo/statistics_builder.go line 3207 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

you don't need this line anymore, and you can get rid of unionStats

Fixed

Code quote:

sb.constrainExpr(e, constraintUnion[0], relProps, &unionStats)

pkg/sql/opt/memo/statistics_builder.go line 3221 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

super nit: could use *=

modified as suggested

Code quote:

s.RowCount = disjunctionSelectivity.AsFloat() * s.RowCount

pkg/sql/opt/memo/statistics_builder.go line 4308 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Maybe I'm misunderstanding this comment or doing the math wrong, but isn't this iterative approach exactly the same as the textbook solution when using the independence assumption? In other words, assuming P(A and B) = P(A) * P(B) then both P((A or B) and C) from the iterative approach and the equivalent terms P(A and C) + P(B and C) - P(A and B and C) from the textbook solution become P(A)P(C) + P(B)P(C) - P(A)P(B)P(C). I think they only differ when not making the independence assumption. (Maybe that's exactly what this comment is saying?)

Thanks for looking at this in detail. Your math looks correct, and I worked through it as well. Previously, I thought I had an example where the results from the fully-expanded formula and iterative approach differed, but I could not find it or reconstruct it. I've updated the comments.


pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This is interesting, I wonder why this has the distinct and null counts now?

It is due to the following diff in buildDisjunctionConstraints():

image.png

If any non-tight disjuncts were found, we'd give up building any disjunction constraints. Maybe the default when there is no constraint does not initialize a default number of nulls and distinct values? Do you think we should be?


pkg/sql/opt/memo/testdata/stats/select line 3609 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

num_range and distinct_range should be 9 here and in all the histograms below (assuming these histograms represent the numbers {1 .. 10}).

Fixed

Code quote:

"distinct_range": 10,

pkg/sql/opt/memo/testdata/stats/select line 3613 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: not strictly necessary, but these objects should each also have "histo_version": 2, "null_count": 0

Added

Code quote:

"histo_col_type": "INT8",

pkg/sql/opt/memo/testdata/stats/select line 3857 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It's 0.16 -- is that ok?

Not sure what changed, but it's back to 0.19 in the latest code.

Code quote:

Expected row count 0.19

Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 13 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @msirek, and @rytaft)


pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…

It is due to the following diff in buildDisjunctionConstraints():

image.png

If any non-tight disjuncts were found, we'd give up building any disjunction constraints. Maybe the default when there is no constraint does not initialize a default number of nulls and distinct values? Do you think we should be?

It seems fine the way it is now (using the defaults).

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @msirek)

@craig craig bot merged commit b21fd88 into cockroachdb:master Nov 30, 2022
Copy link
Copy Markdown
Contributor

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Reviewed 10 of 13 files at r2, 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)

@msirek msirek deleted the 75090 branch December 22, 2022 21:39
mgartner added a commit to mgartner/cockroach that referenced this pull request Dec 28, 2022
This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `true`. When `true`, the improved disjunction
logic added in cockroachdb#89358. When `false`, the logic reverts to the previous
logic. This setting will allow use to backport the new logic without
impacting query plans of running clusters by defaulting the setting to
false.

Release note: None
@mgartner
Copy link
Copy Markdown
Contributor

I'm going to backport this to v22.1 and v22.2 along with #94389 which will gate it behind a session setting.

mgartner added a commit to mgartner/cockroach that referenced this pull request Dec 29, 2022
This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `true`. When `true`, the improved disjunction
logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to
the previous logic. This setting will allow use to backport the new
logic without impacting query plans of running clusters by defaulting
the setting to false.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 29, 2022
94389: sql: add optimizer_use_improved_disjunction_stats session setting r=mgartner a=mgartner

This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `true`. When `true`, the improved disjunction
logic added in #89358 is performed. When `false`, the logic reverts to
the previous logic. This setting will allow use to backport the new
logic without impacting query plans of running clusters by defaulting
the setting to false.

Epic: None

Release note: None


94428: kvserver: attempt deflake rq multistore rebalance r=kvoli a=kvoli

Previously,`TestReplicateQueueRebalanceMultiStore` would calculate the
cluster lease and range count to assert against that the cluster
balanced. This was problematic as the number of ranges may reduce over
the duration of the test, or an inconsistent snapshot of range count
could be taken.

This patch attempts to deflake the test by using the total number of
ranges that is returned when calculating the replicas and leases per
store, in order to avoid the above inconsistency.

Stressing with the patch:

```
$ dev test pkg/kv/kvserver -fTestReplicateQueueRebalanceMultiStore -v --stress
...
195 runs so far, 0 failures, over 18m25s
```


Informs: #92219

Release note: None

Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
mgartner added a commit to mgartner/cockroach that referenced this pull request Dec 29, 2022
This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `true`. When `true`, the improved disjunction
logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to
the previous logic. This setting will allow use to backport the new
logic without impacting query plans of running clusters by defaulting
the setting to false.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this pull request Dec 29, 2022
This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `true`. When `true`, the improved disjunction
logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to
the previous logic. This setting will allow use to backport the new
logic without impacting query plans of running clusters by defaulting
the setting to false.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this pull request Dec 29, 2022
These subtle bugs were discovered while creating cockroachdb#94439, the backport
of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that
backport. The backport to v22.2 has not yet been created, and it will
include this commit.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this pull request Dec 29, 2022
This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `true`. When `true`, the improved disjunction
logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to
the previous logic. This setting will allow use to backport the new
logic without impacting query plans of running clusters by defaulting
the setting to false.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this pull request Dec 30, 2022
This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `false`. When `true`, the improved disjunction
logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to
the previous logic. This setting will allow use to backport the new
logic without impacting query plans of running clusters by defaulting
the setting to false.

Release note: None
craig bot pushed a commit that referenced this pull request Jan 3, 2023
92327: backupccl: add tenant backup nemesis tests r=msbutler a=stevendanna

This adds a bit of randomized testing to stress incremental tenant backups.

Tenant backups were previously untenable because of non-MVCC operations. Now that we support MVCC-compatible bulk operations, they should in principal be possible.

Epic: CRDB-20264

Release note: None

94442: opt: fix minor mistakes in legacy disjunction stats logic r=mgartner a=mgartner

These subtle bugs were discovered while creating #94439, the backport
of #89358 and #94389 to v22.1. They have already been included in that
backport. The backport to v22.2 has not yet been created, and it will
include this commit.

Epic: None

Release note: None


Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 3, 2023
This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `true`. When `true`, the improved disjunction
logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to
the previous logic. This setting will allow use to backport the new
logic without impacting query plans of running clusters by defaulting
the setting to false.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 3, 2023
These subtle bugs were discovered while creating cockroachdb#94439, the backport
of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that
backport. The backport to v22.2 has not yet been created, and it will
include this commit.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 4, 2023
This commit adds the `optimizer_use_improved_disjunction_stats` session
setting that defaults to `false`. When `true`, the improved disjunction
logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to
the previous logic. This setting will allow use to backport the new
logic without impacting query plans of running clusters by defaulting
the setting to false.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 4, 2023
These subtle bugs were discovered while creating cockroachdb#94439, the backport
of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that
backport. The backport to v22.2 has not yet been created, and it will
include this commit.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Better selectivity estimates for ORed predicates

5 participants