memo: better selectivity estimates on single-table ORed predicates by msirek · Pull Request #89358 · cockroachdb/cockroach

msirek · 2022-10-05T01:24:09Z

Currently, the optimizer only does a good job of estimating the
selectivity of single-table ORed predicates when all disjuncts are tight
constraints. Otherwise we give up and go with the default selectivity of
1/3 for the entire filter. If any of the disjuncts has a selectivity
above 1/3, then the total selectivity should be at least as high as the
selectivity of that disjunct. Any failure to find a tight constraint for
a given disjunct should only cause a selectivity of 1/3 to be used for
that disjunct, not the entire filter.

This is addressed by using the method of combining ORed selectivities
introduced by #74303:

   Given predicates A and B and probability function P:

      P(A or B) = P(A) + P(B) - P(A and B)

The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.

Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.

cockroach-teamcity · 2022-10-05T01:24:16Z

This change is

rytaft · 2022-10-05T15:13:48Z

I see you added the DNM label -- is this ready for review or should I hold off?

msirek · 2022-10-05T15:54:18Z

I see you added the DNM label -- is this ready for review or should I hold off?

Not ready yet. I wanted to get started, but I need to add tests and see if there are cases where selectivity should be additive, for example, disjuncts on the same column: a=1 or a=2 or a=3. Though maybe these would be combined prior to getting to this spot in the code.

msirek

RFAL

Reviewable status: complete! 0 of 0 LGTMs obtained

michae2

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):

	scalarProps := filter.ScalarProps()
	constrainedCols.UnionWith(scalarProps.OuterCols)
	if scalarProps.Constraints != nil {

(This isn't really a comment on this PR, just something that is confusing me. Maybe this is a question for @rytaft or @mgartner.) I find this condition very confusing. There is something implicit connecting the existence of scalarProps.Constraints with not needing to handle these disjunctions, but I don't understand what it is. Can't constraints also describe disjunctions? And why does the first branch of this condition modify histograms and distinct counts, while the second branch modifies row count and selectivity? Shouldn't both branches modify the same parts of the statistics?

pkg/sql/opt/memo/statistics_builder.go line 4308 at r2 (raw file):

// The independence assumption may not be correct in all cases.
// Would using the full formula lead to more errors since the independence
// assumption is used in more terms in that formula?

Maybe I'm misunderstanding this comment or doing the math wrong, but isn't this iterative approach exactly the same as the textbook solution when using the independence assumption? In other words, assuming P(A and B) = P(A) * P(B) then both P((A or B) and C) from the iterative approach and the equivalent terms P(A and C) + P(B and C) - P(A and B and C) from the textbook solution become P(A)P(C) + P(B)P(C) - P(A)P(B)P(C). I think they only differ when not making the independence assumption. (Maybe that's exactly what this comment is saying?)

michae2

This is great, and the idea makes sense. It's just taking me some time to understand the code.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @rytaft)

rytaft

nice clean solution!

Reviewed 13 of 13 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @msirek)

pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):

Can't constraints also describe disjunctions?

Yes they can, but I think the idea is that if scalarProps.Constraints is not nil, it means the constraints library has already done the hard work of figuring out how the disjunction (or whatever the predicate is) maps to a constraint. We already know how to use a constraint to update the stats, so we fall back to that approach if it's available.

And why does the first branch of this condition modify histograms and distinct counts

It's clear how constraints modify the histograms and distinct counts, whereas I don't think it's so obvious for generic disjunctions. For example, I don't think you can say anything about the distinct counts / histograms of either x or y given the predicate x = 1 OR y = 1. I think we apply the resulting selectivity to the distinct count / histogram later on, which just reduces all buckets according to the selectivity.

Does this help?

pkg/sql/opt/memo/statistics_builder.go line 3203 at r2 (raw file):

		var selectivities []props.Selectivity
		selectivities = make([]props.Selectivity, 0, len(constraintUnion)+numUnappliedDisjuncts)

nit: no need to declare the var above, can just do selectivities := make(...

pkg/sql/opt/memo/statistics_builder.go line 3207 at r2 (raw file):

		// temporary stats struct, and combine the disjunct selectivities.
		// union the selectivity and row counts.
		sb.constrainExpr(e, constraintUnion[0], relProps, &unionStats)

you don't need this line anymore, and you can get rid of unionStats

pkg/sql/opt/memo/statistics_builder.go line 3221 at r2 (raw file):

		// constraintUnion.
		disjunctionSelectivity := combineOredSelectivities(selectivities)
		s.RowCount = disjunctionSelectivity.AsFloat() * s.RowCount

super nit: could use *=

pkg/sql/opt/memo/testdata/stats/select line 3857 at r2 (raw file):

           └── (((((c0:1 = 1) OR (c1:2 = 1)) OR (c2:3 = 1)) OR (c3:4 = 1)) OR (c4:5 = 1)) OR (c5:6 = 1) [type=bool, outer=(1-6)]

# Expected row count 0.19

It's 0.16 -- is that ok?

michae2

Reviewed all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @msirek)

pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Can't constraints also describe disjunctions?

Yes they can, but I think the idea is that if scalarProps.Constraints is not nil, it means the constraints library has already done the hard work of figuring out how the disjunction (or whatever the predicate is) maps to a constraint. We already know how to use a constraint to update the stats, so we fall back to that approach if it's available.

And why does the first branch of this condition modify histograms and distinct counts

It's clear how constraints modify the histograms and distinct counts, whereas I don't think it's so obvious for generic disjunctions. For example, I don't think you can say anything about the distinct counts / histograms of either x or y given the predicate x = 1 OR y = 1. I think we apply the resulting selectivity to the distinct count / histogram later on, which just reduces all buckets according to the selectivity.

Does this help?

Yes, it does. Thank you!

For posterity: constraint.Set can only describe disjunctions within one column, not disjunctions across columns. And histogram and distinct counts are column-level stats, whereas selectivity and row count are table-level stats. So if our disjunction is entirely within one column, we'll have a constraint set and we can apply it to column-level stats. If our disjunction is across columns we won't have a constraint set ((*constraint.Set).Union returns an unconstrained set, which turns into nil in (*memo.logicalPropsBuilder).buildFiltersItemProps) and can only change table-level stats.

michae2

Cool!

Reviewed 10 of 13 files at r2.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @mgartner and @msirek)

pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):

 ├── scan classrequest
 │    ├── columns: studentid:6(int!null) firstchoiceclassid:7(int) secondchoiceclassid:8(int) thirdchoiceclassid:9(int)
 │    ├── stats: [rows=1000, distinct(8)=100, null(8)=10]

This is interesting, I wonder why this has the distinct and null counts now?

pkg/sql/opt/memo/testdata/stats/select line 3609 at r2 (raw file):

            "num_eq": 1,
            "num_range": 10,
            "upper_bound": "10"

num_range and distinct_range should be 9 here and in all the histograms below (assuming these histograms represent the numbers {1 .. 10}).

pkg/sql/opt/memo/testdata/stats/select line 3613 at r2 (raw file):

    ],
    "histo_col_type": "INT8",
    "avg_size": 3

nit: not strictly necessary, but these objects should each also have "histo_version": 2, "null_count": 0

msirek

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 3203 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: no need to declare the var above, can just do selectivities := make(...

Fixed

Code quote:

selectivities = make([]props.Selectivity, 0, len(constraintUnion)+numUnappliedDisjuncts)

pkg/sql/opt/memo/statistics_builder.go line 3207 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

you don't need this line anymore, and you can get rid of unionStats

Fixed

Code quote:

sb.constrainExpr(e, constraintUnion[0], relProps, &unionStats)

pkg/sql/opt/memo/statistics_builder.go line 3221 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

super nit: could use *=

modified as suggested

Code quote:

s.RowCount = disjunctionSelectivity.AsFloat() * s.RowCount

pkg/sql/opt/memo/statistics_builder.go line 4308 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Maybe I'm misunderstanding this comment or doing the math wrong, but isn't this iterative approach exactly the same as the textbook solution when using the independence assumption? In other words, assuming P(A and B) = P(A) * P(B) then both P((A or B) and C) from the iterative approach and the equivalent terms P(A and C) + P(B and C) - P(A and B and C) from the textbook solution become P(A)P(C) + P(B)P(C) - P(A)P(B)P(C). I think they only differ when not making the independence assumption. (Maybe that's exactly what this comment is saying?)

Thanks for looking at this in detail. Your math looks correct, and I worked through it as well. Previously, I thought I had an example where the results from the fully-expanded formula and iterative approach differed, but I could not find it or reconstruct it. I've updated the comments.

pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This is interesting, I wonder why this has the distinct and null counts now?

It is due to the following diff in buildDisjunctionConstraints():

If any non-tight disjuncts were found, we'd give up building any disjunction constraints. Maybe the default when there is no constraint does not initialize a default number of nulls and distinct values? Do you think we should be?

pkg/sql/opt/memo/testdata/stats/select line 3609 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

num_range and distinct_range should be 9 here and in all the histograms below (assuming these histograms represent the numbers {1 .. 10}).

Fixed

Code quote:

"distinct_range": 10,

pkg/sql/opt/memo/testdata/stats/select line 3613 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: not strictly necessary, but these objects should each also have "histo_version": 2, "null_count": 0

Added

Code quote:

"histo_col_type": "INT8",

pkg/sql/opt/memo/testdata/stats/select line 3857 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It's 0.16 -- is that ok?

Not sure what changed, but it's back to 0.19 in the latest code.

Code quote:

Expected row count 0.19

michae2

Reviewed 2 of 13 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @msirek, and @rytaft)

pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…

It is due to the following diff in buildDisjunctionConstraints():

If any non-tight disjuncts were found, we'd give up building any disjunction constraints. Maybe the default when there is no constraint does not initialize a default number of nulls and distinct values? Do you think we should be?

It seems fine the way it is now (using the defaults).

rytaft

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @msirek)

mgartner

Very nice!

Reviewed 10 of 13 files at r2, 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)

This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None

mgartner · 2022-12-29T18:49:28Z

I'm going to backport this to v22.1 and v22.2 along with #94389 which will gate it behind a session setting.

This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None

94389: sql: add optimizer_use_improved_disjunction_stats session setting r=mgartner a=mgartner This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in #89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Epic: None Release note: None 94428: kvserver: attempt deflake rq multistore rebalance r=kvoli a=kvoli Previously,`TestReplicateQueueRebalanceMultiStore` would calculate the cluster lease and range count to assert against that the cluster balanced. This was problematic as the number of ranges may reduce over the duration of the test, or an inconsistent snapshot of range count could be taken. This patch attempts to deflake the test by using the total number of ranges that is returned when calculating the replicas and leases per store, in order to avoid the above inconsistency. Stressing with the patch: ``` $ dev test pkg/kv/kvserver -fTestReplicateQueueRebalanceMultiStore -v --stress ... 195 runs so far, 0 failures, over 18m25s ``` Informs: #92219 Release note: None Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Austen McClernon <austen@cockroachlabs.com>

This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None

These subtle bugs were discovered while creating cockroachdb#94439, the backport of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that backport. The backport to v22.2 has not yet been created, and it will include this commit. Release note: None

This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None

This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `false`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None

92327: backupccl: add tenant backup nemesis tests r=msbutler a=stevendanna This adds a bit of randomized testing to stress incremental tenant backups. Tenant backups were previously untenable because of non-MVCC operations. Now that we support MVCC-compatible bulk operations, they should in principal be possible. Epic: CRDB-20264 Release note: None 94442: opt: fix minor mistakes in legacy disjunction stats logic r=mgartner a=mgartner These subtle bugs were discovered while creating #94439, the backport of #89358 and #94389 to v22.1. They have already been included in that backport. The backport to v22.2 has not yet been created, and it will include this commit. Epic: None Release note: None Co-authored-by: Steven Danna <danna@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>

This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None

These subtle bugs were discovered while creating cockroachdb#94439, the backport of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that backport. The backport to v22.2 has not yet been created, and it will include this commit. Release note: None

This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `false`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None

These subtle bugs were discovered while creating cockroachdb#94439, the backport of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that backport. The backport to v22.2 has not yet been created, and it will include this commit. Release note: None

msirek requested a review from a team as a code owner October 5, 2022 01:24

msirek marked this pull request as draft October 5, 2022 01:24

msirek changed the title ~~memo: better selectivity estimates on single-table ORed predicates~~ [DNM] memo: better selectivity estimates on single-table ORed predicates Oct 5, 2022

msirek added the do-not-merge bors won't merge a PR with this label. label Oct 5, 2022

This was referenced Oct 5, 2022

Better selectivity estimates for ORed predicates #75090

Closed

An Issue of Estimated Rows #89161

Closed

msirek force-pushed the 75090 branch 3 times, most recently from 1f0ea03 to 37b6429 Compare November 6, 2022 23:40

msirek changed the title ~~[DNM] memo: better selectivity estimates on single-table ORed predicates~~ memo: better selectivity estimates on single-table ORed predicates Nov 6, 2022

msirek removed the do-not-merge bors won't merge a PR with this label. label Nov 6, 2022

msirek marked this pull request as ready for review November 6, 2022 23:41

msirek commented Nov 6, 2022

View reviewed changes

msirek requested review from michae2 and rytaft November 6, 2022 23:42

msirek force-pushed the 75090 branch 2 times, most recently from e405479 to f42ee26 Compare November 8, 2022 14:55

michae2 reviewed Nov 17, 2022

View reviewed changes

rytaft approved these changes Nov 22, 2022

View reviewed changes

michae2 reviewed Nov 22, 2022

View reviewed changes

michae2 approved these changes Nov 23, 2022

View reviewed changes

mgartner self-requested a review November 23, 2022 15:27

msirek force-pushed the 75090 branch from f42ee26 to acbe4d6 Compare November 29, 2022 21:54

msirek commented Nov 29, 2022

View reviewed changes

michae2 approved these changes Nov 29, 2022

View reviewed changes

rytaft approved these changes Nov 29, 2022

View reviewed changes

craig bot merged commit b21fd88 into cockroachdb:master Nov 30, 2022

mgartner reviewed Nov 30, 2022

View reviewed changes

msirek deleted the 75090 branch December 22, 2022 21:39

mgartner mentioned this pull request Dec 28, 2022

sql: add optimizer_use_improved_disjunction_stats session setting #94389

Merged

mgartner added backport-22.1.x labels Dec 29, 2022

mgartner mentioned this pull request Dec 29, 2022

release-22.1: memo: better selectivity estimates on single-table ORed predicates #94439

Merged

mgartner mentioned this pull request Dec 29, 2022

opt: fix minor mistakes in legacy disjunction stats logic #94442

Merged

mgartner mentioned this pull request Jan 3, 2023

release-22.2: memo: better selectivity estimates on single-table ORed predicates #94664

Merged

msirek mentioned this pull request Jul 14, 2023

ORed predicate selectivity overestimate makes full scan of LIMIT query look cheap #106872

Open

Conversation

msirek commented Oct 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Oct 5, 2022

Uh oh!

rytaft commented Oct 5, 2022

Uh oh!

msirek commented Oct 5, 2022

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

mgartner left a comment

Choose a reason for hiding this comment

Uh oh!

mgartner commented Dec 29, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

msirek commented Oct 5, 2022 •

edited

Loading