memo: better selectivity estimates on single-table ORed predicates#89358
memo: better selectivity estimates on single-table ORed predicates#89358craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
|
I see you added the DNM label -- is this ready for review or should I hold off? |
Not ready yet. I wanted to get started, but I need to add tests and see if there are cases where selectivity should be additive, for example, disjuncts on the same column: a=1 or a=2 or a=3. Though maybe these would be combined prior to getting to this spot in the code. |
1f0ea03 to
37b6429
Compare
msirek
left a comment
There was a problem hiding this comment.
RFAL
Reviewable status:
complete! 0 of 0 LGTMs obtained
e405479 to
f42ee26
Compare
michae2
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @msirek and @rytaft)
pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):
scalarProps := filter.ScalarProps() constrainedCols.UnionWith(scalarProps.OuterCols) if scalarProps.Constraints != nil {
(This isn't really a comment on this PR, just something that is confusing me. Maybe this is a question for @rytaft or @mgartner.) I find this condition very confusing. There is something implicit connecting the existence of scalarProps.Constraints with not needing to handle these disjunctions, but I don't understand what it is. Can't constraints also describe disjunctions? And why does the first branch of this condition modify histograms and distinct counts, while the second branch modifies row count and selectivity? Shouldn't both branches modify the same parts of the statistics?
pkg/sql/opt/memo/statistics_builder.go line 4308 at r2 (raw file):
// The independence assumption may not be correct in all cases. // Would using the full formula lead to more errors since the independence // assumption is used in more terms in that formula?
Maybe I'm misunderstanding this comment or doing the math wrong, but isn't this iterative approach exactly the same as the textbook solution when using the independence assumption? In other words, assuming P(A and B) = P(A) * P(B) then both P((A or B) and C) from the iterative approach and the equivalent terms P(A and C) + P(B and C) - P(A and B and C) from the textbook solution become P(A)P(C) + P(B)P(C) - P(A)P(B)P(C). I think they only differ when not making the independence assumption. (Maybe that's exactly what this comment is saying?)
michae2
left a comment
There was a problem hiding this comment.
This is great, and the idea makes sense. It's just taking me some time to understand the code.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @msirek and @rytaft)
rytaft
left a comment
There was a problem hiding this comment.
Reviewed 13 of 13 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @msirek)
pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):
Can't constraints also describe disjunctions?
Yes they can, but I think the idea is that if scalarProps.Constraints is not nil, it means the constraints library has already done the hard work of figuring out how the disjunction (or whatever the predicate is) maps to a constraint. We already know how to use a constraint to update the stats, so we fall back to that approach if it's available.
And why does the first branch of this condition modify histograms and distinct counts
It's clear how constraints modify the histograms and distinct counts, whereas I don't think it's so obvious for generic disjunctions. For example, I don't think you can say anything about the distinct counts / histograms of either x or y given the predicate x = 1 OR y = 1. I think we apply the resulting selectivity to the distinct count / histogram later on, which just reduces all buckets according to the selectivity.
Does this help?
pkg/sql/opt/memo/statistics_builder.go line 3203 at r2 (raw file):
var selectivities []props.Selectivity selectivities = make([]props.Selectivity, 0, len(constraintUnion)+numUnappliedDisjuncts)
nit: no need to declare the var above, can just do selectivities := make(...
pkg/sql/opt/memo/statistics_builder.go line 3207 at r2 (raw file):
// temporary stats struct, and combine the disjunct selectivities. // union the selectivity and row counts. sb.constrainExpr(e, constraintUnion[0], relProps, &unionStats)
you don't need this line anymore, and you can get rid of unionStats
pkg/sql/opt/memo/statistics_builder.go line 3221 at r2 (raw file):
// constraintUnion. disjunctionSelectivity := combineOredSelectivities(selectivities) s.RowCount = disjunctionSelectivity.AsFloat() * s.RowCount
super nit: could use *=
pkg/sql/opt/memo/testdata/stats/select line 3857 at r2 (raw file):
└── (((((c0:1 = 1) OR (c1:2 = 1)) OR (c2:3 = 1)) OR (c3:4 = 1)) OR (c4:5 = 1)) OR (c5:6 = 1) [type=bool, outer=(1-6)] # Expected row count 0.19
It's 0.16 -- is that ok?
michae2
left a comment
There was a problem hiding this comment.
Reviewed all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @msirek)
pkg/sql/opt/memo/statistics_builder.go line 3181 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Can't constraints also describe disjunctions?
Yes they can, but I think the idea is that if
scalarProps.Constraintsis not nil, it means the constraints library has already done the hard work of figuring out how the disjunction (or whatever the predicate is) maps to a constraint. We already know how to use a constraint to update the stats, so we fall back to that approach if it's available.And why does the first branch of this condition modify histograms and distinct counts
It's clear how constraints modify the histograms and distinct counts, whereas I don't think it's so obvious for generic disjunctions. For example, I don't think you can say anything about the distinct counts / histograms of either
xorygiven the predicatex = 1 OR y = 1. I think we apply the resulting selectivity to the distinct count / histogram later on, which just reduces all buckets according to the selectivity.Does this help?
Yes, it does. Thank you!
For posterity: constraint.Set can only describe disjunctions within one column, not disjunctions across columns. And histogram and distinct counts are column-level stats, whereas selectivity and row count are table-level stats. So if our disjunction is entirely within one column, we'll have a constraint set and we can apply it to column-level stats. If our disjunction is across columns we won't have a constraint set ((*constraint.Set).Union returns an unconstrained set, which turns into nil in (*memo.logicalPropsBuilder).buildFiltersItemProps) and can only change table-level stats.
michae2
left a comment
There was a problem hiding this comment.
Reviewed 10 of 13 files at r2.
Reviewable status:complete! 2 of 0 LGTMs obtained (waiting on @mgartner and @msirek)
pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):
├── scan classrequest │ ├── columns: studentid:6(int!null) firstchoiceclassid:7(int) secondchoiceclassid:8(int) thirdchoiceclassid:9(int) │ ├── stats: [rows=1000, distinct(8)=100, null(8)=10]
This is interesting, I wonder why this has the distinct and null counts now?
pkg/sql/opt/memo/testdata/stats/select line 3609 at r2 (raw file):
"num_eq": 1, "num_range": 10, "upper_bound": "10"
num_range and distinct_range should be 9 here and in all the histograms below (assuming these histograms represent the numbers {1 .. 10}).
pkg/sql/opt/memo/testdata/stats/select line 3613 at r2 (raw file):
], "histo_col_type": "INT8", "avg_size": 3
nit: not strictly necessary, but these objects should each also have "histo_version": 2, "null_count": 0
msirek
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)
pkg/sql/opt/memo/statistics_builder.go line 3203 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: no need to declare the var above, can just do
selectivities := make(...
Fixed
Code quote:
selectivities = make([]props.Selectivity, 0, len(constraintUnion)+numUnappliedDisjuncts)pkg/sql/opt/memo/statistics_builder.go line 3207 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
you don't need this line anymore, and you can get rid of
unionStats
Fixed
Code quote:
sb.constrainExpr(e, constraintUnion[0], relProps, &unionStats)pkg/sql/opt/memo/statistics_builder.go line 3221 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
super nit: could use
*=
modified as suggested
Code quote:
s.RowCount = disjunctionSelectivity.AsFloat() * s.RowCountpkg/sql/opt/memo/statistics_builder.go line 4308 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Maybe I'm misunderstanding this comment or doing the math wrong, but isn't this iterative approach exactly the same as the textbook solution when using the independence assumption? In other words, assuming
P(A and B) = P(A) * P(B)then bothP((A or B) and C)from the iterative approach and the equivalent termsP(A and C) + P(B and C) - P(A and B and C)from the textbook solution becomeP(A)P(C) + P(B)P(C) - P(A)P(B)P(C). I think they only differ when not making the independence assumption. (Maybe that's exactly what this comment is saying?)
Thanks for looking at this in detail. Your math looks correct, and I worked through it as well. Previously, I thought I had an example where the results from the fully-expanded formula and iterative approach differed, but I could not find it or reconstruct it. I've updated the comments.
pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
This is interesting, I wonder why this has the distinct and null counts now?
It is due to the following diff in buildDisjunctionConstraints():
If any non-tight disjuncts were found, we'd give up building any disjunction constraints. Maybe the default when there is no constraint does not initialize a default number of nulls and distinct values? Do you think we should be?
pkg/sql/opt/memo/testdata/stats/select line 3609 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
num_rangeanddistinct_rangeshould be 9 here and in all the histograms below (assuming these histograms represent the numbers {1 .. 10}).
Fixed
Code quote:
"distinct_range": 10,
pkg/sql/opt/memo/testdata/stats/select line 3613 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: not strictly necessary, but these objects should each also have
"histo_version": 2, "null_count": 0
Added
Code quote:
"histo_col_type": "INT8",
pkg/sql/opt/memo/testdata/stats/select line 3857 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
It's 0.16 -- is that ok?
Not sure what changed, but it's back to 0.19 in the latest code.
Code quote:
Expected row count 0.19
michae2
left a comment
There was a problem hiding this comment.
Reviewed 2 of 13 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @msirek, and @rytaft)
pkg/sql/opt/memo/testdata/stats/join line 1753 at r2 (raw file):
Previously, msirek (Mark Sirek) wrote…
It is due to the following diff in
buildDisjunctionConstraints():If any non-tight disjuncts were found, we'd give up building any disjunction constraints. Maybe the default when there is no constraint does not initialize a default number of nulls and distinct values? Do you think we should be?
It seems fine the way it is now (using the defaults).
rytaft
left a comment
There was a problem hiding this comment.
Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @msirek)
mgartner
left a comment
There was a problem hiding this comment.
Very nice!
Reviewed 10 of 13 files at r2, 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 2 stale)
This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None
|
I'm going to backport this to v22.1 and v22.2 along with #94389 which will gate it behind a session setting. |
This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None
94389: sql: add optimizer_use_improved_disjunction_stats session setting r=mgartner a=mgartner This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in #89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Epic: None Release note: None 94428: kvserver: attempt deflake rq multistore rebalance r=kvoli a=kvoli Previously,`TestReplicateQueueRebalanceMultiStore` would calculate the cluster lease and range count to assert against that the cluster balanced. This was problematic as the number of ranges may reduce over the duration of the test, or an inconsistent snapshot of range count could be taken. This patch attempts to deflake the test by using the total number of ranges that is returned when calculating the replicas and leases per store, in order to avoid the above inconsistency. Stressing with the patch: ``` $ dev test pkg/kv/kvserver -fTestReplicateQueueRebalanceMultiStore -v --stress ... 195 runs so far, 0 failures, over 18m25s ``` Informs: #92219 Release note: None Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None
This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None
These subtle bugs were discovered while creating cockroachdb#94439, the backport of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that backport. The backport to v22.2 has not yet been created, and it will include this commit. Release note: None
This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None
This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `false`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None
92327: backupccl: add tenant backup nemesis tests r=msbutler a=stevendanna This adds a bit of randomized testing to stress incremental tenant backups. Tenant backups were previously untenable because of non-MVCC operations. Now that we support MVCC-compatible bulk operations, they should in principal be possible. Epic: CRDB-20264 Release note: None 94442: opt: fix minor mistakes in legacy disjunction stats logic r=mgartner a=mgartner These subtle bugs were discovered while creating #94439, the backport of #89358 and #94389 to v22.1. They have already been included in that backport. The backport to v22.2 has not yet been created, and it will include this commit. Epic: None Release note: None Co-authored-by: Steven Danna <danna@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `true`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None
These subtle bugs were discovered while creating cockroachdb#94439, the backport of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that backport. The backport to v22.2 has not yet been created, and it will include this commit. Release note: None
This commit adds the `optimizer_use_improved_disjunction_stats` session setting that defaults to `false`. When `true`, the improved disjunction logic added in cockroachdb#89358 is performed. When `false`, the logic reverts to the previous logic. This setting will allow use to backport the new logic without impacting query plans of running clusters by defaulting the setting to false. Release note: None
These subtle bugs were discovered while creating cockroachdb#94439, the backport of cockroachdb#89358 and cockroachdb#94389 to v22.1. They have already been included in that backport. The backport to v22.2 has not yet been created, and it will include this commit. Release note: None

Fixes #75090
Currently, the optimizer only does a good job of estimating the
selectivity of single-table ORed predicates when all disjuncts are tight
constraints. Otherwise we give up and go with the default selectivity of
1/3 for the entire filter. If any of the disjuncts has a selectivity
above 1/3, then the total selectivity should be at least as high as the
selectivity of that disjunct. Any failure to find a tight constraint for
a given disjunct should only cause a selectivity of 1/3 to be used for
that disjunct, not the entire filter.
This is addressed by using the method of combining ORed selectivities
introduced by #74303:
The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.
Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.