
opt: add support for calculating selectivity from multi-column stats #49134

Merged
craig[bot] merged 4 commits into cockroachdb:master from rytaft:multi-col3
May 29, 2020

Conversation

@rytaft
Collaborator

@rytaft rytaft commented May 15, 2020

opt: fix bug with histograms and multi-span index constraints

Prior to this commit, filtering a histogram with a multi-span index
constraint could lead to incorrect results if the histogram column
was part of the exact prefix of the constraint. This was due to the
fact that the same value appeared in multiple spans, breaking the
assumption of the histogram code that values in spans were always
ordered and non-overlapping. For example, filtering a histogram on
column b with the constraint /b/c/: [/1/2 - /1/2] [/1/4 - /1/4]
would return an incorrect result, because the values in the matching
histogram bucket would be counted twice.

This commit fixes the problem by only considering a single span if
the column is part of the exact prefix.

Release note (performance improvement): Fixed a bug in the
histogram logic in the optimizer which was causing an over-estimate
for the cardinality of constrained index scans in some cases when
multiple columns of the index were constrained. This problem was
introduced early in the development of the 20.2 release, so it should
never have been part of a released version. The fix improves the optimizer's
cardinality estimates and so may result in better query plan selection.
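The double-counting can be illustrated with a toy model (the types, names, and bucket contents here are invented for illustration and do not reflect the real histogram code):

```go
package main

import "fmt"

// span constrains the histogram column b to a single value; both spans of
// the example constraint /b/c/: [/1/2 - /1/2] [/1/4 - /1/4] have b = 1.
type span struct{ bVal int }

// bucketCount is a toy histogram for column b: value -> row count.
var bucketCount = map[int]float64{1: 500, 2: 300}

// buggyEstimate sums the matching bucket once per span. Because both spans
// share b = 1 in their exact prefix, the bucket is counted twice.
func buggyEstimate(spans []span) float64 {
	var rows float64
	for _, s := range spans {
		rows += bucketCount[s.bVal]
	}
	return rows
}

// fixedEstimate mirrors the fix: when the histogram column is part of the
// exact prefix, every span has the same value, so only one span is used.
func fixedEstimate(spans []span) float64 {
	return bucketCount[spans[0].bVal]
}

func main() {
	spans := []span{{bVal: 1}, {bVal: 1}} // /1/2 and /1/4 both have b = 1
	fmt.Println(buggyEstimate(spans))     // 1000: bucket counted twice
	fmt.Println(fixedEstimate(spans))     // 500
}
```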

opt: fix distinct count estimates for index constraint columns

Prior to this commit, the statistics_builder was incorrectly estimating
the distinct count of columns that were only slightly constrained as part
of an index constraint. For example, it was estimating based on
constraints such as /a/b: [/1 - /5/6] or /a/b: [ - /5/6] that the distinct
count of column b should be reduced by 2/3. However, in reality, we cannot
assume anything about the distinct count of column b based on those two
constraints.

This commit fixes the estimate by only reducing the distinct count for
columns that are part of the prefix of the constraint (columns for which
all the spans have the same start and end values) or the first column after.

Release note (performance improvement): Fixed the optimizer's distinct count
estimate for columns constrained by an index constraint, which was too low
in some cases. The fix improves the optimizer's cardinality estimates, which
can lead to better query plan selection.
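A sketch of the prefix rule, under the simplifying assumption that spans are tuples of integer keys (types and names are illustrative, not from the actual constraint code):

```go
package main

import "fmt"

// span holds the start and end keys of one index constraint span, one
// value per indexed column (a simplified model of the real representation).
type span struct{ start, end []int }

// reducibleCols returns how many leading columns may have their distinct
// counts reduced: the exact-prefix columns (start value equals end value in
// every span) plus at most one more column, mirroring the fix above.
func reducibleCols(spans []span, numCols int) int {
	prefix := 0
	for col := 0; col < numCols; col++ {
		exact := true
		for _, s := range spans {
			if col >= len(s.start) || col >= len(s.end) || s.start[col] != s.end[col] {
				exact = false
				break
			}
		}
		if !exact {
			break
		}
		prefix++
	}
	if prefix < numCols {
		return prefix + 1 // the first column after the prefix may also be reduced
	}
	return prefix
}

func main() {
	// /a/b: [/1 - /5/6] -> no exact prefix; only column a may be reduced.
	fmt.Println(reducibleCols([]span{{start: []int{1}, end: []int{5, 6}}}, 2)) // 1
	// /b/c/: [/1/2 - /1/2] [/1/4 - /1/4] -> both columns are exact; both reducible.
	fmt.Println(reducibleCols([]span{
		{start: []int{1, 2}, end: []int{1, 2}},
		{start: []int{1, 4}, end: []int{1, 4}},
	}, 2)) // 2
}
```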

opt: ensure multi-col stats are consistent with single-col stats

Prior to this commit, it was possible that the estimated distinct count
of a multi-column statistic was larger than the product of the estimated
distinct counts of all of its individual columns. This could happen because
we reduce the distinct count of columns that are constrained by a predicate,
but we don't update the distinct counts of any multi-column stats that
contain that column. This is internally inconsistent and could lead to bad
cardinality estimates.

This commit fixes the problem by explicitly checking for such
inconsistencies and fixing them after the stats for each operator are
built or a column stat is calculated. If it is found that the estimated
distinct count is too high, it is reset to the product of the distinct
counts of its constituent columns.

This commit also adds a check that the distinct count of the multi-column
statistic is no smaller than the largest distinct count of its constituent
columns (although I haven't yet found a test case where this was a problem).

Release note (performance improvement): Fixed the optimizer's estimated
distinct count for a multi-column statistic when all of the columns in the
statistic are constrained by a filter predicate. The fix can lead to
improved cardinality estimates, leading to better query plan selection
in some cases.
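The two consistency checks can be sketched as a simple clamp (function and names are illustrative, not the actual statistics_builder code):

```go
package main

import "fmt"

// clampMultiColDistinct enforces the consistency checks described above:
// a multi-column distinct count can be no larger than the product of its
// constituent columns' distinct counts and no smaller than the largest of
// them.
func clampMultiColDistinct(multiCol float64, cols []float64) float64 {
	product, maxCol := 1.0, 0.0
	for _, d := range cols {
		product *= d
		if d > maxCol {
			maxCol = d
		}
	}
	if multiCol > product {
		multiCol = product // reset to product of constituent distinct counts
	}
	if multiCol < maxCol {
		multiCol = maxCol // no smaller than the largest constituent count
	}
	return multiCol
}

func main() {
	// Columns were reduced to 2 and 3 distinct values, but a stale
	// multi-column count still says 50; it is reset to 2 * 3 = 6.
	fmt.Println(clampMultiColDistinct(50, []float64{2, 3})) // 6
	fmt.Println(clampMultiColDistinct(4, []float64{2, 3}))  // 4: already consistent
}
```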

opt: add support for calculating selectivity from multi-column stats

This commit adds support for calculating selectivity from multi-column
statistics. It changes selectivityFromDistinctCounts to have the following
semantics:

selectivityFromDistinctCounts calculates the selectivity of a filter
by using estimated distinct counts of each constrained column before
and after the filter was applied. We can perform this calculation in
two different ways: (1) by treating the columns as completely independent,
or (2) by assuming they are correlated.

(1) Assuming independence between columns, we can calculate the selectivity
by taking the product of selectivities of each constrained column. In
the general case, this can be represented by the formula:

                     ┬-┬ ⎛ new_distinct(i) ⎞
      selectivity =  │ │ ⎜ --------------- ⎟
                     ┴ ┴ ⎝ old_distinct(i) ⎠
                    i in
                 {constrained
                   columns}

(2) If useMultiCol is true, we assume there is some correlation between
columns. In this case, we calculate the selectivity using multi-column
statistics.

                    ⎛ new_distinct({constrained columns}) ⎞
      selectivity = ⎜ ----------------------------------- ⎟
                    ⎝ old_distinct({constrained columns}) ⎠
This formula looks simple, but the challenge is that it is difficult
to determine the correct value for new_distinct({constrained columns})
if each column is not constrained to a single value. For example, if
new_distinct(x)=2 and new_distinct(y)=2, new_distinct({x,y}) could be 2,
3 or 4. We estimate the new distinct count as follows, using the concept
of "soft functional dependency (FD) strength":
      new_distinct({x,y}) = min_value + range * (1 - FD_strength_scaled)

    where

      min_value = max(new_distinct(x), new_distinct(y))
      max_value = new_distinct(x) * new_distinct(y)
      range     = max_value - min_value

                    ⎛ max(old_distinct(x),old_distinct(y)) ⎞
      FD_strength = ⎜ ------------------------------------ ⎟
                    ⎝         old_distinct({x,y})          ⎠

                        ⎛ max(old_distinct(x), old_distinct(y)) ⎞
      min_FD_strength = ⎜ ------------------------------------- ⎟
                        ⎝   old_distinct(x) * old_distinct(y)   ⎠

                           ⎛ FD_strength - min_FD_strength ⎞
      FD_strength_scaled = ⎜ ----------------------------- ⎟
                           ⎝      1 - min_FD_strength      ⎠
Suppose that old_distinct(x)=100 and old_distinct(y)=10. If x and y are
perfectly correlated, old_distinct({x,y})=100. Using the example from
above, new_distinct(x)=2 and new_distinct(y)=2. Plugging in the values
into the equation, we get:
      FD_strength_scaled  = 1
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 1) = 2
If x and y are completely independent, however, old_distinct({x,y})=1000.
In this case, we get:
      FD_strength_scaled  = 0
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 0) = 4
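The formula above can be transcribed directly; this sketch reproduces both worked examples (names are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// estimateMultiColDistinct estimates new_distinct({x,y}) using the soft
// functional dependency (FD) strength formula described in the text.
func estimateMultiColDistinct(newX, newY, oldX, oldY, oldXY float64) float64 {
	minVal := math.Max(newX, newY)
	maxVal := newX * newY
	rng := maxVal - minVal

	maxOld := math.Max(oldX, oldY)
	fdStrength := maxOld / oldXY
	minFDStrength := maxOld / (oldX * oldY)
	fdStrengthScaled := (fdStrength - minFDStrength) / (1 - minFDStrength)

	return minVal + rng*(1-fdStrengthScaled)
}

func main() {
	// Perfectly correlated: old_distinct({x,y}) = 100.
	fmt.Println(estimateMultiColDistinct(2, 2, 100, 10, 100)) // 2
	// Completely independent: old_distinct({x,y}) = 1000.
	fmt.Println(estimateMultiColDistinct(2, 2, 100, 10, 1000)) // 4
}
```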

Note that even if useMultiCol is true and we calculate the selectivity
based on equation (2) above, we still want to take equation (1) into
account. This is because it is possible that there are two predicates that
each have selectivity s, but the multi-column selectivity is also s. In
order to ensure that the cost model considers the two predicates combined
to be more selective than either one individually, we must give some weight
to equation (1). Therefore, instead of equation (2) we actually return the
following selectivity:

  selectivity = (1 - w) * (eq. 1) + w * (eq. 2)

where w is currently set to 0.9.

This selectivity is then used to update the row count and the distinct
counts of the unconstrained columns.
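A minimal sketch of the weighted combination, assuming w = 0.9 as stated; it shows that the blend keeps the conjunction more selective than either predicate alone even when equation (2) by itself would not:

```go
package main

import "fmt"

// blendSelectivity combines the independence-based selectivity (eq. 1,
// the product of per-column selectivities) with the multi-column
// selectivity (eq. 2) using weight w = 0.9.
func blendSelectivity(singleCol []float64, multiCol float64) float64 {
	const w = 0.9
	eq1 := 1.0
	for _, s := range singleCol {
		eq1 *= s
	}
	return (1-w)*eq1 + w*multiCol
}

func main() {
	// Two predicates, each with selectivity 0.1, whose multi-column
	// selectivity is also 0.1. The blend is ~0.091, which is below 0.1,
	// so the cost model still treats the conjunction as more selective
	// than either predicate individually.
	fmt.Println(blendSelectivity([]float64{0.1, 0.1}, 0.1))
}
```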

Fixes #34422

Release note (performance improvement): Added support for calculating the
selectivity of filter predicates in the optimizer using multi-column
statistics. This improves the cardinality estimates of the optimizer when
a query has filter predicates constraining multiple columns. As a result,
the optimizer may choose a better query plan in some cases.

@rytaft rytaft requested a review from a team as a code owner May 15, 2020 20:33
@cockroach-teamcity
Member

This change is Reviewable

Member

@RaduBerinde RaduBerinde left a comment


This must have been a lot of work!

First three commits LGTM. Still looking through the last one.

Should we add multi-column stats to the tpcc and tpch stats in testfixtures with this change? That way we would see what the impact on the plans is.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go, line 3115 at r4 (raw file):

//                            ⎛ FD_strength - min_FD_strength ⎞  // scales FD_strength
//       FD_strength_scaled = ⎜ ----------------------------- ⎟  // to be between
//                            ⎝      1 - min_FD_strength      ⎠  // 0 and 1

This is very cool! Is it from somewhere? Would be nice to include a reference


pkg/sql/opt/memo/statistics_builder.go, line 3140 at r4 (raw file):

// following selectivity:
//
//   selectivity = (1 - w) * (eq. 1) + w * (eq. 2)

[nit] use equation, it's more clear when you're reading just that line (it stands out)


pkg/sql/opt/memo/statistics_builder.go, line 3177 at r4 (raw file):

	}

	var multiColSet opt.ColSet

This code is pretty hard to follow. For example, it wasn't clear without carefully looking at all the code that if useMultiCol is false, we'll always return in the if multiColSet.Len() <= 1 check.

I think it would be easier to understand if we completely separated the two paths (perhaps in two different variants of the function). The multi-col variant could call the non-multi-col variant for the final weighted sum.


pkg/sql/opt/memo/statistics_builder.go, line 3233 at r4 (raw file):

	// comment above the function definition for details about the formula.

	// When there are 3 or more columns in the set and at least one has

Curious if you saw any issues that were fixed by this check.

I keep wondering if things would be simpler if we made all distinct counts always be >= 1.


pkg/sql/opt/memo/testdata/logprops/scan, line 119 at r4 (raw file):

SELECT * FROM t WHERE b IN ('a', 'b') AND c IN (1, 2) AND a IN (2, 3)
----
select

I think this was meant to be a scan with all the given constraints, maybe force the index with a hint


pkg/sql/opt/xform/testdata/external/trading, line 924 at r4 (raw file):

ORDER BY extract(day from d.TransactionDate)
----
sort

@andy-kimball would be good if you could take a look over the changes in this file since you already understand these queries very well, and see if there are any concerning plan changes

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, and @rytaft)


pkg/sql/opt/xform/testdata/external/trading, line 837 at r4 (raw file):

 │    │         │    │    ├── key: (8)
 │    │         │    │    ├── fd: ()-->(7), (1)-->(2-6), (2,4,5)~~>(1,3,6), (8)-->(9-15), (15)-->(8-14), (1)==(8), (8)==(1)
 │    │         │    │    ├── select

This query plan is significantly worse, because we are doing a table scan rather than just fetching the top 50 rows after the paging key. I don't think the stats/costing are at fault here, though. This is another example where we're not doing a good enough job pushing down limit hints (or even hard limits, if the optimizer gets smart enough). Go ahead and add a "Problem" comment above to this effect.


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

      │    │    │    │    │    ├── columns: transactions.dealerid:9!null transactions.isbuy:10!null date:11!null accountname:12!null customername:13!null
      │    │    │    │    │    ├── constraint: /9/10/11: [/1/false/'2020-02-23 00:00:00+00:00' - /1/false/'2020-03-01 00:00:00+00:00']
      │    │    │    │    │    ├── stats: [rows=1181111.11, distinct(9)=1, null(9)=0, distinct(10)=1, null(10)=0, distinct(11)=1181111.11, null(11)=0, distinct(9,10)=1, null(9,10)=0, distinct(9-11)=1181111.11, null(9-11)=0]

How come this estimate jumped up so much higher than it was? Before, it was ~12K. Now it's ~1M. The date constraints only select ~1 week worth of transactions, out of ~10 years of transactions. That means we're selecting only about ~0.1% of the rows in the table. I'd expect the estimate to be on the order of 20K rows, since there are 20M rows in the table. We were right on the money with our old estimate, but the new estimate is almost 2 orders of magnitude off.

Collaborator Author

@rytaft rytaft left a comment


TFTR!

Should we add multi-column stats to the tpcc and tpch stats in testfixtures with this change? That way we would see what the impact on the plans is.

I added those already in a different PR (#48374). No plans changed and no estimates got significantly worse (or better).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, and @RaduBerinde)


pkg/sql/opt/memo/statistics_builder.go, line 3115 at r4 (raw file):

Previously, RaduBerinde wrote…

This is very cool! Is it from somewhere? Would be nice to include a reference

Yea, the FD strength formula is from a paper I found, although they use it for a different purpose. Anyway, I agree it's good to reference it -- added.


pkg/sql/opt/memo/statistics_builder.go, line 3140 at r4 (raw file):

Previously, RaduBerinde wrote…

[nit] use equation, it's more clear when you're reading just that line (it stands out)

Done.


pkg/sql/opt/memo/statistics_builder.go, line 3177 at r4 (raw file):

Previously, RaduBerinde wrote…

This code is pretty hard to follow. For example, it wasn't clear without carefully looking at all the code that if useMultiCol is false, we'll always return in the if multiColSet.Len() <= 1 check.

I think it would be easier to understand if we completely separated the two paths (perhaps in two different variants of the function). The multi-col variant could call the non-multi-col variant for the final weighted sum.

Good call - done.


pkg/sql/opt/memo/statistics_builder.go, line 3217 at r4 (raw file):

			if multiColNullCount < inputStats.RowCount {
				// Subtract the expected chance of collisions with nulls already collected.
				multiColNullCount += colStat.NullCount * (1 - multiColNullCount/inputStats.RowCount)

Note -- I realized this calculation was wrong so I fixed it (it was using the old definition of multi-column null count)


pkg/sql/opt/memo/statistics_builder.go, line 3233 at r4 (raw file):

Previously, RaduBerinde wrote…

Curious if you saw any issues that were fixed by this check.

I keep wondering if things would be simpler if we made all distinct counts always be >= 1.

Yea, but I mostly saw it for distinct counts = 1 (I'm using <= 1 here just to be thorough -- I don't think fractional distinct counts are the culprit here).

I forget the exact test case that it fixed, but consider an example like x=1 AND y=1 AND z IN (1,2,3,4,5). So distinct_count(x)=1, distinct_count(y)=1, and distinct_count(z)=5. Suppose that there are 1000 rows in the input, and all three columns initially have 100 distinct values. Using equation 2, selectivity({x,y})=1/1000, while selectivity({x,y,z})=5/1000.

I've rephrased this comment to hopefully make it a little bit clearer, but let me know if you think it needs more (or if I should include this example in the comment). I also moved it further down in the function to improve the logical flow.
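The arithmetic in the example above can be checked with a short sketch; it assumes (as an illustration, not a claim about the actual code) that the old multi-column distinct count is capped at the input row count:

```go
package main

import "fmt"

// eq2Selectivity computes new_distinct(S) / old_distinct(S), where the
// old multi-column distinct count is capped at the row count. With
// single-value and IN constraints, new_distinct(S) is the product of the
// per-column new distinct counts.
func eq2Selectivity(newDistincts []float64, oldProduct, rowCount float64) float64 {
	newD := 1.0
	for _, d := range newDistincts {
		newD *= d
	}
	oldD := oldProduct
	if oldD > rowCount {
		oldD = rowCount
	}
	return newD / oldD
}

func main() {
	// x = 1 AND y = 1 AND z IN (1,2,3,4,5); 1000 rows, and each column
	// initially has 100 distinct values. Without the consistency check,
	// the three-column estimate (5/1000) exceeds the two-column one
	// (1/1000), even though adding a conjunct can only filter more rows.
	fmt.Println(eq2Selectivity([]float64{1, 1}, 100*100, 1000))        // selectivity({x,y})
	fmt.Println(eq2Selectivity([]float64{1, 1, 5}, 100*100*100, 1000)) // selectivity({x,y,z})
}
```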


pkg/sql/opt/memo/testdata/logprops/scan, line 119 at r4 (raw file):

Previously, RaduBerinde wrote…

I think this was meant to be a scan with all the given constraints, maybe force the index with a hint

Done.


pkg/sql/opt/xform/testdata/external/trading, line 837 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

This query plan is significantly worse, because we are doing a table scan rather than just fetching the top 50 rows after the paging key. I don't think the stats/costing are at fault here, though. This is another example where we're not doing a good enough job pushing down limit hints (or even hard limits, if the optimizer gets smart enough). Go ahead and add a "Problem" comment above to this effect.

Done.


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

How come this estimate jumped up so much higher than it was? Before, it was ~12K. Now it's ~1M. The date constraints only select ~1 week worth of transactions, out of ~10 years of transactions. That means we're selecting only about ~0.1% of the rows in the table. I'd expect the estimate to be on the order of 20K rows, since there are 20M rows in the table. We were right on the money with our old estimate, but the new estimate is almost 2 orders of magnitude off.

We were just lucky before. Since there's no histogram on the date column, and the data type is timestamptz (not date), we don't know how many distinct values of date are selected by the date range. Therefore we assume the selectivity is 1/9. The filters on dealerid and isbuy are also not especially selective, so the multi-column selectivity estimate is about 1/17.

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

We were just lucky before. Since there's no histogram on the date column, and the data type is timestamptz (not date), we don't know how many distinct values of date are selected by the date range. Therefore we assume the selectivity is 1/9. The filters on dealerid and isbuy are also not especially selective, so the multi-column selectivity estimate is about 1/17.

I didn't realize we don't have histograms on date types. How come we don't?

@rytaft
Collaborator Author

rytaft commented May 21, 2020


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I didn't realize we don't have histograms on date types. How come we don't?

We currently only collect histograms on the first column of each index (as well as all boolean columns).

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

We currently only collect histograms on the first column of each index (as well as all boolean columns).

Ah, so we can collect on date columns, we just don't in this case b/c it's not the first column in the index? Seems with this new multi-stats support, we should start collecting on all column prefixes in an index (i.e. multi-column histograms). How much work would that be? What are the downsides? Would it even fix this particular case if we did? I think we've talked about this before, but I can't remember the answers.

@rytaft
Collaborator Author

rytaft commented May 21, 2020


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Ah, so we can collect on date columns, we just don't in this case b/c it's not the first column in the index? Seems with this new multi-stats support, we should start collecting on all columns in an index. How much work would that be? What are the downsides? Would it even fix this particular case?

Yea, we could do that. I don't think it would be much work at all -- we'd just need to update like one line of code in sql/create_stats.go.

The potential downsides are higher memory utilization and computation during stats collection, and additional computation during stats calculation in the optimizer.

It does look like it would fix this particular case. I tested adding a simple histogram for the date column with a 10-year range and it updated the estimate to ~35,000 rows for the constrained scan of transactions and returned to the plan we were seeing before.

This was the histogram I added:

    "histo_col_type": "timestamptz",
    "histo_buckets": [
      {"num_eq": 0, "num_range": 0, "distinct_range": 0, "upper_bound": "2010-03-01 00:00:00+00:00"},
      {"num_eq": 0, "num_range": 20000000, "distinct_range": 20000000, "upper_bound": "2020-03-01 00:00:00+00:00"}
    ]

This also makes me think that maybe we should add simple histograms like this (just 2 buckets) for all columns in the table (in addition to the histograms on index columns with 200 buckets). I thought we had an issue open for that at some point...

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Yea, we could do that. I don't think it would be much work at all -- we'd just need to update like one line of code in sql/create_stats.go.

The potential downsides are higher memory utilization and computation during stats collection, and additional computation during stats calculation in the optimizer.

It does look like it would fix this particular case. I tested adding a simple histogram for the date column with a 10-year range and it updated the estimate to ~35,000 rows for the constrained scan of transactions and returned to the plan we were seeing before.

This was the histogram I added:

    "histo_col_type": "timestamptz",
    "histo_buckets": [
      {"num_eq": 0, "num_range": 0, "distinct_range": 0, "upper_bound": "2010-03-01 00:00:00+00:00"},
      {"num_eq": 0, "num_range": 20000000, "distinct_range": 20000000, "upper_bound": "2020-03-01 00:00:00+00:00"}
    ]

This also makes me think that maybe we should add simple histograms like this (just 2 buckets) for all columns in the table (in addition to the histograms on index columns with 200 buckets). I thought we had an issue open for that at some point...

I like the idea of two bucket histograms. Can you file an issue, and also update the comment for this test pointing out that the current plan is problematic and how we'd need to fix it?

How difficult would multi-column histograms on all index prefixes be?

@rytaft
Collaborator Author

rytaft commented May 21, 2020


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I like the idea of two bucket histograms. Can you file an issue, and also update the comment for this test pointing out that the current plan is problematic and how we'd need to fix it?

How difficult would multi-column histograms on all index prefixes be?

Filed #49374, and added to the comment for this query.

Multi-column histograms would be a bit more work since we currently assume histograms only involve a single column. We'd need to make non-trivial changes to the stats collection code, as well as the histogram filtering code in the optimizer. But I do think it's worth considering, since it could considerably improve the accuracy of our estimates.

rytaft added 4 commits May 29, 2020 09:52
Collaborator Author

@rytaft rytaft left a comment


Gentle ping -- let me know if you see any problems with this approach for the last commit.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, and @RaduBerinde)


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Filed #49374, and added to the comment for this query.

Multi-column histograms would be a bit more work since we currently assume histograms only involve a single column. We'd need to make non-trivial changes to the stats collection code, as well as the histogram filtering code in the optimizer. But I do think it's worth considering, since it could considerably improve the accuracy of our estimates.

Filed #49698 to track the work for multi-column histograms.

Member

@RaduBerinde RaduBerinde left a comment


:lgtm: Sorry for the delay. The updates look great!

I will make another pass over the entire change when I get a chance, but you don't have to wait for that to merge. If I have any major questions I'll bring them up directly.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, and @RaduBerinde)

@rytaft
Collaborator Author

rytaft commented May 29, 2020

TFTR!!

bors r+

@craig
Contributor

craig bot commented May 29, 2020

Build succeeded

@craig craig bot merged commit 6dd95c2 into cockroachdb:master May 29, 2020


Development

Successfully merging this pull request may close these issues.

opt: add support for multi-column statistics

4 participants