
opt: add support for calculating selectivity from multi-column stats #49134

Merged
craig[bot] merged 4 commits into cockroachdb:master from rytaft:multi-col3
May 29, 2020

Conversation

@rytaft
Collaborator

@rytaft rytaft commented May 15, 2020

opt: fix bug with histograms and multi-span index constraints

Prior to this commit, filtering a histogram with a multi-span index
constraint could lead to incorrect results if the histogram column
was part of the exact prefix of the constraint. This was due to the
fact that the same value appeared in multiple spans, breaking the
assumption of the histogram code that values in spans were always
ordered and non-overlapping. For example, filtering a histogram on
column b with the constraint /b/c/: [/1/2 - /1/2] [/1/4 - /1/4]
would return an incorrect result, because the values in the matching
histogram bucket would be counted twice.

This commit fixes the problem by only considering a single span if
the column is part of the exact prefix.

Release note (performance improvement): Fixed a bug in the
histogram logic in the optimizer which was causing an over-estimate
for the cardinality of constrained index scans in some cases when
multiple columns of the index were constrained. This problem was
introduced early in the development of the 20.2 release, so it should
never have been part of a released version. The fix improves the optimizer's
cardinality estimates and so may result in better query plan selection.
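The double-counting can be illustrated with a toy model (the types, names, and bucket contents here are invented for illustration and do not reflect the real histogram code):

```go
package main

import "fmt"

// span constrains the histogram column b to a single value; both spans of
// the example constraint /b/c/: [/1/2 - /1/2] [/1/4 - /1/4] have b = 1.
type span struct{ bVal int }

// bucketCount is a toy histogram for column b: value -> row count.
var bucketCount = map[int]float64{1: 500, 2: 300}

// buggyEstimate sums the matching bucket once per span. Because both spans
// share b = 1 in their exact prefix, the bucket is counted twice.
func buggyEstimate(spans []span) float64 {
	var rows float64
	for _, s := range spans {
		rows += bucketCount[s.bVal]
	}
	return rows
}

// fixedEstimate mirrors the fix: when the histogram column is part of the
// exact prefix, every span has the same value, so only one span is used.
func fixedEstimate(spans []span) float64 {
	return bucketCount[spans[0].bVal]
}

func main() {
	spans := []span{{bVal: 1}, {bVal: 1}} // /1/2 and /1/4 both have b = 1
	fmt.Println(buggyEstimate(spans))     // 1000: bucket counted twice
	fmt.Println(fixedEstimate(spans))     // 500
}
```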

opt: fix distinct count estimates for index constraint columns

Prior to this commit, the statistics_builder was incorrectly estimating
the distinct count of columns that were only slightly constrained as part
of an index constraint. For example, it was estimating based on
constraints such as /a/b: [/1 - /5/6] or /a/b: [ - /5/6] that the distinct
count of column b should be reduced by 2/3. However, in reality, we cannot
assume anything about the distinct count of column b based on those two
constraints.

This commit fixes the estimate by only reducing the distinct count for
columns that are part of the prefix of the constraint (columns for which
all the spans have the same start and end values) or the first column after.

Release note (performance improvement): Fixed the optimizer's distinct count
estimate for columns constrained by an index constraint, which was too low
in some cases. The fix improves the optimizer's cardinality estimates, which
can lead to better query plan selection.
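A sketch of the prefix rule, under the simplifying assumption that spans are tuples of integer keys (types and names are illustrative, not from the actual constraint code):

```go
package main

import "fmt"

// span holds the start and end keys of one index constraint span, one
// value per indexed column (a simplified model of the real representation).
type span struct{ start, end []int }

// reducibleCols returns how many leading columns may have their distinct
// counts reduced: the exact-prefix columns (start value equals end value in
// every span) plus at most one more column, mirroring the fix above.
func reducibleCols(spans []span, numCols int) int {
	prefix := 0
	for col := 0; col < numCols; col++ {
		exact := true
		for _, s := range spans {
			if col >= len(s.start) || col >= len(s.end) || s.start[col] != s.end[col] {
				exact = false
				break
			}
		}
		if !exact {
			break
		}
		prefix++
	}
	if prefix < numCols {
		return prefix + 1 // the first column after the prefix may also be reduced
	}
	return prefix
}

func main() {
	// /a/b: [/1 - /5/6] -> no exact prefix; only column a may be reduced.
	fmt.Println(reducibleCols([]span{{start: []int{1}, end: []int{5, 6}}}, 2)) // 1
	// /b/c/: [/1/2 - /1/2] [/1/4 - /1/4] -> both columns are exact; both reducible.
	fmt.Println(reducibleCols([]span{
		{start: []int{1, 2}, end: []int{1, 2}},
		{start: []int{1, 4}, end: []int{1, 4}},
	}, 2)) // 2
}
```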

opt: ensure multi-col stats are consistent with single-col stats

Prior to this commit, it was possible that the estimated distinct count
of a multi-column statistic was larger than the product of the estimated
distinct counts of all of its individual columns. This could happen because
we reduce the distinct count of columns that are constrained by a predicate,
but we don't update the distinct counts of any multi-column stats that
contain that column. This is internally inconsistent and could lead to bad
cardinality estimates.

This commit fixes the problem by explicitly checking for such
inconsistencies and fixing them after the stats for each operator are
built or a column stat is calculated. If it is found that the estimated
distinct count is too high, it is reset to the product of the distinct
counts of its constituent columns.

This commit also adds a check that the distinct count of the multi-column
statistic is no smaller than the largest distinct count of its constituent
columns (although I haven't yet found a test case where this was a problem).

Release note (performance improvement): Fixed the optimizer's estimated
distinct count for a multi-column statistic when all of the columns in the
statistic are constrained by a filter predicate. The fix can lead to
improved cardinality estimates, leading to better query plan selection
in some cases.
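The two consistency checks can be sketched as a simple clamp (function and names are illustrative, not the actual statistics_builder code):

```go
package main

import "fmt"

// clampMultiColDistinct enforces the consistency checks described above:
// a multi-column distinct count can be no larger than the product of its
// constituent columns' distinct counts and no smaller than the largest of
// them.
func clampMultiColDistinct(multiCol float64, cols []float64) float64 {
	product, maxCol := 1.0, 0.0
	for _, d := range cols {
		product *= d
		if d > maxCol {
			maxCol = d
		}
	}
	if multiCol > product {
		multiCol = product // reset to product of constituent distinct counts
	}
	if multiCol < maxCol {
		multiCol = maxCol // no smaller than the largest constituent count
	}
	return multiCol
}

func main() {
	// Columns were reduced to 2 and 3 distinct values, but a stale
	// multi-column count still says 50; it is reset to 2 * 3 = 6.
	fmt.Println(clampMultiColDistinct(50, []float64{2, 3})) // 6
	fmt.Println(clampMultiColDistinct(4, []float64{2, 3}))  // 4: already consistent
}
```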

opt: add support for calculating selectivity from multi-column stats

This commit adds support for calculating selectivity from multi-column
statistics. It changes selectivityFromDistinctCounts to have the following
semantics:

selectivityFromDistinctCounts calculates the selectivity of a filter
by using estimated distinct counts of each constrained column before
and after the filter was applied. We can perform this calculation in
two different ways: (1) by treating the columns as completely independent,
or (2) by assuming they are correlated.

(1) Assuming independence between columns, we can calculate the selectivity
by taking the product of selectivities of each constrained column. In
the general case, this can be represented by the formula:

                     ┬-┬ ⎛ new_distinct(i) ⎞
      selectivity =  │ │ ⎜ --------------- ⎟
                     ┴ ┴ ⎝ old_distinct(i) ⎠
                    i in
                 {constrained
                   columns}

(2) If useMultiCol is true, we assume there is some correlation between
columns. In this case, we calculate the selectivity using multi-column
statistics.

                    ⎛ new_distinct({constrained columns}) ⎞
      selectivity = ⎜ ----------------------------------- ⎟
                    ⎝ old_distinct({constrained columns}) ⎠
This formula looks simple, but the challenge is that it is difficult
to determine the correct value for new_distinct({constrained columns})
if each column is not constrained to a single value. For example, if
new_distinct(x)=2 and new_distinct(y)=2, new_distinct({x,y}) could be 2,
3 or 4. We estimate the new distinct count as follows, using the concept
of "soft functional dependency (FD) strength":
      new_distinct({x,y}) = min_value + range * (1 - FD_strength_scaled)

    where

      min_value = max(new_distinct(x), new_distinct(y))
      max_value = new_distinct(x) * new_distinct(y)
      range     = max_value - min_value

                    ⎛ max(old_distinct(x),old_distinct(y)) ⎞
      FD_strength = ⎜ ------------------------------------ ⎟
                    ⎝         old_distinct({x,y})          ⎠

                        ⎛ max(old_distinct(x), old_distinct(y)) ⎞
      min_FD_strength = ⎜ ------------------------------------- ⎟
                        ⎝   old_distinct(x) * old_distinct(y)   ⎠

                           ⎛ FD_strength - min_FD_strength ⎞
      FD_strength_scaled = ⎜ ----------------------------- ⎟
                           ⎝      1 - min_FD_strength      ⎠
Suppose that old_distinct(x)=100 and old_distinct(y)=10. If x and y are
perfectly correlated, old_distinct({x,y})=100. Using the example from
above, new_distinct(x)=2 and new_distinct(y)=2. Plugging in the values
into the equation, we get:
      FD_strength_scaled  = 1
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 1) = 2
If x and y are completely independent, however, old_distinct({x,y})=1000.
In this case, we get:
      FD_strength_scaled  = 0
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 0) = 4
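The formula above can be transcribed directly; this sketch reproduces both worked examples (names are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// estimateMultiColDistinct estimates new_distinct({x,y}) using the soft
// functional dependency (FD) strength formula described in the text.
func estimateMultiColDistinct(newX, newY, oldX, oldY, oldXY float64) float64 {
	minVal := math.Max(newX, newY)
	maxVal := newX * newY
	rng := maxVal - minVal

	maxOld := math.Max(oldX, oldY)
	fdStrength := maxOld / oldXY
	minFDStrength := maxOld / (oldX * oldY)
	fdStrengthScaled := (fdStrength - minFDStrength) / (1 - minFDStrength)

	return minVal + rng*(1-fdStrengthScaled)
}

func main() {
	// Perfectly correlated: old_distinct({x,y}) = 100.
	fmt.Println(estimateMultiColDistinct(2, 2, 100, 10, 100)) // 2
	// Completely independent: old_distinct({x,y}) = 1000.
	fmt.Println(estimateMultiColDistinct(2, 2, 100, 10, 1000)) // 4
}
```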

Note that even if useMultiCol is true and we calculate the selectivity
based on equation (2) above, we still want to take equation (1) into
account. This is because it is possible that there are two predicates that
each have selectivity s, but the multi-column selectivity is also s. In
order to ensure that the cost model considers the two predicates combined
to be more selective than either one individually, we must give some weight
to equation (1). Therefore, instead of equation (2) we actually return the
following selectivity:

  selectivity = (1 - w) * (eq. 1) + w * (eq. 2)

where w is currently set to 0.9.

This selectivity is then used to update the row count and the distinct
counts of the unconstrained columns.
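A minimal sketch of the weighted combination, assuming w = 0.9 as stated; it shows that the blend keeps the conjunction more selective than either predicate alone even when equation (2) by itself would not:

```go
package main

import "fmt"

// blendSelectivity combines the independence-based selectivity (eq. 1,
// the product of per-column selectivities) with the multi-column
// selectivity (eq. 2) using weight w = 0.9.
func blendSelectivity(singleCol []float64, multiCol float64) float64 {
	const w = 0.9
	eq1 := 1.0
	for _, s := range singleCol {
		eq1 *= s
	}
	return (1-w)*eq1 + w*multiCol
}

func main() {
	// Two predicates, each with selectivity 0.1, whose multi-column
	// selectivity is also 0.1. The blend is ~0.091, which is below 0.1,
	// so the cost model still treats the conjunction as more selective
	// than either predicate individually.
	fmt.Println(blendSelectivity([]float64{0.1, 0.1}, 0.1))
}
```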

Fixes #34422

Release note (performance improvement): Added support for calculating the
selectivity of filter predicates in the optimizer using multi-column
statistics. This improves the cardinality estimates of the optimizer when
a query has filter predicates constraining multiple columns. As a result,
the optimizer may choose a better query plan in some cases.

@rytaft rytaft requested a review from a team as a code owner May 15, 2020 20:33
@cockroach-teamcity
Member

This change is Reviewable

Member

@RaduBerinde RaduBerinde left a comment


This must have been a lot of work!

First three commits LGTM. Still looking through the last one.

Should we add multi-column stats to the tpcc and tpch stats in testfixtures with this change? That way we would see what the impact on the plans is.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go, line 3115 at r4 (raw file):

//                            ⎛ FD_strength - min_FD_strength ⎞  // scales FD_strength
//       FD_strength_scaled = ⎜ ----------------------------- ⎟  // to be between
//                            ⎝      1 - min_FD_strength      ⎠  // 0 and 1

This is very cool! Is it from somewhere? Would be nice to include a reference


pkg/sql/opt/memo/statistics_builder.go, line 3140 at r4 (raw file):

// following selectivity:
//
//   selectivity = (1 - w) * (eq. 1) + w * (eq. 2)

[nit] use equation, it's more clear when you're reading just that line (it stands out)


pkg/sql/opt/memo/statistics_builder.go, line 3177 at r4 (raw file):

	}

	var multiColSet opt.ColSet

This code is pretty hard to follow. For example, it wasn't clear without carefully looking at all the code that if useMultiCol is false, we'll always return in the if multiColSet.Len() <= 1 check.

I think it would be easier to understand if we completely separated the two paths (perhaps in two different variants of the function). The multi-col variant could call the non-multi-col variant for the final weighted sum.


pkg/sql/opt/memo/statistics_builder.go, line 3233 at r4 (raw file):

	// comment above the function definition for details about the formula.

	// When there are 3 or more columns in the set and at least one has

Curious if you saw any issues that were fixed by this check.

I keep wondering if things would be simpler if we made all distinct counts always be >= 1.


pkg/sql/opt/memo/testdata/logprops/scan, line 119 at r4 (raw file):

SELECT * FROM t WHERE b IN ('a', 'b') AND c IN (1, 2) AND a IN (2, 3)
----
select

I think this was meant to be a scan with all the given constraints, maybe force the index with a hint


pkg/sql/opt/xform/testdata/external/trading, line 924 at r4 (raw file):

ORDER BY extract(day from d.TransactionDate)
----
sort

@andy-kimball would be good if you could take a look over the changes in this file since you already understand these queries very well, and see if there are any concerning plan changes

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, and @rytaft)


pkg/sql/opt/xform/testdata/external/trading, line 837 at r4 (raw file):

 │    │         │    │    ├── key: (8)
 │    │         │    │    ├── fd: ()-->(7), (1)-->(2-6), (2,4,5)~~>(1,3,6), (8)-->(9-15), (15)-->(8-14), (1)==(8), (8)==(1)
 │    │         │    │    ├── select

This query plan is significantly worse, because we are doing a table scan rather than just fetching the top 50 rows after the paging key. I don't think the stats/costing are at fault here, though. This is another example where we're not doing a good enough job pushing down limit hints (or even hard limits, if the optimizer gets smart enough). Go ahead and add a "Problem" comment above to this effect.


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

      │    │    │    │    │    ├── columns: transactions.dealerid:9!null transactions.isbuy:10!null date:11!null accountname:12!null customername:13!null
      │    │    │    │    │    ├── constraint: /9/10/11: [/1/false/'2020-02-23 00:00:00+00:00' - /1/false/'2020-03-01 00:00:00+00:00']
      │    │    │    │    │    ├── stats: [rows=1181111.11, distinct(9)=1, null(9)=0, distinct(10)=1, null(10)=0, distinct(11)=1181111.11, null(11)=0, distinct(9,10)=1, null(9,10)=0, distinct(9-11)=1181111.11, null(9-11)=0]

How come this estimate jumped up so much higher than it was? Before, it was ~12K. Now it's ~1M. The date constraints only select ~1 week worth of transactions, out of ~10 years of transactions. That means we're selecting only about ~0.1% of the rows in the table. I'd expect the estimate to be on the order of 20K rows, since there are 20M rows in the table. We were right on the money with our old estimate, but the new estimate is almost 2 orders of magnitude off.

Collaborator Author

@rytaft rytaft left a comment


TFTR!

Should we add multi-column stats to the tpcc and tpch stats in testfixtures with this change? That way we would see what the impact on the plans is.

I added those already in a different PR (#48374). No plans changed and no estimates got significantly worse (or better).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, and @RaduBerinde)


pkg/sql/opt/memo/statistics_builder.go, line 3115 at r4 (raw file):

Previously, RaduBerinde wrote…

This is very cool! Is it from somewhere? Would be nice to include a reference

Yea, the FD strength formula is from a paper I found, although they use it for a different purpose. Anyway, I agree it's good to reference it -- added.


pkg/sql/opt/memo/statistics_builder.go, line 3140 at r4 (raw file):

Previously, RaduBerinde wrote…

[nit] use equation, it's more clear when you're reading just that line (it stands out)

Done.


pkg/sql/opt/memo/statistics_builder.go, line 3177 at r4 (raw file):

Previously, RaduBerinde wrote…

This code is pretty hard to follow. For example, it wasn't clear without carefully looking at all the code that if useMultiCol is false, we'll always return in the if multiColSet.Len() <= 1 check.

I think it would be easier to understand if we completely separated the two paths (perhaps in two different variants of the function). The multi-col variant could call the non-multi-col variant for the final weighted sum.

Good call - done.


pkg/sql/opt/memo/statistics_builder.go, line 3217 at r4 (raw file):

			if multiColNullCount < inputStats.RowCount {
				// Subtract the expected chance of collisions with nulls already collected.
				multiColNullCount += colStat.NullCount * (1 - multiColNullCount/inputStats.RowCount)

Note -- I realized this calculation was wrong so I fixed it (it was using the old definition of multi-column null count)


pkg/sql/opt/memo/statistics_builder.go, line 3233 at r4 (raw file):

Previously, RaduBerinde wrote…

Curious if you saw any issues that were fixed by this check.

I keep wondering if things would be simpler if we made all distinct counts always be >= 1.

Yea, but I mostly saw it for distinct counts = 1 (I'm using <= 1 here just to be thorough -- I don't think fractional distinct counts are the culprit here).

I forget the exact test case that it fixed, but consider an example like x=1 AND y=1 AND z IN (1,2,3,4,5). So distinct_count(x)=1, distinct_count(y)=1, and distinct_count(z)=5. Suppose that there are 1000 rows in the input, and all three columns initially have 100 distinct values. Using equation 2, selectivity({x,y})=1/1000, while selectivity({x,y,z})=5/1000.

I've rephrased this comment to hopefully make it a little bit clearer, but let me know if you think it needs more (or if I should include this example in the comment). I also moved it further down in the function to improve the logical flow.
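The arithmetic in the example above can be checked with a short sketch; it assumes (as an illustration, not a claim about the actual code) that the old multi-column distinct count is capped at the input row count:

```go
package main

import "fmt"

// eq2Selectivity computes new_distinct(S) / old_distinct(S), where the
// old multi-column distinct count is capped at the row count. With
// single-value and IN constraints, new_distinct(S) is the product of the
// per-column new distinct counts.
func eq2Selectivity(newDistincts []float64, oldProduct, rowCount float64) float64 {
	newD := 1.0
	for _, d := range newDistincts {
		newD *= d
	}
	oldD := oldProduct
	if oldD > rowCount {
		oldD = rowCount
	}
	return newD / oldD
}

func main() {
	// x = 1 AND y = 1 AND z IN (1,2,3,4,5); 1000 rows, and each column
	// initially has 100 distinct values. Without the consistency check,
	// the three-column estimate (5/1000) exceeds the two-column one
	// (1/1000), even though adding a conjunct can only filter more rows.
	fmt.Println(eq2Selectivity([]float64{1, 1}, 100*100, 1000))        // selectivity({x,y})
	fmt.Println(eq2Selectivity([]float64{1, 1, 5}, 100*100*100, 1000)) // selectivity({x,y,z})
}
```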


pkg/sql/opt/memo/testdata/logprops/scan, line 119 at r4 (raw file):

Previously, RaduBerinde wrote…

I think this was meant to be a scan with all the given constraints, maybe force the index with a hint

Done.


pkg/sql/opt/xform/testdata/external/trading, line 837 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

This query plan is significantly worse, because we are doing a table scan rather than just fetching the top 50 rows after the paging key. I don't think the stats/costing are at fault here, though. This is another example where we're not doing a good enough job pushing down limit hints (or even hard limits, if the optimizer gets smart enough). Go ahead and add a "Problem" comment above to this effect.

Done.


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

How come this estimate jumped up so much higher than it was? Before, it was ~12K. Now it's ~1M. The date constraints only select ~1 week worth of transactions, out of ~10 years of transactions. That means we're selecting only about ~0.1% of the rows in the table. I'd expect the estimate to be on the order of 20K rows, since there are 20M rows in the table. We were right on the money with our old estimate, but the new estimate is almost 2 orders of magnitude off.

We were just lucky before. Since there's no histogram on the date column, and the data type is timestamptz (not date), we don't know how many distinct values of date are selected by the date range. Therefore we assume the selectivity is 1/9. The filters on dealerid and isbuy are also not especially selective, so the multi-column selectivity estimate is about 1/17.

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

We were just lucky before. Since there's no histogram on the date column, and the data type is timestamptz (not date), we don't know how many distinct values of date are selected by the date range. Therefore we assume the selectivity is 1/9. The filters on dealerid and isbuy are also not especially selective, so the multi-column selectivity estimate is about 1/17.

I didn't realize we don't have histograms on date types. How come we don't?

@rytaft
Collaborator Author

rytaft commented May 21, 2020


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I didn't realize we don't have histograms on date types. How come we don't?

We currently only collect histograms on the first column of each index (as well as all boolean columns).

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

We currently only collect histograms on the first column of each index (as well as all boolean columns).

Ah, so we can collect on date columns, we just don't in this case b/c it's not the first column in the index? Seems with this new multi-stats support, we should start collecting on all column prefixes in an index (i.e. multi-column histograms). How much work would that be? What are the downsides? Would it even fix this particular case if we did? I think we've talked about this before, but I can't remember the answers.

@rytaft
Collaborator Author

rytaft commented May 21, 2020


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Ah, so we can collect on date columns, we just don't in this case b/c it's not the first column in the index? Seems with this new multi-stats support, we should start collecting on all columns in an index. How much work would that be? What are the downsides? Would it even fix this particular case?

Yea, we could do that. I don't think it would be much work at all -- we'd just need to update like one line of code in sql/create_stats.go.

The potential downsides are higher memory utilization and computation during stats collection, and additional computation during stats calculation in the optimizer.

It does look like it would fix this particular case. I tested adding a simple histogram for the date column with a 10-year range and it updated the estimate to ~35,000 rows for the constrained scan of transactions and returned to the plan we were seeing before.

This was the histogram I added:

    "histo_col_type": "timestamptz",
    "histo_buckets": [
      {"num_eq": 0, "num_range": 0, "distinct_range": 0, "upper_bound": "2010-03-01 00:00:00+00:00"},
      {"num_eq": 0, "num_range": 20000000, "distinct_range": 20000000, "upper_bound": "2020-03-01 00:00:00+00:00"}
    ]

This also makes me think that maybe we should add simple histograms like this (just 2 buckets) for all columns in the table (in addition to the histograms on index columns with 200 buckets). I thought we had an issue open for that at some point...

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Yea, we could do that. I don't think it would be much work at all -- we'd just need to update like one line of code in sql/create_stats.go.

The potential downsides are higher memory utilization and computation during stats collection, and additional computation during stats calculation in the optimizer.

It does look like it would fix this particular case. I tested adding a simple histogram for the date column with a 10-year range and it updated the estimate to ~35,000 rows for the constrained scan of transactions and returned to the plan we were seeing before.

This was the histogram I added:

    "histo_col_type": "timestamptz",
    "histo_buckets": [
      {"num_eq": 0, "num_range": 0, "distinct_range": 0, "upper_bound": "2010-03-01 00:00:00+00:00"},
      {"num_eq": 0, "num_range": 20000000, "distinct_range": 20000000, "upper_bound": "2020-03-01 00:00:00+00:00"}
    ]

This also makes me think that maybe we should add simple histograms like this (just 2 buckets) for all columns in the table (in addition to the histograms on index columns with 200 buckets). I thought we had an issue open for that at some point...

I like the idea of two bucket histograms. Can you file an issue, and also update the comment for this test pointing out that the current plan is problematic and how we'd need to fix it?

How difficult would multi-column histograms on all index prefixes be?

@rytaft
Collaborator Author

rytaft commented May 21, 2020


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I like the idea of two bucket histograms. Can you file an issue, and also update the comment for this test pointing out that the current plan is problematic and how we'd need to fix it?

How difficult would multi-column histograms on all index prefixes be?

Filed #49374, and added to the comment for this query.

Multi-column histograms would be a bit more work since we currently assume histograms only involve a single column. We'd need to make non-trivial changes to the stats collection code, as well as the histogram filtering code in the optimizer. But I do think it's worth considering, since it could considerably improve the accuracy of our estimates.

rytaft added 4 commits May 29, 2020 09:52
Collaborator Author

@rytaft rytaft left a comment


Gentle ping -- let me know if you see any problems with this approach for the last commit.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, and @RaduBerinde)


pkg/sql/opt/xform/testdata/external/trading, line 963 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Filed #49374, and added to the comment for this query.

Multi-column histograms would be a bit more work since we currently assume histograms only involve a single column. We'd need to make non-trivial changes to the stats collection code, as well as the histogram filtering code in the optimizer. But I do think it's worth considering, since it could considerably improve the accuracy of our estimates.

Filed #49698 to track the work for multi-column histograms.

Member

@RaduBerinde RaduBerinde left a comment


:lgtm: Sorry for the delay. The updates look great!

I will make another pass over the entire change when I get a chance, but you don't have to wait for that to merge. If I have any major questions I'll bring them up directly.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball, @mgartner, @mjibson, and @RaduBerinde)

@rytaft
Collaborator Author

rytaft commented May 29, 2020

TFTR!!

bors r+

@craig
Contributor

craig bot commented May 29, 2020

Build succeeded

@craig craig bot merged commit 6dd95c2 into cockroachdb:master May 29, 2020


Development

Successfully merging this pull request may close these issues.

opt: add support for multi-column statistics

4 participants