opt: permit coexistance of inv, forward histograms by jordanlewis · Pull Request #84927 · cockroachdb/cockroach

jordanlewis · 2022-07-22T20:24:27Z

Previously, when trying to use persisted histograms to filter an
inverted column, if the histogram read from disk was a "forward
histogram", the operation would fail with an internal error.

This is because inverted histograms and forward histograms use different
datatypes on disk, and it's not possible to compare a forward histogram
(represented as the datatype of the column) with the inverted constraint
(represented as DBytes).

Now, this problem is corrected by making sure that we only store forward
histogram data in forward histogram column sets, and likewise for
inverted histogram data being only stored in inverted histogram column
sets.

Release note: None

cockroach-teamcity · 2022-07-22T20:24:35Z

This change is

jordanlewis · 2022-07-22T20:26:04Z

Note that this is a slightly more precise fix than the one I proposed in #84529. It will only ignore the histogram if the histogram is of the wrong type: in the steady state, the trigram inverted filter will successfully use the trigram inverted histogram when one exists.

I do have a concern: will the reverse problem exist? In other words, once we build an inverted histogram on the trigram-indexed column, will a forward scan use the wrong histogram too?

I couldn't reproduce any problem of this nature, but it surprises me - I'd expect that something would be going wrong of that nature.

mgartner

LGTM! Thanks for fixing this!

mgartner · 2022-07-22T20:26:37Z

pkg/sql/logictest/testdata/logic_test/trigram_indexes

+statement ok
+CREATE TABLE b (a) AS SELECT 'foo';
+ANALYZE b;
+CREATE INVERTED INDEX ON b(a gin_trgm_ops)


Don't you need to analyze after the index is created? The inverted column doesn't exist until the index is created.

The bug was explicitly in this state, where there is an inverted index but a forward histogram on the column that's being inverted indexed. Once you re-analyze, there is no bug. I've added a test case to ensure that both scenarios work okay.

mgartner · 2022-07-22T20:28:48Z

pkg/sql/opt/memo/statistics_builder.go

+					// wouldn't properly reject the histogram. The proper fix, tracked
+					// in #50655, is to add a type field onto the persisted histogram.
+					if inputHist.BucketCount() == 0 ||
+						inputHist.Bucket(0).UpperBound.ResolvedType().Family() == types.BytesFamily {


Internally the type we use for inverted columns is types.EncodedKey:

cockroach/pkg/sql/types/types.go

Lines 479 to 485 in 10a054b

EncodedKey = &T{

InternalType: InternalType{

Family: EncodedKeyFamily,

Oid: oid.T_unknown,

Locale: &emptyLocale,

},

}

But I guess we write the type bytes to the persisted histogram?

Yeah, we do - there is some commentary about it where we write the persisted inverted histograms.

mgartner · 2022-07-22T20:31:44Z

pkg/sql/opt/memo/statistics_builder.go

+					// looking at a forward histogram, since inverted histograms always
+					// are of type DBytes. We cannot use the histogram if this is the
+					// case.
+					// Note that this is a sketchy situation (no pun intended), because


Not only is it sketchy, it means that we won't have accurate row count estimates for queries on trigram indexes. As a result, the optimizer probably won't pick the best plan in a number of cases.

Yeah :( And even worse, isn't it actually more like we won't have accurate row count estimates for queries against text columns that happen to also be inverted indexed? I think that's the thing I'm most scared of here...

Yeah :( And even worse, isn't it actually more like we won't have accurate row count estimates for queries against text columns that happen to also be inverted indexed? I think that's the thing I'm most scared of here...

Correct, that's what I meant to say.

jordanlewis · 2022-07-22T21:08:06Z

Hmm, a bunch of tests failed... I'm going to investigate, but I wonder if there is something more fishy underfoot here.

rytaft

Thanks for fixing this!

Reviewed 1 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @mgartner)

pkg/sql/opt/memo/statistics_builder.go line 871 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Yeah :( And even worse, isn't it actually more like we won't have accurate row count estimates for queries against text columns that happen to also be inverted indexed? I think that's the thing I'm most scared of here...

Probably worth adding a stats-specific test for this. Can you add some tests in opt/memo/testdata/stats/scan similar to the tests you added in logic /execbuilder tests?

We can probably fix this issue by adding a check before line 865 or saving the previous value of s.RowCount.

pkg/sql/opt/memo/statistics_builder.go line 884 at r2 (raw file):

							continue
						case types.BytesFamily:
							break LOOP

nit: I'd like to avoid using these GOTO style statements here if possible. I feel like this would be a good candidate for a helper function.

jordanlewis

The execbuilder test I added doesn't have stable results - if I dump the histogram after running ANALYZE, the results differ from run to run. I know the statistics collection uses random sampling, but I guess I figured in logic tests we'd fix the seed. Is the proper way to fix this to inject artificial histograms rather than running ANALYZE? And if so, should all of the test cases that need the histograms really just belong in the stats testdata?

Also, in adding the stats test you suggested, I'm noticing that the stats tests have nice histograms with just a few buckets. How do you generate those nice test histograms? Is there a way to ask the stats system to use a lower bucket count?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)

pkg/sql/opt/memo/statistics_builder.go line 871 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Probably worth adding a stats-specific test for this. Can you add some tests in opt/memo/testdata/stats/scan similar to the tests you added in logic /execbuilder tests?

We can probably fix this issue by adding a check before line 865 or saving the previous value of s.RowCount.

That's true, we should also avoid editing the row count if we're looking at the wrong kind of histogram. I've fixed that.

But isn't there still the problem in the opposite case, where we're trying to use a histogram for an ordinary, non-inverted scan on a text column, and we see a set of inverted statistics? I think we'd want a guard in the opposite direction, in applyIndexConstraint, that makes sure that we aren't using an inverted histogram for a forward scan.

What I don't understand is why this theoretical problem doesn't manifest as a type error, like the original bug did.

rytaft

The execbuilder test I added doesn't have stable results - if I dump the histogram after running ANALYZE, the results differ from run to run. I know the statistics collection uses random sampling, but I guess I figured in logic tests we'd fix the seed. Is the proper way to fix this to inject artificial histograms rather than running ANALYZE? And if so, should all of the test cases that need the histograms really just belong in the stats testdata?

You could use INJECT STATISTICS, but I think it's also fine to get rid of these execbuilder tests altogether.

Also, in adding the stats test you suggested, I'm noticing that the stats tests have nice histograms with just a few buckets. How do you generate those nice test histograms? Is there a way to ask the stats system to use a lower bucket count?

We don't currently have a way to control the number of histogram buckets collected by CREATE STATISTICS/ANALYZE (it's hardcoded as 200 for index columns, and 2 for other columns). Those histograms in the stats tests were just manually constructed.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)

pkg/sql/opt/memo/statistics_builder.go line 871 at r1 (raw file):

That's true, we should also avoid editing the row count if we're looking at the wrong kind of histogram. I've fixed that.

Doesn't look like you've pushed the fix.

But isn't there still the problem in the opposite case, where we're trying to use a histogram for an ordinary, non-inverted scan on a text column, and we see a set of inverted statistics? I think we'd want a guard in the opposite direction, in applyIndexConstraint, that makes sure that we aren't using an inverted histogram for a forward scan.

I think you're right.

What I don't understand is why this theoretical problem doesn't manifest as a type error, like the original bug did.

I'm not sure -- it would be good to add some tests that exercise this.

jordanlewis

This is ready for another look, I redid it like we discussed on Slack. Now, the code carefully segments inverted histograms from forward histograms in the statistics map, so there's no chance for code to accidentally get the wrong type of histogram when building plans.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 871 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Yeah :( And even worse, isn't it actually more like we won't have accurate row count estimates for queries against text columns that happen to also be inverted indexed? I think that's the thing I'm most scared of here...

Correct, that's what I meant to say.

Removed this code.

pkg/sql/opt/memo/testdata/stats/inverted-geo line 774 at r3 (raw file):

      │    │    └── stats: [rows=1, distinct(2)=1, null(2)=0, avgsize(2)=4]
      │    │        histogram(2)=  0                                                                                            4.9994e-05                                                                                            0.99988                                                           7.4991e-05
      │    │                     <--- '0107000020E6100000010000000103000020E610000001000000040000008C7D4198BD2B574080A6FD3A111E4D40F0DF86928AA12BC03A59212197A35140CEA3CEE206B863C0FC7649EB60BA53408C7D4198BD2B574080A6FD3A111E4D40' --------- '0102000020E61000000300000005D8E086BB6365C03F9E5737DD1A53C0C04ECDED673B55C06711C00C7C0240C0B8EABD96072856404A9D2C529FC74EC0'

I think the reason for this change is that the test is buggy.

The injected statistics here should have histo_col_type=BYTES, not GEOGRAPHY, unless they were collected on the ordinary column. But I think we don't collect forward statistics on GEOGRAPHY columns. So I think this test is just wrong, thoughts? The fix would be to change the injected stats to be bytes type.

rytaft

thanks for working through these details!

I thought we had discussed iterating through older statistics to find both forward and inverted histograms? Feel free to open an issue and/or save that work for another PR. I think it's probably worth doing at some point, though (even before we fix #50655), since we can limit the extra computation to only cases where it's useful. For example, we can limit it to cases where we have a string column that also has an inverted index. We will (almost) always collect a forward histogram (it will just be smaller if there is no forward index), so we should try to find both the forward and inverted histograms for these columns.

Reviewed 3 of 5 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

			if needHistogram {
				invertedColOrds = invertedIndexCols[stat.ColumnOrdinal(0)]
				invertedStatistic = len(invertedColOrds) > 0 && !histogramIsNonBytes(stat.Histogram())

nit: can we remove the double-negative and change this to histogramIsBytes?

pkg/sql/opt/memo/statistics_builder.go line 687 at r4 (raw file):

				// to calling stats.ColStats.Add() on any inverted column statistics
				// created above.
				colStat, _ = stats.ColStats.Lookup(cols)

I don't think you need to do this anymore

pkg/sql/opt/memo/testdata/stats/inverted-geo line 774 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think the reason for this change is that the test is buggy.

The injected statistics here should have histo_col_type=BYTES, not GEOGRAPHY, unless they were collected on the ordinary column. But I think we don't collect forward statistics on GEOGRAPHY columns. So I think this test is just wrong, thoughts? The fix would be to change the injected stats to be bytes type.

Your change looks good

jordanlewis

I am probably missing something, but I thought that we were actually already doing that (and just ignoring all but the most recent one because of the way colStatsMap works:

		for i := 0; i < tab.StatisticCount(); i++ {
			stat := tab.Statistic(i)

As I understood, this loop finds every statistic in the table, including historical statistics.

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rytaft)

rytaft

That's correct -- we are ignoring all but the most recent one for each set of columns. So in order to make sure that we get both types of histograms, I think we can make a relatively small change (I added a comment below).

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

			}

			if colStat, ok := stats.ColStats.Add(cols); ok {

I think we can ensure we get both types of histograms by changing this slightly as follows:

if colStat, ok := stats.ColStats.Add(cols); ok {
  // Keep what's already here
} else if colStat.Histogram == nil && needHistogram && !invertedStatistic {
  col := cols.SingleColumn()
  colStat.Histogram = &props.Histogram{}
  colStat.Histogram.Init(sb.evalCtx, col, stat.Histogram())
}

This takes advantage of the fact that stats.ColStats.Add(cols) returns the existing statistic if ok is false.

jordanlewis

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think we can ensure we get both types of histograms by changing this slightly as follows:
if colStat, ok := stats.ColStats.Add(cols); ok {
  // Keep what's already here
} else if colStat.Histogram == nil && needHistogram && !invertedStatistic {
  col := cols.SingleColumn()
  colStat.Histogram = &props.Histogram{}
  colStat.Histogram.Init(sb.evalCtx, col, stat.Histogram())
}
This takes advantage of the fact that stats.ColStats.Add(cols) returns the existing statistic if ok is false.

Sorry for being dense, but I still don't understand what situation this branch defends against that the code doesn't already handle. The code as written does handle finding both the forward and inverted statistic for a given column set regardless of where either is in the list of historical statistics, since the new branch on line 700 gets executed regardless of whether we already have a forward statistic for the column set.

The only situation I can think of where this would change behavior is something like this list of statistics for a column t:

a forward statistic with no histogram at time x
an inverted statistic with a histogram at time x
a forward statistic with a histogram at time x-1

Is this the situation you're thinking of too or am I still missing something?

jordanlewis

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: can we remove the double-negative and change this to histogramIsBytes?

I'm not sure - would we ever get into a situation in which the histogram contains all null buckets? If so, we wouldn't be able to guarantee that the histogram was all bytes, just that it had no non-bytes buckets. That's why I named it this way.

pkg/sql/opt/memo/statistics_builder.go line 687 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I don't think you need to do this anymore

Done.

rytaft

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I'm not sure - would we ever get into a situation in which the histogram contains all null buckets? If so, we wouldn't be able to guarantee that the histogram was all bytes, just that it had no non-bytes buckets. That's why I named it this way.

I think that could happen if every row in the table is NULL. Maybe the better question is whether we would want to default to creating an inverted or a forward histogram in this case. I'm not sure it really matters in most cases, since I think we should be smart enough to just use the NULL counts and row counts directly to get the correct selectivity even without a histogram. The case where it might matter is if you implement my proposed change below to go back and add a forward histogram when one is found. In particular, if the most recent stats show a table with all NULLs, but historical stats have some non-nulls, we could end up using stale stats.

Alternatively, we could update the TableStatistic interface to include a HistogramType function that returns the type directly (then we wouldn't need to guess and you can get rid of the histogramIsNonBytes function). This info should be available with a bit of plumbing. If you think that's a better route, I'm happy to open a separate PR with that change.

Regardless, it would be good to add a test for this case.

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Sorry for being dense, but I still don't understand what situation this branch defends against that the code doesn't already handle. The code as written does handle finding both the forward and inverted statistic for a given column set regardless of where either is in the list of historical statistics, since the new branch on line 700 gets executed regardless of whether we already have a forward statistic for the column set.

The only situation I can think of where this would change behavior is something like this list of statistics for a column t:

a forward statistic with no histogram at time x

an inverted statistic with a histogram at time x

a forward statistic with a histogram at time x-1

Is this the situation you're thinking of too or am I still missing something?

I'm thinking of this:

an inverted statistic on column c with a histogram at time x
a forward statistic on column c with a histogram at time x
... more stats ...

For the first statistic, we will add a forward stat for column c with no histogram, and an inverted stat with a histogram. Once we get to the second statistic, as the code is currently written, we will see that a forward stat has already been added and it will be a no-op. My proposed change would ensure that we still save the histogram for this case.

As an alternative, you could just not add the forward stat if invertedStatistic is true. That might also work.

jordanlewis

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think that could happen if every row in the table is NULL. Maybe the better question is whether we would want to default to creating an inverted or a forward histogram in this case. I'm not sure it really matters in most cases, since I think we should be smart enough to just use the NULL counts and row counts directly to get the correct selectivity even without a histogram. The case where it might matter is if you implement my proposed change below to go back and add a forward histogram when one is found. In particular, if the most recent stats show a table with all NULLs, but historical stats have some non-nulls, we could end up using stale stats.

Alternatively, we could update the TableStatistic interface to include a HistogramType function that returns the type directly (then we wouldn't need to guess and you can get rid of the histogramIsNonBytes function). This info should be available with a bit of plumbing. If you think that's a better route, I'm happy to open a separate PR with that change.

Regardless, it would be good to add a test for this case.

Great point, I took a look and your suggestion was very simple to implement (no plumbing was necessary, the actual data was available on both the test catalog and production stat struct as-is) so I just went ahead and did it.

In the case where the data is all null, no histogram is created at all, so I think we're fine.

      {
          "avg_size": 0,
          "columns": [
              "a"
          ],
          "created_at": "2022-07-28 04:05:28.848149",
          "distinct_count": 1,
          "histo_col_type": "",
          "null_count": 3,
          "row_count": 3
      }

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I'm thinking of this:

an inverted statistic on column c with a histogram at time x

a forward statistic on column c with a histogram at time x

... more stats ...

For the first statistic, we will add a forward stat for column c with no histogram, and an inverted stat with a histogram. Once we get to the second statistic, as the code is currently written, we will see that a forward stat has already been added and it will be a no-op. My proposed change would ensure that we still save the histogram for this case.

As an alternative, you could just not add the forward stat if invertedStatistic is true. That might also work.

Ahh, I see, thanks for spelling it out, you're totally right of course.

We can't not add the forward stat if invertedStatistic is true, because in the case of ordinary inverted index types like JSON, we don't ever collect forward stats on that type - so we'd be missing out on the non-histogram stats.

I think your example snippet is also going to do something slightly weird - it has the potential of creating a statistic that consists of counts from an inverted histogram at time t, and a histogram from a forward histogram at a time t-1 when the counts were different.

The updated code now takes the entirety of the most recent forward statistic that has a histogram... I think that is the optimal thing.

And confirmed this change updated a test case in the trigram stats test file for the better! It now can use a forward histogram for a forward scan even though an inverted histogram on the same column was present.

jordanlewis

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):
Wait, is this still going to present a problem in a different case? You mention something similar above:

In particular, if the most recent stats show a table with all NULLs, but historical stats have some non-nulls, we could end up using stale stats.

time t-1: statistic with forward histogram for column c
time t: user truncates table
time t+1: statistic with no histogram because the table is empty

In this case, I think the code I just pushed (and your suggested snippet?) would fall into a similar hazard. The newest statistic is missing a histogram because the table is empty, so we replace it with an older statistic that has a histogram, that's now stale and out of date.

We could avoid this hazard with an extra bit per ColumnStatistic that is set to true if the information was derived from an inverted histogram, I guess...

jordanlewis

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Wait, is this still going to present a problem in a different case? You mention something similar above:

In particular, if the most recent stats show a table with all NULLs, but historical stats have some non-nulls, we could end up using stale stats.

time t-1: statistic with forward histogram for column c
time t: user truncates table
time t+1: statistic with no histogram because the table is empty

In this case, I think the code I just pushed (and your suggested snippet?) would fall into a similar hazard. The newest statistic is missing a histogram because the table is empty, so we replace it with an older statistic that has a histogram, that's now stale and out of date.

We could avoid this hazard with an extra bit per ColumnStatistic that is set to true if the information was derived from an inverted histogram, I guess...

I added a commit on top that does this, but it bloats the memo by a bit. PTAL. Bummed that this keeps increasing in complexity.

rytaft

Reviewed 4 of 5 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Great point, I took a look and your suggestion was very simple to implement (no plumbing was necessary, the actual data was available on both the test catalog and production stat struct as-is) so I just went ahead and did it.

In the case where the data is all null, no histogram is created at all, so I think we're fine.
      {
          "avg_size": 0,
          "columns": [
              "a"
          ],
          "created_at": "2022-07-28 04:05:28.848149",
          "distinct_count": 1,
          "histo_col_type": "",
          "null_count": 3,
          "row_count": 3
      }

Nice!

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I added a commit on top that does this, but it bloats the memo by a bit. PTAL. Bummed that this keeps increasing in complexity.

Is there a way to avoid storing the extra bit in the ColStatsMap since we don't need this info outside of this function? One idea is to just check if we already have an inverted stat for this column. e.g., make a function like this:

func alreadyHasInvertedStat(invColOrds int[], stats *props.Statistics) {
  if len(invColOrds) > 0 {
    invCol := tabID.ColumnID(invColOrds[0])
    invCols := opt.MakeColSet(invCol)
    _, ok := stats.ColStats.Lookup(invCols)
    return ok
  }
  return false
}

rytaft · 2022-07-28T14:26:32Z

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Is there a way to avoid storing the extra bit in the ColStatsMap since we don't need this info outside of this function? One idea is to just check if we already have an inverted stat for this column. e.g., make a function like this:
func alreadyHasInvertedStat(invColOrds int[], stats *props.Statistics) {
  if len(invColOrds) > 0 {
    invCol := tabID.ColumnID(invColOrds[0])
    invCols := opt.MakeColSet(invCol)
    _, ok := stats.ColStats.Lookup(invCols)
    return ok
  }
  return false
}

(BTW I didn't check if that compiles -- looks like there's at least a couple fixes needed)

jordanlewis

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

(BTW I didn't check if that compiles -- looks like there's at least a couple fixes needed)

I moved the state into a map that we're already keeping around in the statsbuilder loop and added more tests that ensure we don't use old histograms even if new statistics have no histograms at all for both the inverted and forward case.

rytaft

thanks for all your work on this!

Reviewed 1 of 5 files at r3, 14 of 14 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @mgartner)

pkg/sql/opt/memo/statistics_builder.go line 686 at r9 (raw file):

				//    If these conditions hold, it means that the previous histogram
				//    we found for the current colset was derived from an inverted
				//    histogram, and therefore doesn't have a histogram at all, and the

nit: and therefore doesn't -> and therefore the existing forward statistic doesn't

pkg/sql/opt/memo/statistics_builder.go line 715 at r9 (raw file):

			// Add inverted histograms if necessary.
			if needHistogram && invertedStatistic {
				// Record the fact that we just added an inverted statistic to this

nit: just added -> are adding

Previously, when trying to use persisted histograms to filter an inverted column, if the histogram read from disk was a "forward histogram", the operation would fail with an internal error. This is because inverted histograms and forward histograms use different datatypes on disk, and it's not possible to compare a forward histogram (represented as the datatype of the column) with the inverted constraint (represented as DBytes). Now, this problem is corrected by making sure that we only store forward histogram data in forward histogram column sets, and likewise for inverted histogram data being only stored in inverted histogram column sets. Release note: None

jordanlewis

Thanks for the patient reviews!

bors r+

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)

craig · 2022-07-29T21:54:04Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2022-07-30T01:01:08Z

Build succeeded:

GitHub CI (Cockroach)

jordanlewis requested a review from mgartner July 22, 2022 20:24

jordanlewis requested a review from a team as a code owner July 22, 2022 20:24

mgartner approved these changes Jul 22, 2022

View reviewed changes

jordanlewis force-pushed the fix-trigram-forward-index branch 2 times, most recently from 63395ee to 4e03e3a Compare July 23, 2022 19:05

rytaft reviewed Jul 23, 2022

View reviewed changes

jordanlewis commented Jul 24, 2022

View reviewed changes

rytaft reviewed Jul 25, 2022

View reviewed changes

jordanlewis force-pushed the fix-trigram-forward-index branch from 4e03e3a to defb512 Compare July 25, 2022 22:31

jordanlewis commented Jul 26, 2022

View reviewed changes

jordanlewis force-pushed the fix-trigram-forward-index branch from defb512 to 979b5fb Compare July 26, 2022 02:35

jordanlewis changed the title ~~opt: ignore non-bytes histograms on inv filters~~ opt: permit coexistance of inv, forward histograms Jul 26, 2022

rytaft approved these changes Jul 26, 2022

View reviewed changes

jordanlewis commented Jul 26, 2022

View reviewed changes

rytaft reviewed Jul 26, 2022

View reviewed changes

jordanlewis commented Jul 26, 2022

View reviewed changes

jordanlewis force-pushed the fix-trigram-forward-index branch from 979b5fb to f72952b Compare July 27, 2022 12:52

jordanlewis commented Jul 27, 2022

View reviewed changes

rytaft reviewed Jul 27, 2022

View reviewed changes

jordanlewis force-pushed the fix-trigram-forward-index branch 2 times, most recently from b278752 to f6f54b3 Compare July 28, 2022 04:11

jordanlewis commented Jul 28, 2022

View reviewed changes

rytaft reviewed Jul 28, 2022

View reviewed changes

jordanlewis force-pushed the fix-trigram-forward-index branch from e8f76c0 to bf119db Compare July 29, 2022 02:14

jordanlewis force-pushed the fix-trigram-forward-index branch 2 times, most recently from 9a34714 to 9a7000f Compare July 29, 2022 03:25

jordanlewis commented Jul 29, 2022

View reviewed changes

rytaft approved these changes Jul 29, 2022

View reviewed changes

jordanlewis force-pushed the fix-trigram-forward-index branch from 9a7000f to 963deb8 Compare July 29, 2022 18:50

jordanlewis requested review from mgartner and rytaft July 29, 2022 18:50

jordanlewis commented Jul 29, 2022

View reviewed changes

craig bot merged commit 25d5253 into cockroachdb:master Jul 30, 2022

jordanlewis deleted the fix-trigram-forward-index branch July 30, 2022 17:17

	EncodedKey = &T{
	InternalType: InternalType{
	Family: EncodedKeyFamily,
	Oid: oid.T_unknown,
	Locale: &emptyLocale,
	},
	}

Conversation

jordanlewis commented Jul 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Jul 22, 2022

Uh oh!

jordanlewis commented Jul 22, 2022

Uh oh!

mgartner left a comment

Choose a reason for hiding this comment

Uh oh!

mgartner Jul 22, 2022

Choose a reason for hiding this comment

Uh oh!

jordanlewis Jul 22, 2022

Choose a reason for hiding this comment

Uh oh!

mgartner Jul 22, 2022

Choose a reason for hiding this comment

Uh oh!

jordanlewis Jul 22, 2022

Choose a reason for hiding this comment

Uh oh!

mgartner Jul 22, 2022

Choose a reason for hiding this comment

Uh oh!

jordanlewis Jul 22, 2022

Choose a reason for hiding this comment

Uh oh!

mgartner Jul 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jordanlewis commented Jul 22, 2022

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

jordanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

jordanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

jordanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

jordanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

jordanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

jordanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

jordanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

jordanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft commented Jul 28, 2022

Uh oh!

jordanlewis left a comment

jordanlewis commented Jul 22, 2022 •

edited

Loading

mgartner Jul 25, 2022 •

edited

Loading