Skip to content

opt: permit coexistance of inv, forward histograms#84927

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
jordanlewis:fix-trigram-forward-index
Jul 30, 2022
Merged

opt: permit coexistance of inv, forward histograms#84927
craig[bot] merged 1 commit intocockroachdb:masterfrom
jordanlewis:fix-trigram-forward-index

Conversation

@jordanlewis
Copy link
Copy Markdown
Member

@jordanlewis jordanlewis commented Jul 22, 2022

Closes #84529.

Previously, when trying to use persisted histograms to filter an
inverted column, if the histogram read from disk was a "forward
histogram", the operation would fail with an internal error.

This is because inverted histograms and forward histograms use different
datatypes on disk, and it's not possible to compare a forward histogram
(represented as the datatype of the column) with the inverted constraint
(represented as DBytes).

Now, this problem is corrected by making sure that we only store forward
histogram data in forward histogram column sets, and likewise for
inverted histogram data being only stored in inverted histogram column
sets.

Release note: None

@jordanlewis jordanlewis requested a review from mgartner July 22, 2022 20:24
@jordanlewis jordanlewis requested a review from a team as a code owner July 22, 2022 20:24
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@jordanlewis
Copy link
Copy Markdown
Member Author

Note that this is a slightly more precise fix than the one I proposed in #84529. It will only ignore the histogram if the histogram is of the wrong type: in the steady state, the trigram inverted filter will successfully use the trigram inverted histogram when one exists.

I do have a concern: will the reverse problem exist? In other words, once we build an inverted histogram on the trigram-indexed column, will a forward scan use the wrong histogram too?

I couldn't reproduce any problem of this nature, but it surprises me - I'd expect that something would be going wrong of that nature.

Copy link
Copy Markdown
Contributor

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for fixing this!

statement ok
CREATE TABLE b (a) AS SELECT 'foo';
ANALYZE b;
CREATE INVERTED INDEX ON b(a gin_trgm_ops)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to analyze after the index is created? The inverted column doesn't exist until the index is created.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug was explicitly in this state, where there is an inverted index but a forward histogram on the column that's being inverted indexed. Once you re-analyze, there is no bug. I've added a test case to ensure that both scenarios work okay.

// wouldn't properly reject the histogram. The proper fix, tracked
// in #50655, is to add a type field onto the persisted histogram.
if inputHist.BucketCount() == 0 ||
inputHist.Bucket(0).UpperBound.ResolvedType().Family() == types.BytesFamily {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internally the type we use for inverted columns is types.EncodedKey:

EncodedKey = &T{
InternalType: InternalType{
Family: EncodedKeyFamily,
Oid: oid.T_unknown,
Locale: &emptyLocale,
},
}

But I guess we write the type bytes to the persisted histogram?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we do - there is some commentary about it where we write the persisted inverted histograms.

// looking at a forward histogram, since inverted histograms always
// are of type DBytes. We cannot use the histogram if this is the
// case.
// Note that this is a sketchy situation (no pun intended), because
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only is it sketchy, it means that we won't have accurate row count estimates for queries on trigram indexes. As a result, the optimizer probably won't pick the best plan in a number of cases.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah :( And even worse, isn't it actually more like we won't have accurate row count estimates for queries against text columns that happen to also be inverted indexed? I think that's the thing I'm most scared of here...

Copy link
Copy Markdown
Contributor

@mgartner mgartner Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah :( And even worse, isn't it actually more like we won't have accurate row count estimates for queries against text columns that happen to also be inverted indexed? I think that's the thing I'm most scared of here...

Correct, that's what I meant to say.

@jordanlewis
Copy link
Copy Markdown
Member Author

Hmm, a bunch of tests failed... I'm going to investigate, but I wonder if there is something more fishy underfoot here.

@jordanlewis jordanlewis force-pushed the fix-trigram-forward-index branch 2 times, most recently from 63395ee to 4e03e3a Compare July 23, 2022 19:05
Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

Reviewed 1 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @mgartner)


pkg/sql/opt/memo/statistics_builder.go line 871 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Yeah :( And even worse, isn't it actually more like we won't have accurate row count estimates for queries against text columns that happen to also be inverted indexed? I think that's the thing I'm most scared of here...

Probably worth adding a stats-specific test for this. Can you add some tests in opt/memo/testdata/stats/scan similar to the tests you added in logic /execbuilder tests?

We can probably fix this issue by adding a check before line 865 or saving the previous value of s.RowCount.


pkg/sql/opt/memo/statistics_builder.go line 884 at r2 (raw file):

							continue
						case types.BytesFamily:
							break LOOP

nit: I'd like to avoid using these GOTO style statements here if possible. I feel like this would be a good candidate for a helper function.

Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The execbuilder test I added doesn't have stable results - if I dump the histogram after running ANALYZE, the results differ from run to run. I know the statistics collection uses random sampling, but I guess I figured in logic tests we'd fix the seed. Is the proper way to fix this to inject artificial histograms rather than running ANALYZE? And if so, should all of the test cases that need the histograms really just belong in the stats testdata?

Also, in adding the stats test you suggested, I'm noticing that the stats tests have nice histograms with just a few buckets. How do you generate those nice test histograms? Is there a way to ask the stats system to use a lower bucket count?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/memo/statistics_builder.go line 871 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Probably worth adding a stats-specific test for this. Can you add some tests in opt/memo/testdata/stats/scan similar to the tests you added in logic /execbuilder tests?

We can probably fix this issue by adding a check before line 865 or saving the previous value of s.RowCount.

That's true, we should also avoid editing the row count if we're looking at the wrong kind of histogram. I've fixed that.

But isn't there still the problem in the opposite case, where we're trying to use a histogram for an ordinary, non-inverted scan on a text column, and we see a set of inverted statistics? I think we'd want a guard in the opposite direction, in applyIndexConstraint, that makes sure that we aren't using an inverted histogram for a forward scan.

What I don't understand is why this theoretical problem doesn't manifest as a type error, like the original bug did.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The execbuilder test I added doesn't have stable results - if I dump the histogram after running ANALYZE, the results differ from run to run. I know the statistics collection uses random sampling, but I guess I figured in logic tests we'd fix the seed. Is the proper way to fix this to inject artificial histograms rather than running ANALYZE? And if so, should all of the test cases that need the histograms really just belong in the stats testdata?

You could use INJECT STATISTICS, but I think it's also fine to get rid of these execbuilder tests altogether.

Also, in adding the stats test you suggested, I'm noticing that the stats tests have nice histograms with just a few buckets. How do you generate those nice test histograms? Is there a way to ask the stats system to use a lower bucket count?

We don't currently have a way to control the number of histogram buckets collected by CREATE STATISTICS/ANALYZE (it's hardcoded as 200 for index columns, and 2 for other columns). Those histograms in the stats tests were just manually constructed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/memo/statistics_builder.go line 871 at r1 (raw file):

That's true, we should also avoid editing the row count if we're looking at the wrong kind of histogram. I've fixed that.

Doesn't look like you've pushed the fix.

But isn't there still the problem in the opposite case, where we're trying to use a histogram for an ordinary, non-inverted scan on a text column, and we see a set of inverted statistics? I think we'd want a guard in the opposite direction, in applyIndexConstraint, that makes sure that we aren't using an inverted histogram for a forward scan.

I think you're right.

What I don't understand is why this theoretical problem doesn't manifest as a type error, like the original bug did.

I'm not sure -- it would be good to add some tests that exercise this.

@jordanlewis jordanlewis force-pushed the fix-trigram-forward-index branch from 4e03e3a to defb512 Compare July 25, 2022 22:31
Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ready for another look, I redid it like we discussed on Slack. Now, the code carefully segments inverted histograms from forward histograms in the statistics map, so there's no chance for code to accidentally get the wrong type of histogram when building plans.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 871 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Yeah :( And even worse, isn't it actually more like we won't have accurate row count estimates for queries against text columns that happen to also be inverted indexed? I think that's the thing I'm most scared of here...

Correct, that's what I meant to say.

Removed this code.


pkg/sql/opt/memo/testdata/stats/inverted-geo line 774 at r3 (raw file):

      │    │    └── stats: [rows=1, distinct(2)=1, null(2)=0, avgsize(2)=4]
      │    │        histogram(2)=  0                                                                                            4.9994e-05                                                                                            0.99988                                                           7.4991e-05
      │    │                     <--- '0107000020E6100000010000000103000020E610000001000000040000008C7D4198BD2B574080A6FD3A111E4D40F0DF86928AA12BC03A59212197A35140CEA3CEE206B863C0FC7649EB60BA53408C7D4198BD2B574080A6FD3A111E4D40' --------- '0102000020E61000000300000005D8E086BB6365C03F9E5737DD1A53C0C04ECDED673B55C06711C00C7C0240C0B8EABD96072856404A9D2C529FC74EC0'

I think the reason for this change is that the test is buggy.

The injected statistics here should have histo_col_type=BYTES, not GEOGRAPHY, unless they were collected on the ordinary column. But I think we don't collect forward statistics on GEOGRAPHY columns. So I think this test is just wrong, thoughts? The fix would be to change the injected stats to be bytes type.

@jordanlewis jordanlewis force-pushed the fix-trigram-forward-index branch from defb512 to 979b5fb Compare July 26, 2022 02:35
@jordanlewis jordanlewis changed the title opt: ignore non-bytes histograms on inv filters opt: permit coexistance of inv, forward histograms Jul 26, 2022
Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thanks for working through these details!

I thought we had discussed iterating through older statistics to find both forward and inverted histograms? Feel free to open an issue and/or save that work for another PR. I think it's probably worth doing at some point, though (even before we fix #50655), since we can limit the extra computation to only cases where it's useful. For example, we can limit it to cases where we have a string column that also has an inverted index. We will (almost) always collect a forward histogram (it will just be smaller if there is no forward index), so we should try to find both the forward and inverted histograms for these columns.

Reviewed 3 of 5 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

			if needHistogram {
				invertedColOrds = invertedIndexCols[stat.ColumnOrdinal(0)]
				invertedStatistic = len(invertedColOrds) > 0 && !histogramIsNonBytes(stat.Histogram())

nit: can we remove the double-negative and change this to histogramIsBytes?


pkg/sql/opt/memo/statistics_builder.go line 687 at r4 (raw file):

				// to calling stats.ColStats.Add() on any inverted column statistics
				// created above.
				colStat, _ = stats.ColStats.Lookup(cols)

I don't think you need to do this anymore


pkg/sql/opt/memo/testdata/stats/inverted-geo line 774 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think the reason for this change is that the test is buggy.

The injected statistics here should have histo_col_type=BYTES, not GEOGRAPHY, unless they were collected on the ordinary column. But I think we don't collect forward statistics on GEOGRAPHY columns. So I think this test is just wrong, thoughts? The fix would be to change the injected stats to be bytes type.

Your change looks good

Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably missing something, but I thought that we were actually already doing that (and just ignoring all but the most recent one because of the way colStatsMap works:

		for i := 0; i < tab.StatisticCount(); i++ {
			stat := tab.Statistic(i)

As I understood, this loop finds every statistic in the table, including historical statistics.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rytaft)

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct -- we are ignoring all but the most recent one for each set of columns. So in order to make sure that we get both types of histograms, I think we can make a relatively small change (I added a comment below).

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

			}

			if colStat, ok := stats.ColStats.Add(cols); ok {

I think we can ensure we get both types of histograms by changing this slightly as follows:

if colStat, ok := stats.ColStats.Add(cols); ok {
  // Keep what's already here
} else if colStat.Histogram == nil && needHistogram && !invertedStatistic {
  col := cols.SingleColumn()
  colStat.Histogram = &props.Histogram{}
  colStat.Histogram.Init(sb.evalCtx, col, stat.Histogram())
}

This takes advantage of the fact that stats.ColStats.Add(cols) returns the existing statistic if ok is false.

Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think we can ensure we get both types of histograms by changing this slightly as follows:

if colStat, ok := stats.ColStats.Add(cols); ok {
  // Keep what's already here
} else if colStat.Histogram == nil && needHistogram && !invertedStatistic {
  col := cols.SingleColumn()
  colStat.Histogram = &props.Histogram{}
  colStat.Histogram.Init(sb.evalCtx, col, stat.Histogram())
}

This takes advantage of the fact that stats.ColStats.Add(cols) returns the existing statistic if ok is false.

Sorry for being dense, but I still don't understand what situation this branch defends against that the code doesn't already handle. The code as written does handle finding both the forward and inverted statistic for a given column set regardless of where either is in the list of historical statistics, since the new branch on line 700 gets executed regardless of whether we already have a forward statistic for the column set.

The only situation I can think of where this would change behavior is something like this list of statistics for a column t:

  1. a forward statistic with no histogram at time x
  2. an inverted statistic with a histogram at time x
  3. a forward statistic with a histogram at time x-1

Is this the situation you're thinking of too or am I still missing something?

@jordanlewis jordanlewis force-pushed the fix-trigram-forward-index branch from 979b5fb to f72952b Compare July 27, 2022 12:52
Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: can we remove the double-negative and change this to histogramIsBytes?

I'm not sure - would we ever get into a situation in which the histogram contains all null buckets? If so, we wouldn't be able to guarantee that the histogram was all bytes, just that it had no non-bytes buckets. That's why I named it this way.


pkg/sql/opt/memo/statistics_builder.go line 687 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I don't think you need to do this anymore

Done.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I'm not sure - would we ever get into a situation in which the histogram contains all null buckets? If so, we wouldn't be able to guarantee that the histogram was all bytes, just that it had no non-bytes buckets. That's why I named it this way.

I think that could happen if every row in the table is NULL. Maybe the better question is whether we would want to default to creating an inverted or a forward histogram in this case. I'm not sure it really matters in most cases, since I think we should be smart enough to just use the NULL counts and row counts directly to get the correct selectivity even without a histogram. The case where it might matter is if you implement my proposed change below to go back and add a forward histogram when one is found. In particular, if the most recent stats show a table with all NULLs, but historical stats have some non-nulls, we could end up using stale stats.

Alternatively, we could update the TableStatistic interface to include a HistogramType function that returns the type directly (then we wouldn't need to guess and you can get rid of the histogramIsNonBytes function). This info should be available with a bit of plumbing. If you think that's a better route, I'm happy to open a separate PR with that change.

Regardless, it would be good to add a test for this case.


pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Sorry for being dense, but I still don't understand what situation this branch defends against that the code doesn't already handle. The code as written does handle finding both the forward and inverted statistic for a given column set regardless of where either is in the list of historical statistics, since the new branch on line 700 gets executed regardless of whether we already have a forward statistic for the column set.

The only situation I can think of where this would change behavior is something like this list of statistics for a column t:

  1. a forward statistic with no histogram at time x
  2. an inverted statistic with a histogram at time x
  3. a forward statistic with a histogram at time x-1

Is this the situation you're thinking of too or am I still missing something?

I'm thinking of this:

  1. an inverted statistic on column c with a histogram at time x
  2. a forward statistic on column c with a histogram at time x
  3. ... more stats ...

For the first statistic, we will add a forward stat for column c with no histogram, and an inverted stat with a histogram. Once we get to the second statistic, as the code is currently written, we will see that a forward stat has already been added and it will be a no-op. My proposed change would ensure that we still save the histogram for this case.

As an alternative, you could just not add the forward stat if invertedStatistic is true. That might also work.

@jordanlewis jordanlewis force-pushed the fix-trigram-forward-index branch 2 times, most recently from b278752 to f6f54b3 Compare July 28, 2022 04:11
Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think that could happen if every row in the table is NULL. Maybe the better question is whether we would want to default to creating an inverted or a forward histogram in this case. I'm not sure it really matters in most cases, since I think we should be smart enough to just use the NULL counts and row counts directly to get the correct selectivity even without a histogram. The case where it might matter is if you implement my proposed change below to go back and add a forward histogram when one is found. In particular, if the most recent stats show a table with all NULLs, but historical stats have some non-nulls, we could end up using stale stats.

Alternatively, we could update the TableStatistic interface to include a HistogramType function that returns the type directly (then we wouldn't need to guess and you can get rid of the histogramIsNonBytes function). This info should be available with a bit of plumbing. If you think that's a better route, I'm happy to open a separate PR with that change.

Regardless, it would be good to add a test for this case.

Great point, I took a look and your suggestion was very simple to implement (no plumbing was necessary, the actual data was available on both the test catalog and production stat struct as-is) so I just went ahead and did it.

In the case where the data is all null, no histogram is created at all, so I think we're fine.

      {
          "avg_size": 0,
          "columns": [
              "a"
          ],
          "created_at": "2022-07-28 04:05:28.848149",
          "distinct_count": 1,
          "histo_col_type": "",
          "null_count": 3,
          "row_count": 3
      }

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I'm thinking of this:

  1. an inverted statistic on column c with a histogram at time x
  2. a forward statistic on column c with a histogram at time x
  3. ... more stats ...

For the first statistic, we will add a forward stat for column c with no histogram, and an inverted stat with a histogram. Once we get to the second statistic, as the code is currently written, we will see that a forward stat has already been added and it will be a no-op. My proposed change would ensure that we still save the histogram for this case.

As an alternative, you could just not add the forward stat if invertedStatistic is true. That might also work.

Ahh, I see, thanks for spelling it out, you're totally right of course.

We can't not add the forward stat if invertedStatistic is true, because in the case of ordinary inverted index types like JSON, we don't ever collect forward stats on that type - so we'd be missing out on the non-histogram stats.

I think your example snippet is also going to do something slightly weird - it has the potential of creating a statistic that consists of counts from an inverted histogram at time t, and a histogram from a forward histogram at a time t-1 when the counts were different.

The updated code now takes the entirety of the most recent forward statistic that has a histogram... I think that is the optimal thing.

And confirmed this change updated a test case in the trigram stats test file for the better! It now can use a forward histogram for a forward scan even though an inverted histogram on the same column was present.

Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):
Wait, is this still going to present a problem in a different case? You mention something similar above:

In particular, if the most recent stats show a table with all NULLs, but historical stats have some non-nulls, we could end up using stale stats.

time t-1: statistic with forward histogram for column c
time t: user truncates table
time t+1: statistic with no histogram because the table is empty

In this case, I think the code I just pushed (and your suggested snippet?) would fall into a similar hazard. The newest statistic is missing a histogram because the table is empty, so we replace it with an older statistic that has a histogram, that's now stale and out of date.

We could avoid this hazard with an extra bit per ColumnStatistic that is set to true if the information was derived from an inverted histogram, I guess...

Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Wait, is this still going to present a problem in a different case? You mention something similar above:

In particular, if the most recent stats show a table with all NULLs, but historical stats have some non-nulls, we could end up using stale stats.

time t-1: statistic with forward histogram for column c
time t: user truncates table
time t+1: statistic with no histogram because the table is empty

In this case, I think the code I just pushed (and your suggested snippet?) would fall into a similar hazard. The newest statistic is missing a histogram because the table is empty, so we replace it with an older statistic that has a histogram, that's now stale and out of date.

We could avoid this hazard with an extra bit per ColumnStatistic that is set to true if the information was derived from an inverted histogram, I guess...

I added a commit on top that does this, but it bloats the memo by a bit. PTAL. Bummed that this keeps increasing in complexity.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 666 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Great point, I took a look and your suggestion was very simple to implement (no plumbing was necessary, the actual data was available on both the test catalog and production stat struct as-is) so I just went ahead and did it.

In the case where the data is all null, no histogram is created at all, so I think we're fine.

      {
          "avg_size": 0,
          "columns": [
              "a"
          ],
          "created_at": "2022-07-28 04:05:28.848149",
          "distinct_count": 1,
          "histo_col_type": "",
          "null_count": 3,
          "row_count": 3
      }

Nice!


pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I added a commit on top that does this, but it bloats the memo by a bit. PTAL. Bummed that this keeps increasing in complexity.

Is there a way to avoid storing the extra bit in the ColStatsMap since we don't need this info outside of this function? One idea is to just check if we already have an inverted stat for this column. e.g., make a function like this:

func alreadyHasInvertedStat(invColOrds int[], stats *props.Statistics) {
  if len(invColOrds) > 0 {
    invCol := tabID.ColumnID(invColOrds[0])
    invCols := opt.MakeColSet(invCol)
    _, ok := stats.ColStats.Lookup(invCols)
    return ok
  }
  return false
}

@rytaft
Copy link
Copy Markdown
Collaborator

rytaft commented Jul 28, 2022

pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Is there a way to avoid storing the extra bit in the ColStatsMap since we don't need this info outside of this function? One idea is to just check if we already have an inverted stat for this column. e.g., make a function like this:

func alreadyHasInvertedStat(invColOrds int[], stats *props.Statistics) {
  if len(invColOrds) > 0 {
    invCol := tabID.ColumnID(invColOrds[0])
    invCols := opt.MakeColSet(invCol)
    _, ok := stats.ColStats.Lookup(invCols)
    return ok
  }
  return false
}

(BTW I didn't check if that compiles -- looks like there's at least a couple fixes needed)

@jordanlewis jordanlewis force-pushed the fix-trigram-forward-index branch from e8f76c0 to bf119db Compare July 29, 2022 02:14
@jordanlewis jordanlewis force-pushed the fix-trigram-forward-index branch 2 times, most recently from 9a34714 to 9a7000f Compare July 29, 2022 03:25
Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 669 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

(BTW I didn't check if that compiles -- looks like there's at least a couple fixes needed)

I moved the state into a map that we're already keeping around in the statsbuilder loop and added more tests that ensure we don't use old histograms even if new statistics have no histograms at all for both the inverted and forward case.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: thanks for all your work on this!

Reviewed 1 of 5 files at r3, 14 of 14 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @mgartner)


pkg/sql/opt/memo/statistics_builder.go line 686 at r9 (raw file):

				//    If these conditions hold, it means that the previous histogram
				//    we found for the current colset was derived from an inverted
				//    histogram, and therefore doesn't have a histogram at all, and the

nit: and therefore doesn't -> and therefore the existing forward statistic doesn't


pkg/sql/opt/memo/statistics_builder.go line 715 at r9 (raw file):

			// Add inverted histograms if necessary.
			if needHistogram && invertedStatistic {
				// Record the fact that we just added an inverted statistic to this

nit: just added -> are adding

Previously, when trying to use persisted histograms to filter an
inverted column, if the histogram read from disk was a "forward
histogram", the operation would fail with an internal error.

This is because inverted histograms and forward histograms use different
datatypes on disk, and it's not possible to compare a forward histogram
(represented as the datatype of the column) with the inverted constraint
(represented as DBytes).

Now, this problem is corrected by making sure that we only store forward
histogram data in forward histogram column sets, and likewise for
inverted histogram data being only stored in inverted histogram column
sets.

Release note: None
@jordanlewis jordanlewis force-pushed the fix-trigram-forward-index branch from 9a7000f to 963deb8 Compare July 29, 2022 18:50
@jordanlewis jordanlewis requested review from mgartner and rytaft July 29, 2022 18:50
Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patient reviews!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 29, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 30, 2022

Build succeeded:

@craig craig bot merged commit 25d5253 into cockroachdb:master Jul 30, 2022
@jordanlewis jordanlewis deleted the fix-trigram-forward-index branch July 30, 2022 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: forward index histogram on an trigram inverted index column causes internal errors

4 participants