Skip to content

sql: support histograms for types with inverted indexes#50238

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
madelynnblue:inv-hist
Jun 22, 2020
Merged

sql: support histograms for types with inverted indexes#50238
craig[bot] merged 1 commit intocockroachdb:masterfrom
madelynnblue:inv-hist

Conversation

@madelynnblue
Copy link
Copy Markdown
Contributor

Support inverted index keys by adding some new fields to the
samplers. They now have maps from column IDs to sketches and sample
reservoirs that allow each inverted index to sketch and sample individual
keys from the inverted index. Two new row types are now being sent to
the aggregator to distribute this new data.

This design keeps the existing and single table reader and computes
the index entries from each datum, requiring it to have new forms of
communication (i.e., the 2 new columns and 2 new row types) over the
sampler -> aggregator connection. We also considered a design where we
read the PK + each inverted index. Our assumption was that in the tradeoff
between using extra CPU to compute each inverted key and using extra disk
to read each inverted index, using more CPU was perhaps faster overall.

Fixes #48219

Release note: None

@madelynnblue madelynnblue requested a review from rytaft June 15, 2020 19:15
@madelynnblue madelynnblue requested a review from a team as a code owner June 15, 2020 19:15
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting to see this!

[nit] in the title of the commit/PR "histograms types" -> "histograms for types"?

Reviewed 13 of 13 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson)


pkg/sql/distsql_plan_stats.go, line 132 at r1 (raw file):

				return nil, err
			}
			if sqlbase.ColumnTypeIsInvertedIndexable(col.Type) {

I feel like this stuff should happen in create_stats.go, and maybe only if an inverted index exists on the given column.


pkg/sql/logictest/testdata/logic_test/distsql_stats, line 581 at r1 (raw file):

arr_stats        {x}           4          3               1

# Regression test for #46964. Do not try to create a histogram on the array column.

This comment needs an update


pkg/sql/opt/exec/execbuilder/testdata/inv_stats, line 1 at r1 (raw file):

# LogicTest: 5node

I think this file should go in logictests (execbuilder tests are sort of reserved for explain plan output, trace output, etc).


pkg/sql/rowexec/sample_aggregator.go, line 70 at r1 (raw file):

	// The sample aggregator tracks sketches and reservoirs for inverted
	// index keys. These maps are indexed by column ID.
	invSr     map[uint32]*stats.SampleReservoir

Why not use a slice instead of a map (here and below), similar to the case for non-inverted columns?

If you leave it as a map, though, shouldn't it be keyed by a sqlbase.ColumnID?


pkg/sql/rowexec/sample_aggregator.go, line 148 at r1 (raw file):

		var sr stats.SampleReservoir
		// The datums are converted to their inverted index bytes and
		// sent as single DBytes column.

[nit] as single -> as a single


pkg/sql/rowexec/sample_aggregator.go, line 275 at r1 (raw file):

		}

		// There are four kinds of rows normal and inverted samples and

This sentence is missing a word or two...


pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):

			if rank, err := row[s.rankCol].GetInt(); err == nil {
				// Inverted sample row.
				if err != nil {

err == nil was just checked above....


pkg/sql/rowexec/sample_aggregator.go, line 289 at r1 (raw file):

				}
				// Retain the rows with the top ranks.
				if err := s.invSr[colIdx].SampleRow(ctx, s.EvalCtx, row[s.invIdxKeyCol:s.invIdxKeyCol+1], uint64(rank)); err != nil {

[nit] long line


pkg/sql/rowexec/sample_aggregator.go, line 301 at r1 (raw file):

			}
			// Inverted sketch row.
			invSketch, ok := s.invSketch[colIdx]

What's happening to all this inverted sketch data? Looks like it's not getting stored anywhere. I'm wondering if it's worth adding a boolean column to the system.table_statistics table to indicate whether or not the stat is inverted. Then the histogram should also be associated with the inverted stat, and not the normal stat.

I know altering a system table is kind of a pain, but it seems to me like it might be worth it in this case. cc @RaduBerinde for your opinion.


pkg/sql/rowexec/sample_aggregator.go, line 332 at r1 (raw file):

				return false, errors.NewAssertionErrorWithWrappedErrf(err, "merging sketch data")
			}
			continue

Seems like there's a lot of duplication between the inverted and regular cases -- would be great if you could extract the common bits into a couple of helper functions.


pkg/sql/rowexec/sample_aggregator.go, line 336 at r1 (raw file):

		if rank, err := row[s.rankCol].GetInt(); err == nil {
			// Sample row.
			if err != nil {

err == nil was just checked above.


pkg/sql/rowexec/sample_aggregator.go, line 447 at r1 (raw file):

					s.EvalCtx,
					invSr.Get(),
					0, /* colIdx */

why is this 0? Don't we want the inverted column ID here?


pkg/sql/rowexec/sampler.go, line 189 at r1 (raw file):

	outTypes = append(outTypes, types.Bytes)

	// An INT column indicating the inverted sketch index.

here you're saying sketch index, but in other places you're saying inverted column ID...


pkg/sql/rowexec/sampler.go, line 246 at r1 (raw file):

	// Inverted index variables for EncodeInvertedIndexKeys.
	invIndexDesc := &sqlbase.IndexDescriptor{

If the column has an inverted index, you should probably use the real descriptor (feel free to add a TODO if you don't want to deal with it now).


pkg/sql/rowexec/sampler.go, line 386 at r1 (raw file):

					if !emitHelper(ctx, &s.Out, nil /* row */, meta, s.pushTrailingMeta, s.input) {
						return true, nil
					}

Feels like this stuff could go into a helper function to reduce duplication a bit...


pkg/sql/rowexec/sampler.go, line 456 at r1 (raw file):

		if !emitHelper(ctx, &s.Out, outRow, nil /* meta */, s.pushTrailingMeta, s.input) {
			return true, nil
		}

another place that could benefit from a helper function...


pkg/sql/rowexec/sampler.go, line 478 at r1 (raw file):

func (s *sketchInfo) addRow(
	row sqlbase.EncDatumRow, typs []*types.T, buf *[]byte, da *sqlbase.DatumAlloc,

why not just pass buf as a []byte?

@madelynnblue madelynnblue changed the title sql: support histograms types with inverted indexes sql: support histograms for types with inverted indexes Jun 16, 2020
Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only looked at the .proto changes so far - we need to explain all the changes to the "protocol" there; it should be clear how the processors communicate from just reading the protos.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson)


pkg/sql/execinfrapb/processors_table_stats.proto, line 73 at r1 (raw file):

//       - an INT column with the "rank" of the row; this is a random value
//         associated with the row (necessary for combining sample sets).
//   2. sketch columns:

Some explanation related to inverted_sketches is needed here.. We are adding two new "sketch columns" right?


pkg/sql/execinfrapb/processors_table_stats.proto, line 97 at r1 (raw file):

  optional double max_fraction_idle = 3 [(gogoproto.nullable) = false];

  repeated SketchSpec inverted_sketches = 4 [(gogoproto.nullable) = false];

[nit] I'd put this next to sketches even if the ids are not in order


pkg/sql/execinfrapb/processors_table_stats.proto, line 148 at r1 (raw file):

  optional uint64 rows_expected = 7 [(gogoproto.nullable) = false];

  repeated SketchSpec inverted_sketches = 8 [(gogoproto.nullable) = false];

[nit] I'd put this next to sketches even if the ids are not in order

Copy link
Copy Markdown
Contributor Author

@madelynnblue madelynnblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @RaduBerinde, and @rytaft)


pkg/sql/distsql_plan_stats.go, line 132 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I feel like this stuff should happen in create_stats.go, and maybe only if an inverted index exists on the given column.

Done.


pkg/sql/logictest/testdata/logic_test/distsql_stats, line 581 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This comment needs an update

Done.


pkg/sql/rowexec/sample_aggregator.go, line 70 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Why not use a slice instead of a map (here and below), similar to the case for non-inverted columns?

If you leave it as a map, though, shouldn't it be keyed by a sqlbase.ColumnID?

I looked again and this is actually an index, not a ColumnID. I've fixed the comments to correctly describe it.

I believe it's easier as a map because that allows for easy iteration through the needed columns when computing the inverted keys for each row.


pkg/sql/rowexec/sample_aggregator.go, line 148 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] as single -> as a single

Done.


pkg/sql/rowexec/sample_aggregator.go, line 275 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This sentence is missing a word or two...

Done.


pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

err == nil was just checked above....

Done.


pkg/sql/rowexec/sample_aggregator.go, line 289 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] long line

Done.


pkg/sql/rowexec/sample_aggregator.go, line 301 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

What's happening to all this inverted sketch data? Looks like it's not getting stored anywhere. I'm wondering if it's worth adding a boolean column to the system.table_statistics table to indicate whether or not the stat is inverted. Then the histogram should also be associated with the inverted stat, and not the normal stat.

I know altering a system table is kind of a pain, but it seems to me like it might be worth it in this case. cc @RaduBerinde for your opinion.

The inverted sketch data is used below in the writeResults function. We need it when generating histograms for distinct, null, and total row counts.

I'm not going to merge this PR until I understand more of what the optimizer needs during costing and stats building of inverted index stats, with or without an index. That should help inform if the full inverted sketch also needs to be saved.


pkg/sql/rowexec/sample_aggregator.go, line 332 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Seems like there's a lot of duplication between the inverted and regular cases -- would be great if you could extract the common bits into a couple of helper functions.

Done.


pkg/sql/rowexec/sample_aggregator.go, line 336 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

err == nil was just checked above.

Done.


pkg/sql/rowexec/sample_aggregator.go, line 447 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

why is this 0? Don't we want the inverted column ID here?

We want the column index of the passed sample row of inverted index keys, which only has a single bytes column. Added a comment to explain.


pkg/sql/rowexec/sampler.go, line 189 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

here you're saying sketch index, but in other places you're saying inverted column ID...

Done.


pkg/sql/rowexec/sampler.go, line 246 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

If the column has an inverted index, you should probably use the real descriptor (feel free to add a TODO if you don't want to deal with it now).

Added a TODO.


pkg/sql/rowexec/sampler.go, line 386 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Feels like this stuff could go into a helper function to reduce duplication a bit...

Done.


pkg/sql/rowexec/sampler.go, line 456 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

another place that could benefit from a helper function...

Done.


pkg/sql/rowexec/sampler.go, line 478 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

why not just pass buf as a []byte?

Because when the passed buf gets enlarged it will allocate a new portion of memory that the calling function won't know about and thus be unable to reuse.


pkg/sql/opt/exec/execbuilder/testdata/inv_stats, line 1 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think this file should go in logictests (execbuilder tests are sort of reserved for explain plan output, trace output, etc).

Done.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 10 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson)


pkg/sql/execinfrapb/processors_table_stats.proto, line 60 at r2 (raw file):

// sample_size sampled rows.
//
// For each column with an inverted index, a sketch and sampler are

sampler -> sample reservoir?


pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

Done.

Maybe this error should now be thrown in an else clause?


pkg/sql/rowexec/sample_aggregator.go, line 301 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

The inverted sketch data is used below in the writeResults function. We need it when generating histograms for distinct, null, and total row counts.

I'm not going to merge this PR until I understand more of what the optimizer needs during costing and stats building of inverted index stats, with or without an index. That should help inform if the full inverted sketch also needs to be saved.

Ok, sounds good


pkg/sql/rowexec/sampler.go, line 189 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

Done.

I'm a bit confused by what "row column" means


pkg/sql/rowexec/sampler.go, line 456 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

Done.

I don't see this one -- maybe add a function for sketchRow?

Copy link
Copy Markdown
Contributor Author

@madelynnblue madelynnblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)


pkg/sql/execinfrapb/processors_table_stats.proto, line 60 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

sampler -> sample reservoir?

Done.


pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Maybe this error should now be thrown in an else clause?

Which error? The ones for invColIdx and rankCol are expected and not actual errors, we're using errors to see if the column was not-null, but it's also fine if they are null.


pkg/sql/rowexec/sampler.go, line 189 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I'm a bit confused by what "row column" means

Reworded.


pkg/sql/rowexec/sampler.go, line 456 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I don't see this one -- maybe add a function for sketchRow?

Oops, misread your comment. Yes, now done.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice work!

Reviewed 6 of 6 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

Which error? The ones for invColIdx and rankCol are expected and not actual errors, we're using errors to see if the column was not-null, but it's also fine if they are null.

Ah never mind -- I was confused.

Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Well done!

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mjibson)


pkg/sql/execinfrapb/processors_table_stats.proto, line 84 at r3 (raw file):

//       - a BYTES column with the binary sketch data (format
//         dependent on the sketch type).
//   3. inverted sampled row columns:

I'm confused. This comment sounds like we are adding 7 columns, but we are only adding 2, no? We should describe the schema first (the columns that are produced), and then we can list which columns are used for each of the four cases


pkg/sql/rowexec/sample_aggregator.go, line 301 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Ok, sounds good

We can add a column, it shouldn't be a problem. We can also use "special" columnIDs to indicate inverted stuff (e.g. negative values).


pkg/sql/rowexec/sample_aggregator.go, line 329 at r3 (raw file):

}

func (s *sampleAggregator) sketchRow(

[nit] processSketchRow


pkg/sql/rowexec/sampler.go, line 164 at r3 (raw file):

	s.sr.Init(int(spec.SampleSize), inTypes, &s.memAcc, sampleCols)

	outTypes := make([]*types.T, 0, len(inTypes)+5)
  • 7

pkg/sql/rowexec/sampler.go, line 357 at r3 (raw file):

			}
			for _, key := range invKeys {
				invRow[0].Datum = tree.NewDBytes(tree.DBytes(key))

Should use da.NewDBytes


pkg/sql/rowexec/sampler.go, line 432 at r3 (raw file):

}

func (s *samplerProcessor) sketchRow(

[nit] emitSketchRow


pkg/sql/rowexec/sampler.go, line 448 at r3 (raw file):

}

func (s *samplerProcessor) sampleRow(

[nit] could use a comment (can be similar to sr.SampleRow)


pkg/sql/rowexec/sampler.go, line 481 at r3 (raw file):

}

func (s *sketchInfo) addRow(

[nit] add comment

Support inverted index keys by adding some new fields to the
samplers. They now have maps from column IDs to sketches and sample
reservoirs that allow each inverted index to sketch and sample individual
keys from the inverted index. Two new row types are now being sent to
the aggregator to distribute this new data.

This design keeps the existing and single table reader and computes
the index entries from each datum, requiring it to have new forms of
communication (i.e., the 2 new columns and 2 new row types) over the
sampler -> aggregator connection. We also considered a design where we
read the PK + each inverted index. Our assumption was that in the tradeoff
between using extra CPU to compute each inverted key and using extra disk
to read each inverted index, using more CPU was perhaps faster overall.

Fixes #48219

Release note: None
Copy link
Copy Markdown
Contributor Author

@madelynnblue madelynnblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @RaduBerinde and @rytaft)


pkg/sql/execinfrapb/processors_table_stats.proto, line 84 at r3 (raw file):

Previously, RaduBerinde wrote…

I'm confused. This comment sounds like we are adding 7 columns, but we are only adding 2, no? We should describe the schema first (the columns that are produced), and then we can list which columns are used for each of the four cases

Yes. I kind of combined columns and row here. I'll split them out.

@madelynnblue
Copy link
Copy Markdown
Contributor Author

bors r+

@madelynnblue
Copy link
Copy Markdown
Contributor Author

I said early I wouldn't merge this yet, but I want to do some other work and think it'll be better than trying to keep this up to date.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 22, 2020

Build succeeded

@craig craig bot merged commit c7fadb7 into cockroachdb:master Jun 22, 2020
@madelynnblue madelynnblue deleted the inv-hist branch June 22, 2020 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

opt: collect stats on inverted indexes

4 participants