sql: support histograms for types with inverted indexes#50238
sql: support histograms for types with inverted indexes#50238craig[bot] merged 1 commit intocockroachdb:masterfrom madelynnblue:inv-hist
Conversation
rytaft
left a comment
There was a problem hiding this comment.
Exciting to see this!
[nit] in the title of the commit/PR "histograms types" -> "histograms for types"?
Reviewed 13 of 13 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @mjibson)
pkg/sql/distsql_plan_stats.go, line 132 at r1 (raw file):
return nil, err } if sqlbase.ColumnTypeIsInvertedIndexable(col.Type) {
I feel like this stuff should happen in create_stats.go, and maybe only if an inverted index exists on the given column.
pkg/sql/logictest/testdata/logic_test/distsql_stats, line 581 at r1 (raw file):
arr_stats {x} 4 3 1 # Regression test for #46964. Do not try to create a histogram on the array column.
This comment needs an update
pkg/sql/opt/exec/execbuilder/testdata/inv_stats, line 1 at r1 (raw file):
# LogicTest: 5node
I think this file should go in logictests (execbuilder tests are sort of reserved for explain plan output, trace output, etc).
pkg/sql/rowexec/sample_aggregator.go, line 70 at r1 (raw file):
// The sample aggregator tracks sketches and reservoirs for inverted // index keys. These maps are indexed by column ID. invSr map[uint32]*stats.SampleReservoir
Why not use a slice instead of a map (here and below), similar to the case for non-inverted columns?
If you leave it as a map, though, shouldn't it be keyed by a sqlbase.ColumnID?
pkg/sql/rowexec/sample_aggregator.go, line 148 at r1 (raw file):
var sr stats.SampleReservoir // The datums are converted to their inverted index bytes and // sent as single DBytes column.
[nit] as single -> as a single
pkg/sql/rowexec/sample_aggregator.go, line 275 at r1 (raw file):
} // There are four kinds of rows normal and inverted samples and
This sentence is missing a word or two...
pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):
if rank, err := row[s.rankCol].GetInt(); err == nil { // Inverted sample row. if err != nil {
err == nil was just checked above....
pkg/sql/rowexec/sample_aggregator.go, line 289 at r1 (raw file):
} // Retain the rows with the top ranks. if err := s.invSr[colIdx].SampleRow(ctx, s.EvalCtx, row[s.invIdxKeyCol:s.invIdxKeyCol+1], uint64(rank)); err != nil {
[nit] long line
pkg/sql/rowexec/sample_aggregator.go, line 301 at r1 (raw file):
} // Inverted sketch row. invSketch, ok := s.invSketch[colIdx]
What's happening to all this inverted sketch data? Looks like it's not getting stored anywhere. I'm wondering if it's worth adding a boolean column to the system.table_statistics table to indicate whether or not the stat is inverted. Then the histogram should also be associated with the inverted stat, and not the normal stat.
I know altering a system table is kind of a pain, but it seems to me like it might be worth it in this case. cc @RaduBerinde for your opinion.
pkg/sql/rowexec/sample_aggregator.go, line 332 at r1 (raw file):
return false, errors.NewAssertionErrorWithWrappedErrf(err, "merging sketch data") } continue
Seems like there's a lot of duplication between the inverted and regular cases -- would be great if you could extract the common bits into a couple of helper functions.
pkg/sql/rowexec/sample_aggregator.go, line 336 at r1 (raw file):
if rank, err := row[s.rankCol].GetInt(); err == nil { // Sample row. if err != nil {
err == nil was just checked above.
pkg/sql/rowexec/sample_aggregator.go, line 447 at r1 (raw file):
s.EvalCtx, invSr.Get(), 0, /* colIdx */
why is this 0? Don't we want the inverted column ID here?
pkg/sql/rowexec/sampler.go, line 189 at r1 (raw file):
outTypes = append(outTypes, types.Bytes) // An INT column indicating the inverted sketch index.
here you're saying sketch index, but in other places you're saying inverted column ID...
pkg/sql/rowexec/sampler.go, line 246 at r1 (raw file):
// Inverted index variables for EncodeInvertedIndexKeys. invIndexDesc := &sqlbase.IndexDescriptor{
If the column has an inverted index, you should probably use the real descriptor (feel free to add a TODO if you don't want to deal with it now).
pkg/sql/rowexec/sampler.go, line 386 at r1 (raw file):
if !emitHelper(ctx, &s.Out, nil /* row */, meta, s.pushTrailingMeta, s.input) { return true, nil }
Feels like this stuff could go into a helper function to reduce duplication a bit...
pkg/sql/rowexec/sampler.go, line 456 at r1 (raw file):
if !emitHelper(ctx, &s.Out, outRow, nil /* meta */, s.pushTrailingMeta, s.input) { return true, nil }
another place that could benefit from a helper function...
pkg/sql/rowexec/sampler.go, line 478 at r1 (raw file):
func (s *sketchInfo) addRow( row sqlbase.EncDatumRow, typs []*types.T, buf *[]byte, da *sqlbase.DatumAlloc,
why not just pass buf as a []byte?
RaduBerinde
left a comment
There was a problem hiding this comment.
Only looked at the .proto changes so far - we need to explain all the changes to the "protocol" there; it should be clear how the processors communicate from just reading the protos.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mjibson)
pkg/sql/execinfrapb/processors_table_stats.proto, line 73 at r1 (raw file):
// - an INT column with the "rank" of the row; this is a random value // associated with the row (necessary for combining sample sets). // 2. sketch columns:
Some explanation related to inverted_sketches is needed here.. We are adding two new "sketch columns" right?
pkg/sql/execinfrapb/processors_table_stats.proto, line 97 at r1 (raw file):
optional double max_fraction_idle = 3 [(gogoproto.nullable) = false]; repeated SketchSpec inverted_sketches = 4 [(gogoproto.nullable) = false];
[nit] I'd put this next to sketches even if the ids are not in order
pkg/sql/execinfrapb/processors_table_stats.proto, line 148 at r1 (raw file):
optional uint64 rows_expected = 7 [(gogoproto.nullable) = false]; repeated SketchSpec inverted_sketches = 8 [(gogoproto.nullable) = false];
[nit] I'd put this next to sketches even if the ids are not in order
madelynnblue
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @RaduBerinde, and @rytaft)
pkg/sql/distsql_plan_stats.go, line 132 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I feel like this stuff should happen in
create_stats.go, and maybe only if an inverted index exists on the given column.
Done.
pkg/sql/logictest/testdata/logic_test/distsql_stats, line 581 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
This comment needs an update
Done.
pkg/sql/rowexec/sample_aggregator.go, line 70 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Why not use a slice instead of a map (here and below), similar to the case for non-inverted columns?
If you leave it as a map, though, shouldn't it be keyed by a
sqlbase.ColumnID?
I looked again and this is actually an index, not a ColumnID. I've fixed the comments to correctly describe it.
I believe it's easier as a map because that allows for easy iteration through the needed columns when computing the inverted keys for each row.
pkg/sql/rowexec/sample_aggregator.go, line 148 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
[nit] as single -> as a single
Done.
pkg/sql/rowexec/sample_aggregator.go, line 275 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
This sentence is missing a word or two...
Done.
pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
err == nilwas just checked above....
Done.
pkg/sql/rowexec/sample_aggregator.go, line 289 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
[nit] long line
Done.
pkg/sql/rowexec/sample_aggregator.go, line 301 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
What's happening to all this inverted sketch data? Looks like it's not getting stored anywhere. I'm wondering if it's worth adding a boolean column to the
system.table_statisticstable to indicate whether or not the stat is inverted. Then the histogram should also be associated with the inverted stat, and not the normal stat.I know altering a system table is kind of a pain, but it seems to me like it might be worth it in this case. cc @RaduBerinde for your opinion.
The inverted sketch data is used below in the writeResults function. We need it when generating histograms for distinct, null, and total row counts.
I'm not going to merge this PR until I understand more of what the optimizer needs during costing and stats building of inverted index stats, with or without an index. That should help inform if the full inverted sketch also needs to be saved.
pkg/sql/rowexec/sample_aggregator.go, line 332 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Seems like there's a lot of duplication between the inverted and regular cases -- would be great if you could extract the common bits into a couple of helper functions.
Done.
pkg/sql/rowexec/sample_aggregator.go, line 336 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
err == nilwas just checked above.
Done.
pkg/sql/rowexec/sample_aggregator.go, line 447 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
why is this 0? Don't we want the inverted column ID here?
We want the column index of the passed sample row of inverted index keys, which only has a single bytes column. Added a comment to explain.
pkg/sql/rowexec/sampler.go, line 189 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
here you're saying sketch index, but in other places you're saying inverted column ID...
Done.
pkg/sql/rowexec/sampler.go, line 246 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
If the column has an inverted index, you should probably use the real descriptor (feel free to add a TODO if you don't want to deal with it now).
Added a TODO.
pkg/sql/rowexec/sampler.go, line 386 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Feels like this stuff could go into a helper function to reduce duplication a bit...
Done.
pkg/sql/rowexec/sampler.go, line 456 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
another place that could benefit from a helper function...
Done.
pkg/sql/rowexec/sampler.go, line 478 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
why not just pass
bufas a[]byte?
Because when the passed buf gets enlarged it will allocate a new portion of memory that the calling function won't know about and thus be unable to reuse.
pkg/sql/opt/exec/execbuilder/testdata/inv_stats, line 1 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I think this file should go in
logictests(execbuildertests are sort of reserved for explain plan output, trace output, etc).
Done.
rytaft
left a comment
There was a problem hiding this comment.
Reviewed 10 of 10 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @mjibson)
pkg/sql/execinfrapb/processors_table_stats.proto, line 60 at r2 (raw file):
// sample_size sampled rows. // // For each column with an inverted index, a sketch and sampler are
sampler -> sample reservoir?
pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):
Previously, mjibson (Matt Jibson) wrote…
Done.
Maybe this error should now be thrown in an else clause?
pkg/sql/rowexec/sample_aggregator.go, line 301 at r1 (raw file):
Previously, mjibson (Matt Jibson) wrote…
The inverted sketch data is used below in the writeResults function. We need it when generating histograms for distinct, null, and total row counts.
I'm not going to merge this PR until I understand more of what the optimizer needs during costing and stats building of inverted index stats, with or without an index. That should help inform if the full inverted sketch also needs to be saved.
Ok, sounds good
pkg/sql/rowexec/sampler.go, line 189 at r1 (raw file):
Previously, mjibson (Matt Jibson) wrote…
Done.
I'm a bit confused by what "row column" means
pkg/sql/rowexec/sampler.go, line 456 at r1 (raw file):
Previously, mjibson (Matt Jibson) wrote…
Done.
I don't see this one -- maybe add a function for sketchRow?
madelynnblue
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rytaft)
pkg/sql/execinfrapb/processors_table_stats.proto, line 60 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
sampler -> sample reservoir?
Done.
pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Maybe this error should now be thrown in an else clause?
Which error? The ones for invColIdx and rankCol are expected and not actual errors, we're using errors to see if the column was not-null, but it's also fine if they are null.
pkg/sql/rowexec/sampler.go, line 189 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I'm a bit confused by what "row column" means
Reworded.
pkg/sql/rowexec/sampler.go, line 456 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I don't see this one -- maybe add a function for
sketchRow?
Oops, misread your comment. Yes, now done.
rytaft
left a comment
There was a problem hiding this comment.
Reviewed 6 of 6 files at r3.
Reviewable status:complete! 1 of 0 LGTMs obtained
pkg/sql/rowexec/sample_aggregator.go, line 285 at r1 (raw file):
Previously, mjibson (Matt Jibson) wrote…
Which error? The ones for invColIdx and rankCol are expected and not actual errors, we're using errors to see if the column was not-null, but it's also fine if they are null.
Ah never mind -- I was confused.
RaduBerinde
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @mjibson)
pkg/sql/execinfrapb/processors_table_stats.proto, line 84 at r3 (raw file):
// - a BYTES column with the binary sketch data (format // dependent on the sketch type). // 3. inverted sampled row columns:
I'm confused. This comment sounds like we are adding 7 columns, but we are only adding 2, no? We should describe the schema first (the columns that are produced), and then we can list which columns are used for each of the four cases
pkg/sql/rowexec/sample_aggregator.go, line 301 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Ok, sounds good
We can add a column, it shouldn't be a problem. We can also use "special" columnIDs to indicate inverted stuff (e.g. negative values).
pkg/sql/rowexec/sample_aggregator.go, line 329 at r3 (raw file):
} func (s *sampleAggregator) sketchRow(
[nit] processSketchRow
pkg/sql/rowexec/sampler.go, line 164 at r3 (raw file):
s.sr.Init(int(spec.SampleSize), inTypes, &s.memAcc, sampleCols) outTypes := make([]*types.T, 0, len(inTypes)+5)
- 7
pkg/sql/rowexec/sampler.go, line 357 at r3 (raw file):
} for _, key := range invKeys { invRow[0].Datum = tree.NewDBytes(tree.DBytes(key))
Should use da.NewDBytes
pkg/sql/rowexec/sampler.go, line 432 at r3 (raw file):
} func (s *samplerProcessor) sketchRow(
[nit] emitSketchRow
pkg/sql/rowexec/sampler.go, line 448 at r3 (raw file):
} func (s *samplerProcessor) sampleRow(
[nit] could use a comment (can be similar to sr.SampleRow)
pkg/sql/rowexec/sampler.go, line 481 at r3 (raw file):
} func (s *sketchInfo) addRow(
[nit] add comment
Support inverted index keys by adding some new fields to the samplers. They now have maps from column IDs to sketches and sample reservoirs that allow each inverted index to sketch and sample individual keys from the inverted index. Two new row types are now being sent to the aggregator to distribute this new data. This design keeps the existing and single table reader and computes the index entries from each datum, requiring it to have new forms of communication (i.e., the 2 new columns and 2 new row types) over the sampler -> aggregator connection. We also considered a design where we read the PK + each inverted index. Our assumption was that in the tradeoff between using extra CPU to compute each inverted key and using extra disk to read each inverted index, using more CPU was perhaps faster overall. Fixes #48219 Release note: None
madelynnblue
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @RaduBerinde and @rytaft)
pkg/sql/execinfrapb/processors_table_stats.proto, line 84 at r3 (raw file):
Previously, RaduBerinde wrote…
I'm confused. This comment sounds like we are adding 7 columns, but we are only adding 2, no? We should describe the schema first (the columns that are produced), and then we can list which columns are used for each of the four cases
Yes. I kind of combined columns and row here. I'll split them out.
|
bors r+ |
|
I said early I wouldn't merge this yet, but I want to do some other work and think it'll be better than trying to keep this up to date. |
Build succeeded |
Support inverted index keys by adding some new fields to the
samplers. They now have maps from column IDs to sketches and sample
reservoirs that allow each inverted index to sketch and sample individual
keys from the inverted index. Two new row types are now being sent to
the aggregator to distribute this new data.
This design keeps the existing and single table reader and computes
the index entries from each datum, requiring it to have new forms of
communication (i.e., the 2 new columns and 2 new row types) over the
sampler -> aggregator connection. We also considered a design where we
read the PK + each inverted index. Our assumption was that in the tradeoff
between using extra CPU to compute each inverted key and using extra disk
to read each inverted index, using more CPU was perhaps faster overall.
Fixes #48219
Release note: None