sql: enable simple stats for all types#49801

Merged
craig[bot] merged 1 commit into cockroachdb:master from madelynnblue:inv-stats
Jun 9, 2020
Conversation

@madelynnblue
Contributor

Improve datum fingerprinting (used by rowexec/sampler.go) to work
correctly for all types. Previously, due to the hardcoded assumption that
JSON was the only non-key-encodable type, it would error when presented
with a geo/geom type.

Use MustBeValueEncoded when fingerprinting and when determining whether a
type can have a histogram created. This should be future-proof if we add
new types or teach existing types how to key encode.

Fixes #35844
Informs #48219

Release note: None

@madelynnblue madelynnblue requested review from rytaft and yuzefovich June 2, 2020 16:28
@madelynnblue madelynnblue requested a review from a team as a code owner June 2, 2020 16:28
@cockroach-teamcity
Member

This change is Reviewable

@madelynnblue
Contributor Author

@yuzefovich can you review the change to Fingerprint?

Member

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @rytaft, and @yuzefovich)


pkg/sql/sqlbase/encoded_datum.go, line 291 at r1 (raw file):

	// case uses ed.Encode, which has a fast path if the encoded bytes are already
	// the right encoding.
	if MustBeValueEncoded(typ) {

I think @rohany made some recent changes here so that ArrayFamily would use ed.Encode method below although it "must be value encoded".

@rohany
Contributor

rohany commented Jun 2, 2020

I think @rohany made some recent changes here so that ArrayFamily would use ed.Encode method below although it "must be value encoded".

Nah, arrays have key encodings now.

Member

@yuzefovich yuzefovich left a comment


Reviewed 1 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @rytaft)


pkg/sql/sqlbase/encoded_datum.go, line 291 at r1 (raw file):

Previously, yuzefovich wrote…

I think @rohany made some recent changes here so that ArrayFamily would use ed.Encode method below although it "must be value encoded".

So I guess this change looks good, right Rohan?

@rohany
Contributor

rohany commented Jun 2, 2020

yeah

Contributor Author

@madelynnblue madelynnblue left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rohany and @rytaft)


pkg/sql/sqlbase/encoded_datum.go, line 291 at r1 (raw file):

Previously, yuzefovich wrote…

So I guess this change looks good, right Rohan?

Arrays now only must value encode if their contents must:

func MustBeValueEncoded(semanticType *types.T) bool {

@madelynnblue
Contributor Author

madelynnblue commented Jun 2, 2020

Ok sql exec people. Here's a stumper. The Fingerprint change appears to have broken a few tests.

SELECT * FROM kv WHERE (k,v) IN (SELECT * FROM kv)

is now returning 0 rows only in the fakedist-vec-auto-disk configuration.

query error couldn't find WITH expression \"new_values\" with ID 1

no longer produces an error but is reported as success in the fakedist-vec-auto-disk and fakedist-disk configurations. Any idea why that might happen? It is possible the Fingerprint change has exposed some places with incorrect assumptions about encodings.

For example, sqlbase.MustBeValueEncoded says that tuples must be value encoded:

case types.JsonFamily, types.TupleFamily, types.GeographyFamily, types.GeometryFamily:

and yet the above query uses the tuple key encoding in https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/sqlbase/column_type_encoding.go#L148 via the disk-backed hash joiner, even though tuples don't have a corresponding decode method. Their key encode method is also highly suspect. (I discovered this in a separate PR in which I was trying to make sense of MustBeValueEncoded, EncodeTableKey, and DecodeTableKey, none of which agree on what must be key encoded; see https://github.com/mjibson/cockroach/tree/enc-tests for details. It surprisingly showed up here too, so it looks like my investigation now needs to be completed instead of abandoned.)

Questions:

  1. Why is the tuple key encoding used in the disk hash joiner if MustBeValueEncoded returns true for tuples?
  2. Why does the tuple key encoding exist since it has no corresponding decode method?
  3. Why did the two above tests change for certain configurations due to the Fingerprint change?

@rohany
Contributor

rohany commented Jun 2, 2020

I think there is unfortunately some slight overloading of how MustBeValueEncoded is used. It seems to mostly answer "can this type be indexed?" rather than "can we actually call EncodeTableKey on this type?". We have added key encodings for types we would never actually index, for use in various key-encoding needs during execution (group by, order by); in those cases the key encoding is used purely as a hash, since some of this code was written before we had Fingerprint. I think this answers 1 and 2; I don't know about 3.

@madelynnblue
Contributor Author

Ok so maybe...we can use Fingerprint in more places now? I'll audit uses of EncodeTableKey.

@rohany
Contributor

rohany commented Jun 2, 2020

That sounds reasonable — encdatum.encode is probably useful to look at too. Most of the calls come from there.

@yuzefovich
Member

I think I see the issue: in DiskRowContainer.encodeRow we use Encode (on the right side of the hash join), but when we're probing in hashDiskRowBucketIterator.Reset we're calling encodeEqualityCols which under the hood is using Fingerprint.

Collaborator

@rytaft rytaft left a comment


stats changes :lgtm:

Reviewed 2 of 4 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mjibson and @rohany)


pkg/sql/opt/exec/execbuilder/testdata/stats, line 265 at r2 (raw file):


query T
EXPLAIN (OPT, VERBOSE) SELECT DISTINCT j FROM tj WHERE j IS NULL

I wonder if this test is going to be flaky... if auto stats are collected before this runs, we'll get different results. Maybe add this to the top of this file (since I think other tests in this file could also have a problem):

statement ok
SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false

(I think auto stats are currently disabled on logic tests, but if we enable them we still want this to work)

@yuzefovich
Member

This diff fixes the issue with row containers:

diff --git a/pkg/sql/rowcontainer/hash_row_container.go b/pkg/sql/rowcontainer/hash_row_container.go
index 93fe65c1d2..b8ad6bfe95 100644
--- a/pkg/sql/rowcontainer/hash_row_container.go
+++ b/pkg/sql/rowcontainer/hash_row_container.go
@@ -121,7 +121,12 @@ func encodeColumnsOfRow(
                if row[colIdx].IsNull() && !encodeNull {
                        return nil, true, nil
                }
-               appendTo, err = row[colIdx].Fingerprint(colTypes[i], da, appendTo)
+               // Note: we cannot compare VALUE encodings because they contain column IDs
+               // which can vary.
+               // TODO(radu): we should figure out what encoding is readily available and
+               // use that (though it needs to be consistent across all rows). We could add
+               // functionality to compare VALUE encodings ignoring the column ID.
+               appendTo, err = row[colIdx].Encode(colTypes[i], da, sqlbase.DatumEncoding_ASCENDING_KEY, appendTo)
                if err != nil {
                        return appendTo, false, err
                }
@@ -422,7 +427,7 @@ func (i *hashMemRowIterator) computeKey() error {
        i.curKey = i.curKey[:0]
        for _, col := range i.storedEqCols {
                var err error
-               i.curKey, err = row[col].Fingerprint(i.types[col], &i.columnEncoder.datumAlloc, i.curKey)
+               i.curKey, err = row[col].Encode(i.types[col], &i.columnEncoder.datumAlloc, sqlbase.DatumEncoding_ASCENDING_KEY, i.curKey)
                if err != nil {
                        return err
                }

The explanation is that HashDiskRowContainer is implemented using DiskRowContainer, with the equality columns (i.e. the columns to hash) of the former being the ordering columns for the latter, and those ordering columns are used to compute the keys of the rows (in encodeRow) so that we can store the rows in sorted order. This is how we store the build (right) side of the join, but for the probe (left) side we use hashMemRowIterator to compute the key of the probing row. The key computation methods must be the same in both places; otherwise, the results of the join can be incorrect.

#45229 broke this synchronization by changing the key computation method in hashMemRowIterator.computeKey to use Fingerprint. So we have to either use Fingerprint in encodeRow or use Encode in computeKey. The first choice doesn't seem to work because Fingerprint doesn't provide the ordering we need in DiskRowContainer, so we need to use the second approach. @rohany does this sound reasonable?

@rohany
Contributor

rohany commented Jun 3, 2020

That seems correct, but we will then error out for types that we can't key encode. We can't know in advance whether we are going to spill to disk here either.

Why does the disk row container necessarily need ordering properties of the columns?

@yuzefovich
Member

yuzefovich commented Jun 3, 2020

Why does the disk row container necessarily need ordering properties of the columns?

DiskRowContainer implements "hash row container" by sorting all rows on the ordering (i.e. hash) columns and using the ordering property to provide the "hashing" behavior (i.e. we would seek to the first row that has the same hash columns and then iterate from that row one row at a time forward until the hash columns remain the same). If we don't have the ordering property, then the necessary invariant that all rows that hash to the same value are contiguous is not maintained.

Member

@yuzefovich yuzefovich left a comment


We merged the fix to the disk row container, so once you rebase, the build should be green.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mjibson and @rohany)

Contributor Author

@madelynnblue madelynnblue left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rohany and @rytaft)


pkg/sql/opt/exec/execbuilder/testdata/stats, line 265 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I wonder if this test is going to be flaky... if auto stats are collected before this runs, we'll get different results. Maybe add this to the top of this file (since I think other tests in this file could also have a problem):

statement ok
SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false

(I think auto stats are currently disabled on logic tests, but if we enable them we still want this to work)

Done.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment


:lgtm:

Reviewed 4 of 4 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rohany)

@madelynnblue
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 8, 2020

Merge conflict (retrying...)

@craig
Contributor

craig bot commented Jun 8, 2020

Build failed

@madelynnblue
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 9, 2020

Build failed (retrying...)

@craig
Contributor

craig bot commented Jun 9, 2020

Canceled

@madelynnblue
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 9, 2020

Build failed

@madelynnblue
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 9, 2020

Build succeeded

@craig craig bot merged commit 5fad258 into cockroachdb:master Jun 9, 2020
@madelynnblue madelynnblue deleted the inv-stats branch June 9, 2020 18:05
Successfully merging this pull request may close these issues.

opt: support creating and using statistics on JSON columns

5 participants