sql: use type bytes for virtual inverted columns by rytaft · Pull Request #58241 · cockroachdb/cockroach

rytaft · 2020-12-23T14:56:52Z

This commit changes the column type for virtual inverted columns when
constructing scanNodes and TableReaderSpecs so that the type matches the data
type actually stored in the index. Prior to this commit, the type corresponded
to the column being indexed (e.g., geometry), rather than the actual type of
the key column (e.g., bytes). This is needed in order to enable vectorized
execution with the invertedFilterer processor.
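
As a rough illustration of the type substitution described above (a minimal sketch, not the code in this PR): `colType`, `column`, and `resolveTypes` are hypothetical stand-ins for the real descriptor and type machinery. Each column keeps its table type unless a virtual column with the same ID supplies the type that is actually stored in the index.

```go
package main

import "fmt"

// colType is a hypothetical stand-in for a SQL column type.
type colType string

// column is a hypothetical stand-in for a column descriptor.
type column struct {
	ID   int
	Name string
	Type colType
}

// resolveTypes returns one type per entry in cols. When virtualCols contains
// a column with the same ID, its type replaces the table column's type.
func resolveTypes(cols []column, virtualCols []column) []colType {
	override := make(map[int]colType, len(virtualCols))
	for _, vc := range virtualCols {
		override[vc.ID] = vc.Type
	}
	out := make([]colType, len(cols))
	for i, c := range cols {
		if t, ok := override[c.ID]; ok {
			out[i] = t
		} else {
			out[i] = c.Type
		}
	}
	return out
}

func main() {
	cols := []column{
		{ID: 1, Name: "id", Type: "int"},
		{ID: 2, Name: "geom", Type: "geometry"},
	}
	// The inverted index stores encoded bytes for the geometry column.
	virtual := []column{{ID: 2, Name: "geom", Type: "bytes"}}
	fmt.Println(resolveTypes(cols, virtual))
}
```

With no virtual columns, the function returns the table types unchanged, which mirrors the behavior for ordinary (non-inverted) scans.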

Additionally, the column fetcher now treats virtual inverted columns
differently and dumps the data directly into a DBytes rather than attempting
to decode it.
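
The fetcher behavior can be pictured with a toy decoder (a sketch under stated assumptions, not the cFetcher code): `datum`, `dBytes`, and `decodeKeyColumn` are hypothetical names, and the "ordinary" column encoding here is a made-up decimal string chosen only to keep the example runnable.

```go
package main

import (
	"fmt"
	"strconv"
)

// datum is a hypothetical stand-in for a decoded SQL value.
type datum interface{}

// dBytes is a hypothetical stand-in for a bytes datum.
type dBytes string

// decodeKeyColumn decodes one key column. Virtual inverted columns are not
// decoded at all: the raw key bytes are wrapped in a dBytes and returned.
func decodeKeyColumn(raw []byte, isVirtualInverted bool) (datum, error) {
	if isVirtualInverted {
		return dBytes(raw), nil
	}
	// Assumption for illustration only: ordinary columns are decimal integers.
	n, err := strconv.Atoi(string(raw))
	if err != nil {
		return nil, err
	}
	return n, nil
}

func main() {
	d, _ := decodeKeyColumn([]byte("\x12\x34"), true)
	fmt.Printf("%T %q\n", d, d)
	d, _ = decodeKeyColumn([]byte("42"), false)
	fmt.Printf("%T %v\n", d, d)
}
```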

Fixes #50695

Release note (performance improvement): Queries that use a geospatial inverted
index can now take advantage of vectorized execution for some parts of the
query plan, resulting in improved performance.

rytaft · 2020-12-23T15:06:37Z

This PR is the result of rebasing #53202 and making a few tweaks:

- Now the type of all inverted virtual columns is Bytes.
- The cFetcher and index decoding functions are now aware of virtual inverted columns, and will dump the output directly into a DBytes instead of trying to decode it.

In order to get this working quickly, I added a hack for zig zag joins (it's called out in distsql_physical_planner.go). We'll need to do a bigger refactor of zig zag joins to get this to work without the hack, although I think we were going to need to do that anyway to support multi-column inverted indexes and integrate with the other work we've been doing in the optimizer to improve inverted index support. If this PR looks like the right approach, then I'll do the zig zag join refactor as a second commit in this PR.

yuzefovich

Looks good to me, thanks for working on this!

Reviewed 26 of 26 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @RaduBerinde, and @rytaft)

pkg/sql/distsql_spec_exec_factory.go, line 240 at r1 (raw file):

		NeededColumns:    colCfg.wantedColumnsOrdinals,
	}
	if colCfg.virtualColumns != nil {

nit: this code is duplicated with distsql_physical_planner.

pkg/sql/catalog/colinfo/col_type_info.go, line 172 at r1 (raw file):

func GetColumnTypesFromColDescs(
	cols []descpb.ColumnDescriptor, columnIDs []descpb.ColumnID, outTypes []*types.T,
) ([]*types.T, error) {

nit: seems like err is always nil.

pkg/sql/rowexec/inverted_filterer.go, line 228 at r1 (raw file):

		// If the input is from the vectorized engine, the encoded bytes may be
		// empty.
		if row[ifr.invertedColIdx].Datum == nil {

Can we ever have a DNull value here?

pkg/sql/rowexec/rowfetcher.go, line 89 at r1 (raw file):

	}
	if virtualColumns != nil {
		tempCols := make([]descpb.ColumnDescriptor, len(cols), len(cols)+len(systemColumns))

nit: duplicated with colbatch_scan.go.

rytaft

TFTR, @yuzefovich! I'll work on adding the zig zag join commit now.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @RaduBerinde, and @yuzefovich)

pkg/sql/distsql_spec_exec_factory.go, line 240 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this code is duplicated with distsql_physical_planner.

Done.

pkg/sql/catalog/colinfo/col_type_info.go, line 172 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: seems like err is always nil.

Done.

pkg/sql/rowexec/inverted_filterer.go, line 228 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Can we ever have a DNull value here?

Shouldn't ever be DNull since nulls aren't stored in inverted indexes.

pkg/sql/rowexec/rowfetcher.go, line 89 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: duplicated with colbatch_scan.go.

Done.

rytaft

I've realized that the zigzag join refactor should be in a separate PR. I've found a way to remove the hack I added without needing to do the refactor. This is ready for review.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @RaduBerinde, and @yuzefovich)

sumeerbhola

Reviewed 7 of 26 files at r1, 7 of 9 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @RaduBerinde, @rytaft, and @yuzefovich)

pkg/sql/scan.go, line 119 at r2 (raw file):

	// virtualColumns maps a subset of wantedColumns that are virtual to the
	// column type actually stored in the index. For example, the key column

nit: how about "inverted column" instead of "key column" since we now have multi-column inverted indexes?

pkg/sql/execinfrapb/processors_sql.proto, line 150 at r2 (raw file):

  // stored in the table descriptor. For example, the key column in an inverted
  // index has a different type than the column it indexes in the base table.
  repeated sqlbase.ColumnDescriptor virtual_columns = 16;

wondering why this is a repeated field instead of being optional.

pkg/sql/rowexec/inverted_filterer.go, line 228 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Shouldn't ever be DNull since nulls aren't stored in inverted indexes.

Could you add a code comment that states this?

pkg/sql/rowexec/inverted_filterer.go, line 236 at r2 (raw file):

			return ifrStateUnknown, ifr.DrainHelper()
		}
		enc = []byte(*row[ifr.invertedColIdx].Datum.(*tree.DBytes))

This is a peculiar hack and I think deserves a longer comment above the if-block. Something like:
// NB: Inverted columns are custom encoded in a manner that does not correspond to Datum encoding, and in the code here we only want the encoded bytes. We have two possibilities with what the provider of this row has done:
// - not decoded the row: This is the len(enc) > 0 case.
// - decoded the row, but special-cased the inverted column by stuffing the encoded bytes into a "decoded" DBytes: This is the len(enc) == 0 case.
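
To make the two cases concrete, here is a small standalone sketch (hypothetical types, not the invertedFilterer code): `encDatum`, `dBytes`, and `invertedKeyBytes` are invented for illustration. The helper prefers the undecoded bytes and falls back to the bytes stuffed into a "decoded" DBytes.

```go
package main

import (
	"errors"
	"fmt"
)

// dBytes is a hypothetical stand-in for a bytes datum.
type dBytes string

// encDatum is a hypothetical stand-in for a row value that may carry
// undecoded bytes, a decoded datum, or both.
type encDatum struct {
	encoded []byte
	decoded interface{}
}

// invertedKeyBytes returns the encoded inverted-column bytes for d.
func invertedKeyBytes(d encDatum) ([]byte, error) {
	// Case 1: the provider did not decode the row.
	if len(d.encoded) > 0 {
		return d.encoded, nil
	}
	// Case 2: the provider "decoded" the column by placing the encoded
	// bytes directly into a dBytes.
	b, ok := d.decoded.(dBytes)
	if !ok {
		return nil, errors.New("inverted column is neither encoded nor a dBytes")
	}
	return []byte(b), nil
}

func main() {
	for _, d := range []encDatum{
		{encoded: []byte("raw")},
		{decoded: dBytes("stuffed")},
	} {
		b, err := invertedKeyBytes(d)
		fmt.Printf("%q %v\n", b, err)
	}
}
```

Returning an error for any other decoded type is a choice made in this sketch; whether the real processor should assert, error, or panic in that situation is not settled by this thread.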

pkg/sql/rowexec/inverted_joiner.go, line 463 at r2 (raw file):

		}
		idx := ij.colIdxMap.GetDefault(ij.invertedColID)
		encInvertedVal := scannedRow[idx].EncodedBytes()

If the rowFetcher used by invertedJoiner had a columnar implementation, we would need to make a similar change here, yes?

mgartner

Reviewed 20 of 26 files at r1, 9 of 9 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @RaduBerinde, @rytaft, and @yuzefovich)

jordanlewis

mod comments, thanks so much for taking care of this!

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @RaduBerinde, @rytaft, and @yuzefovich)

pkg/sql/catalog/tabledesc/structured.go, line 4019 at r2 (raw file):

// bool is true. If virtualCols is non-nil, substitutes the type of the virtual
// column instead of the table column with the same ID.
func (desc *Immutable) ColumnTypesWithMutationsAndVirtualCols(

cc @ajwerner @postamar do we need to be adding stuff like this to interfaces now too?

rytaft

TFTRs!

Based on @sumeerbhola's observation below, I think there is some unnecessary complexity in this implementation. I'm going to try a quick experiment to see if I can remove some of the complexity.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, @sumeerbhola, and @yuzefovich)

pkg/sql/scan.go, line 119 at r2 (raw file):