colfetcher: produce batches only of needed columns by yuzefovich · Pull Request #71441 · cockroachdb/cockroach

yuzefovich · 2021-10-12T03:48:39Z

colbuilder: reuse the SemaCtx

Also inline a single function in the flow diagram.

Release note: None

colfetcher: produce batches only of needed columns

This commit updates the cFetcher to operate only on the needed columns.
Previously, it would get all columns present in the whole table and
create coldata.Batches for each of the columns, even when only
a subset of columns is needed (or even available if the index is not
covering).

For example, imagine we have a table like

t (a INT PRIMARY KEY, b INT, c INT, d INT, INDEX b_idx (b))

and we have a query like SELECT b FROM t@b_idx. Previously, we would
create a batch with 4 int64 columns with only 1 (for column b) being
actually populated and all others marked as "not needed" and set to all
NULL values.

This is suboptimal, and this commit refactors things so that a batch
with only a single vector is created. This is achieved in two steps:

first, we populate the slice of column descriptors that are accessible
from the index. These are all columns of the table for covering indexes.
For non-covering secondary indexes we only keep columns present in the
index
next, we examine the set of columns that need to be fetched and prune
all not needed columns away.

Note that we are always careful to update the post-processing spec
accordingly so that the spec always refers to correct new ordinals.

It's worth pointing out that since we're modifying the spec directly, we
had to introduce some special logic to keep the original state of the
PostProcessSpec in order for the flow diagrams to refer to the
original columns.

Release note: None

cockroach-teamcity · 2021-10-12T03:48:47Z

This change is

yuzefovich · 2021-10-12T22:42:12Z

Benchmarks of the hot path of the last two commits:

name                                          old time/op    new time/op    delta
FlowSetup/vectorize=true/distribute=true-24      252µs ± 4%     251µs ± 4%    ~     (p=0.792 n=29+30)
FlowSetup/vectorize=true/distribute=false-24     251µs ± 5%     248µs ± 4%  -1.31%  (p=0.008 n=30+30)

name                                          old alloc/op   new alloc/op   delta
FlowSetup/vectorize=true/distribute=true-24     36.6kB ± 2%    36.4kB ± 3%  -0.37%  (p=0.002 n=25+26)
FlowSetup/vectorize=true/distribute=false-24    34.6kB ± 1%    34.4kB ± 1%  -0.44%  (p=0.000 n=25+25)

name                                          old allocs/op  new allocs/op  delta
FlowSetup/vectorize=true/distribute=true-24        303 ± 0%       294 ± 0%  -2.98%  (p=0.000 n=25+25)
FlowSetup/vectorize=true/distribute=false-24       291 ± 2%       282 ± 1%  -3.15%  (p=0.000 n=29+28)

yuzefovich · 2021-10-22T18:36:56Z

Rebased on top of master, RFAL.

rharding6373

Reviewed 6 of 19 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, @rharding6373, and @yuzefovich)

pkg/sql/colfetcher/cfetcher_setup.go, line 88 at r10 (raw file):

[0, 0, 1, 0].

Could you elaborate on how/why this idxMap was reached in this example? After walking through the code a couple times I'm still unsure why the idxMap isn't [1, 0, 0, 0] and why it's ok for indexes 2 and 4 to be 0 in the idxMap.

pkg/sql/colfetcher/cfetcher_setup.go, line 125 at r10 (raw file):

var idxMap []int

Could make this a named return value.

pkg/sql/colfetcher/cfetcher_setup.go, line 218 at r10 (raw file):

		neededColIdx := 0
		for idx, ok := neededColumnsSet.Next(0); ok; idx, ok = neededColumnsSet.Next(idx + 1) {
			idxMap[idx] = neededColIdx

This seems like duplicate work if idxMap was non-nil. Will idxMap change if it's already populated?

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)

pkg/sql/colfetcher/cfetcher_setup.go, line 88 at r10 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

[0, 0, 1, 0].
Could you elaborate on how/why this idxMap was reached in this example? After walking through the code a couple times I'm still unsure why the idxMap isn't [1, 0, 0, 0] and why it's ok for indexes 2 and 4 to be 0 in the idxMap.

Expanded the comment. Values for indexed 2 and 4 are undefined (they should never be used anyway) because these columns are not present in the index. Added this to the comment as well.

pkg/sql/colfetcher/cfetcher_setup.go, line 125 at r10 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

var idxMap []int
Could make this a named return value.

Done.

pkg/sql/colfetcher/cfetcher_setup.go, line 218 at r10 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

This seems like duplicate work if idxMap was non-nil. Will idxMap change if it's already populated?

Yes, it'll change and we do want for it to change. In a sense, we're applying a projection on top another projection. I expanded the comment here as well as added an example. Let me know if things now make sense.

Also inline a single function in the flow diagram. Release note: None

yuzefovich · 2021-10-26T22:46:59Z

Rebased on top of master. @rharding6373 I think you should be the main reviewer of this change.

rharding6373

the flow diagrams for EXPLAIN ANALYZE will refer to the remapped
ordinals rather than "absolute" ordinals among all columns in the table
This potentially will add confusion when debugging, no? Is there a method for people to find the mapping in the logs or elsewhere that can be added to the commit/PR message? If not, could we log the remapping?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, @rharding6373, and @yuzefovich)

pkg/sql/colfetcher/cfetcher_setup.go, line 223 at r12 (raw file):

		//
		// If non-nil idxMap was passed into this method, we have to update it
		// by essentially applying a projection on top of already present

nit: s/already/the already

pkg/sql/colfetcher/cfetcher_setup.go, line 237 at r12 (raw file):

		// update the index map to be
		//   idxMap = [x, 0, x, x]
		// and then to remap the post-processing spec below so that it refers to

nit: s/to remap/remap

yuzefovich

Just want to call out the impact. Before this change Out: @1 as in the diagram below

always referred to the column with ordinal 1 in the whole table, but after this change in order to understand which column @1 is referring to, one would need to understand which columns are actually needed to be produced by the TableReader, so yeah, it can definitely lead to some confusion.

However, I think that we rarely if ever look at the concrete ordinals in that Out line - personally, I can't recall a single time where I needed to get that information, on the EXPLAIN (DISTSQL) diagrams I usually quickly glance only on the number of columns produced, and then if I want to understand which columns are propagated by each stage of processors, I'll look at another flavor of EXPLAIN. Assuming we're debugging with a stmt bundle, we'll have the correct information in other places.

In terms of exposing this remapping somewhere - I doubt anyone but engineers on SQL Queries will even think about using it, so I'd rather not tell users about it.

Still, let me take another stab at fixing this.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)

This commit updates the cFetcher to operate only on the needed columns. Previously, it would get all columns present in the whole table and create `coldata.Batch`es for each of the columns, even when only a subset of columns is needed (or even available if the index is not covering). For example, imagine we have a table like ``` t (a INT PRIMARY KEY, b INT, c INT, d INT, INDEX b_idx (b)) ``` and we have a query like `SELECT b FROM t@b_idx`. Previously, we would create a batch with 4 `int64` columns with only 1 (for column `b`) being actually populated and all others marked as "not needed" and set to all NULL values. This is suboptimal, and this commit refactors things so that a batch with only a single vector is created. This is achieved in two steps: - first, we populate the slice of column descriptors that are accessible from the index. These are all columns of the table for covering indexes. For non-covering secondary indexes we only keep columns present in the index - next, we examine the set of columns that need to be fetched and prune all not needed columns away. Note that we are always careful to update the post-processing spec accordingly so that the spec always refers to correct new ordinals. It's worth pointing out that since we're modifying the spec directly, we had to introduce some special logic to keep the original state of the `PostProcessSpec` in order for the flow diagrams to refer to the original columns. Release note: None

yuzefovich

Ok, I think I figured out a way to keep the flow diagrams unchanged.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)

rharding6373

Thanks!

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)

yuzefovich · 2021-10-27T15:39:59Z

TFTR!

bors r+

craig · 2021-10-27T16:58:43Z

Build succeeded:

GitHub CI (Cockroach)

yuzefovich force-pushed the cfetcher-only-needed-columns branch 6 times, most recently from e927cad to 4dc24de Compare October 12, 2021 22:36

yuzefovich force-pushed the cfetcher-only-needed-columns branch 3 times, most recently from d2e706e to fabc404 Compare October 14, 2021 21:03

yuzefovich marked this pull request as ready for review October 14, 2021 21:05

yuzefovich requested a review from a team as a code owner October 14, 2021 21:05

This comment has been minimized.

Sign in to view

yuzefovich requested review from a team, cucaroach and mgartner October 14, 2021 21:06

yuzefovich force-pushed the cfetcher-only-needed-columns branch from fabc404 to 2d4149d Compare October 21, 2021 18:47

yuzefovich requested a review from rharding6373 October 21, 2021 19:01

yuzefovich force-pushed the cfetcher-only-needed-columns branch from 2d4149d to 5715d2e Compare October 22, 2021 18:36

rharding6373 requested changes Oct 22, 2021

View reviewed changes

yuzefovich force-pushed the cfetcher-only-needed-columns branch from 5715d2e to 4eda4aa Compare October 23, 2021 00:25

yuzefovich commented Oct 23, 2021

View reviewed changes

colbuilder: reuse the SemaCtx

61f736a

Also inline a single function in the flow diagram. Release note: None

yuzefovich force-pushed the cfetcher-only-needed-columns branch from 4eda4aa to c5e4ea8 Compare October 26, 2021 22:44

rharding6373 reviewed Oct 26, 2021

View reviewed changes

yuzefovich force-pushed the cfetcher-only-needed-columns branch from c5e4ea8 to 312927d Compare October 26, 2021 23:20

yuzefovich commented Oct 26, 2021

View reviewed changes

yuzefovich force-pushed the cfetcher-only-needed-columns branch from 312927d to 8e40f1c Compare October 27, 2021 00:48

yuzefovich commented Oct 27, 2021

View reviewed changes

rharding6373 approved these changes Oct 27, 2021

View reviewed changes

craig bot merged commit df094a1 into cockroachdb:master Oct 27, 2021

yuzefovich deleted the cfetcher-only-needed-columns branch October 27, 2021 18:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

colfetcher: produce batches only of needed columns#71441

colfetcher: produce batches only of needed columns#71441
craig[bot] merged 2 commits intocockroachdb:masterfrom
yuzefovich:cfetcher-only-needed-columns

yuzefovich commented Oct 12, 2021 •

edited

Loading

Uh oh!

cockroach-teamcity commented Oct 12, 2021

Uh oh!

yuzefovich commented Oct 12, 2021 •

edited

Loading

Uh oh!

This comment has been minimized.

yuzefovich commented Oct 22, 2021

Uh oh!

rharding6373 left a comment

Uh oh!

yuzefovich left a comment

Uh oh!

yuzefovich commented Oct 26, 2021

Uh oh!

rharding6373 left a comment

Uh oh!

yuzefovich left a comment

Uh oh!

yuzefovich left a comment

Uh oh!

rharding6373 left a comment

Uh oh!

yuzefovich commented Oct 27, 2021

Uh oh!

craig bot commented Oct 27, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yuzefovich commented Oct 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Oct 12, 2021

Uh oh!

yuzefovich commented Oct 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment has been minimized.

yuzefovich commented Oct 22, 2021

Uh oh!

rharding6373 left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich commented Oct 26, 2021

Uh oh!

rharding6373 left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

rharding6373 left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich commented Oct 27, 2021

Uh oh!

craig bot commented Oct 27, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yuzefovich commented Oct 12, 2021 •

edited

Loading

yuzefovich commented Oct 12, 2021 •

edited

Loading