Skip to content

colfetcher: produce batches only of needed columns#71441

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
yuzefovich:cfetcher-only-needed-columns
Oct 27, 2021
Merged

colfetcher: produce batches only of needed columns#71441
craig[bot] merged 2 commits intocockroachdb:masterfrom
yuzefovich:cfetcher-only-needed-columns

Conversation

@yuzefovich
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich commented Oct 12, 2021

colbuilder: reuse the SemaCtx

Also inline a single function in the flow diagram.

Release note: None

colfetcher: produce batches only of needed columns

This commit updates the cFetcher to operate only on the needed columns.
Previously, it would get all columns present in the whole table and
create coldata.Batches for each of the columns, even when only
a subset of columns is needed (or even available if the index is not
covering).

For example, imagine we have a table like

t (a INT PRIMARY KEY, b INT, c INT, d INT, INDEX b_idx (b))

and we have a query like SELECT b FROM t@b_idx. Previously, we would
create a batch with 4 int64 columns with only 1 (for column b) being
actually populated and all others marked as "not needed" and set to all
NULL values.

This is suboptimal, and this commit refactors things so that a batch
with only a single vector is created. This is achieved in two steps:

  • first, we populate the slice of column descriptors that are accessible
    from the index. These are all columns of the table for covering indexes.
    For non-covering secondary indexes we only keep columns present in the
    index
  • next, we examine the set of columns that need to be fetched and prune
    all not needed columns away.

Note that we are always careful to update the post-processing spec
accordingly so that the spec always refers to correct new ordinals.

It's worth pointing out that since we're modifying the spec directly, we
had to introduce some special logic to keep the original state of the
PostProcessSpec in order for the flow diagrams to refer to the
original columns.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the cfetcher-only-needed-columns branch 6 times, most recently from e927cad to 4dc24de Compare October 12, 2021 22:36
@yuzefovich
Copy link
Copy Markdown
Member Author

yuzefovich commented Oct 12, 2021

Benchmarks of the hot path of the last two commits:

name                                          old time/op    new time/op    delta
FlowSetup/vectorize=true/distribute=true-24      252µs ± 4%     251µs ± 4%    ~     (p=0.792 n=29+30)
FlowSetup/vectorize=true/distribute=false-24     251µs ± 5%     248µs ± 4%  -1.31%  (p=0.008 n=30+30)

name                                          old alloc/op   new alloc/op   delta
FlowSetup/vectorize=true/distribute=true-24     36.6kB ± 2%    36.4kB ± 3%  -0.37%  (p=0.002 n=25+26)
FlowSetup/vectorize=true/distribute=false-24    34.6kB ± 1%    34.4kB ± 1%  -0.44%  (p=0.000 n=25+25)

name                                          old allocs/op  new allocs/op  delta
FlowSetup/vectorize=true/distribute=true-24        303 ± 0%       294 ± 0%  -2.98%  (p=0.000 n=25+25)
FlowSetup/vectorize=true/distribute=false-24       291 ± 2%       282 ± 1%  -3.15%  (p=0.000 n=29+28)

@yuzefovich yuzefovich force-pushed the cfetcher-only-needed-columns branch 3 times, most recently from d2e706e to fabc404 Compare October 14, 2021 21:03
@yuzefovich yuzefovich marked this pull request as ready for review October 14, 2021 21:05
@yuzefovich yuzefovich requested a review from a team as a code owner October 14, 2021 21:05
@yuzefovich

This comment has been minimized.

@yuzefovich yuzefovich requested review from a team, cucaroach and mgartner October 14, 2021 21:06
@yuzefovich yuzefovich force-pushed the cfetcher-only-needed-columns branch from fabc404 to 2d4149d Compare October 21, 2021 18:47
@yuzefovich yuzefovich force-pushed the cfetcher-only-needed-columns branch from 2d4149d to 5715d2e Compare October 22, 2021 18:36
@yuzefovich
Copy link
Copy Markdown
Member Author

Rebased on top of master, RFAL.

Copy link
Copy Markdown
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 19 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, @rharding6373, and @yuzefovich)


pkg/sql/colfetcher/cfetcher_setup.go, line 88 at r10 (raw file):

[0, 0, 1, 0].

Could you elaborate on how/why this idxMap was reached in this example? After walking through the code a couple times I'm still unsure why the idxMap isn't [1, 0, 0, 0] and why it's ok for indexes 2 and 4 to be 0 in the idxMap.


pkg/sql/colfetcher/cfetcher_setup.go, line 125 at r10 (raw file):

var idxMap []int

Could make this a named return value.


pkg/sql/colfetcher/cfetcher_setup.go, line 218 at r10 (raw file):

		neededColIdx := 0
		for idx, ok := neededColumnsSet.Next(0); ok; idx, ok = neededColumnsSet.Next(idx + 1) {
			idxMap[idx] = neededColIdx

This seems like duplicate work if idxMap was non-nil. Will idxMap change if it's already populated?

@yuzefovich yuzefovich force-pushed the cfetcher-only-needed-columns branch from 5715d2e to 4eda4aa Compare October 23, 2021 00:25
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)


pkg/sql/colfetcher/cfetcher_setup.go, line 88 at r10 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…
[0, 0, 1, 0].

Could you elaborate on how/why this idxMap was reached in this example? After walking through the code a couple times I'm still unsure why the idxMap isn't [1, 0, 0, 0] and why it's ok for indexes 2 and 4 to be 0 in the idxMap.

Expanded the comment. Values for indexed 2 and 4 are undefined (they should never be used anyway) because these columns are not present in the index. Added this to the comment as well.


pkg/sql/colfetcher/cfetcher_setup.go, line 125 at r10 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…
var idxMap []int

Could make this a named return value.

Done.


pkg/sql/colfetcher/cfetcher_setup.go, line 218 at r10 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

This seems like duplicate work if idxMap was non-nil. Will idxMap change if it's already populated?

Yes, it'll change and we do want for it to change. In a sense, we're applying a projection on top another projection. I expanded the comment here as well as added an example. Let me know if things now make sense.

Also inline a single function in the flow diagram.

Release note: None
@yuzefovich yuzefovich force-pushed the cfetcher-only-needed-columns branch from 4eda4aa to c5e4ea8 Compare October 26, 2021 22:44
@yuzefovich
Copy link
Copy Markdown
Member Author

Rebased on top of master. @rharding6373 I think you should be the main reviewer of this change.

Copy link
Copy Markdown
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the flow diagrams for EXPLAIN ANALYZE will refer to the remapped
ordinals rather than "absolute" ordinals among all columns in the table
This potentially will add confusion when debugging, no? Is there a method for people to find the mapping in the logs or elsewhere that can be added to the commit/PR message? If not, could we log the remapping?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, @rharding6373, and @yuzefovich)


pkg/sql/colfetcher/cfetcher_setup.go, line 223 at r12 (raw file):

		//
		// If non-nil idxMap was passed into this method, we have to update it
		// by essentially applying a projection on top of already present

nit: s/already/the already


pkg/sql/colfetcher/cfetcher_setup.go, line 237 at r12 (raw file):

		// update the index map to be
		//   idxMap = [x, 0, x, x]
		// and then to remap the post-processing spec below so that it refers to

nit: s/to remap/remap

@yuzefovich yuzefovich force-pushed the cfetcher-only-needed-columns branch from c5e4ea8 to 312927d Compare October 26, 2021 23:20
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to call out the impact. Before this change Out: @1 as in the diagram below
Screen Shot 2021-10-26 at 4.21.49 PM.png
always referred to the column with ordinal 1 in the whole table, but after this change in order to understand which column @1 is referring to, one would need to understand which columns are actually needed to be produced by the TableReader, so yeah, it can definitely lead to some confusion.

However, I think that we rarely if ever look at the concrete ordinals in that Out line - personally, I can't recall a single time where I needed to get that information, on the EXPLAIN (DISTSQL) diagrams I usually quickly glance only on the number of columns produced, and then if I want to understand which columns are propagated by each stage of processors, I'll look at another flavor of EXPLAIN. Assuming we're debugging with a stmt bundle, we'll have the correct information in other places.

In terms of exposing this remapping somewhere - I doubt anyone but engineers on SQL Queries will even think about using it, so I'd rather not tell users about it.

Still, let me take another stab at fixing this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)

This commit updates the cFetcher to operate only on the needed columns.
Previously, it would get all columns present in the whole table and
create `coldata.Batch`es for each of the columns, even when only
a subset of columns is needed (or even available if the index is not
covering).

For example, imagine we have a table like
```
t (a INT PRIMARY KEY, b INT, c INT, d INT, INDEX b_idx (b))
```
and we have a query like `SELECT b FROM t@b_idx`. Previously, we would
create a batch with 4 `int64` columns with only 1 (for column `b`) being
actually populated and all others marked as "not needed" and set to all
NULL values.

This is suboptimal, and this commit refactors things so that a batch
with only a single vector is created. This is achieved in two steps:
- first, we populate the slice of column descriptors that are accessible
from the index. These are all columns of the table for covering indexes.
For non-covering secondary indexes we only keep columns present in the
index
- next, we examine the set of columns that need to be fetched and prune
all not needed columns away.

Note that we are always careful to update the post-processing spec
accordingly so that the spec always refers to correct new ordinals.

It's worth pointing out that since we're modifying the spec directly, we
had to introduce some special logic to keep the original state of the
`PostProcessSpec` in order for the flow diagrams to refer to the
original columns.

Release note: None
@yuzefovich yuzefovich force-pushed the cfetcher-only-needed-columns branch from 312927d to 8e40f1c Compare October 27, 2021 00:48
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I figured out a way to keep the flow diagrams unchanged.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)

Copy link
Copy Markdown
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)

@yuzefovich
Copy link
Copy Markdown
Member Author

TFTR!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Oct 27, 2021

Build succeeded:

@craig craig bot merged commit df094a1 into cockroachdb:master Oct 27, 2021
@yuzefovich yuzefovich deleted the cfetcher-only-needed-columns branch October 27, 2021 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants