colfetcher: produce batches only of needed columns#71441
colfetcher: produce batches only of needed columns#71441craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
e927cad to
4dc24de
Compare
|
Benchmarks of the hot path of the last two commits: |
d2e706e to
fabc404
Compare
This comment has been minimized.
This comment has been minimized.
fabc404 to
2d4149d
Compare
2d4149d to
5715d2e
Compare
|
Rebased on top of master, RFAL. |
rharding6373
left a comment
There was a problem hiding this comment.
Reviewed 6 of 19 files at r8.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, @rharding6373, and @yuzefovich)
pkg/sql/colfetcher/cfetcher_setup.go, line 88 at r10 (raw file):
[0, 0, 1, 0].
Could you elaborate on how/why this idxMap was reached in this example? After walking through the code a couple times I'm still unsure why the idxMap isn't [1, 0, 0, 0] and why it's ok for indexes 2 and 4 to be 0 in the idxMap.
pkg/sql/colfetcher/cfetcher_setup.go, line 125 at r10 (raw file):
var idxMap []int
Could make this a named return value.
pkg/sql/colfetcher/cfetcher_setup.go, line 218 at r10 (raw file):
neededColIdx := 0 for idx, ok := neededColumnsSet.Next(0); ok; idx, ok = neededColumnsSet.Next(idx + 1) { idxMap[idx] = neededColIdx
This seems like duplicate work if idxMap was non-nil. Will idxMap change if it's already populated?
5715d2e to
4eda4aa
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)
pkg/sql/colfetcher/cfetcher_setup.go, line 88 at r10 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
[0, 0, 1, 0].Could you elaborate on how/why this idxMap was reached in this example? After walking through the code a couple times I'm still unsure why the idxMap isn't [1, 0, 0, 0] and why it's ok for indexes 2 and 4 to be 0 in the idxMap.
Expanded the comment. Values for indexed 2 and 4 are undefined (they should never be used anyway) because these columns are not present in the index. Added this to the comment as well.
pkg/sql/colfetcher/cfetcher_setup.go, line 125 at r10 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
var idxMap []intCould make this a named return value.
Done.
pkg/sql/colfetcher/cfetcher_setup.go, line 218 at r10 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
This seems like duplicate work if idxMap was non-nil. Will idxMap change if it's already populated?
Yes, it'll change and we do want for it to change. In a sense, we're applying a projection on top another projection. I expanded the comment here as well as added an example. Let me know if things now make sense.
Also inline a single function in the flow diagram. Release note: None
4eda4aa to
c5e4ea8
Compare
|
Rebased on top of master. @rharding6373 I think you should be the main reviewer of this change. |
rharding6373
left a comment
There was a problem hiding this comment.
the flow diagrams for EXPLAIN ANALYZE will refer to the remapped
ordinals rather than "absolute" ordinals among all columns in the table
This potentially will add confusion when debugging, no? Is there a method for people to find the mapping in the logs or elsewhere that can be added to the commit/PR message? If not, could we log the remapping?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, @rharding6373, and @yuzefovich)
pkg/sql/colfetcher/cfetcher_setup.go, line 223 at r12 (raw file):
// // If non-nil idxMap was passed into this method, we have to update it // by essentially applying a projection on top of already present
nit: s/already/the already
pkg/sql/colfetcher/cfetcher_setup.go, line 237 at r12 (raw file):
// update the index map to be // idxMap = [x, 0, x, x] // and then to remap the post-processing spec below so that it refers to
nit: s/to remap/remap
c5e4ea8 to
312927d
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Just want to call out the impact. Before this change Out: @1 as in the diagram below
always referred to the column with ordinal 1 in the whole table, but after this change in order to understand which column @1 is referring to, one would need to understand which columns are actually needed to be produced by the TableReader, so yeah, it can definitely lead to some confusion.
However, I think that we rarely if ever look at the concrete ordinals in that Out line - personally, I can't recall a single time where I needed to get that information, on the EXPLAIN (DISTSQL) diagrams I usually quickly glance only on the number of columns produced, and then if I want to understand which columns are propagated by each stage of processors, I'll look at another flavor of EXPLAIN. Assuming we're debugging with a stmt bundle, we'll have the correct information in other places.
In terms of exposing this remapping somewhere - I doubt anyone but engineers on SQL Queries will even think about using it, so I'd rather not tell users about it.
Still, let me take another stab at fixing this.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)
This commit updates the cFetcher to operate only on the needed columns. Previously, it would get all columns present in the whole table and create `coldata.Batch`es for each of the columns, even when only a subset of columns is needed (or even available if the index is not covering). For example, imagine we have a table like ``` t (a INT PRIMARY KEY, b INT, c INT, d INT, INDEX b_idx (b)) ``` and we have a query like `SELECT b FROM t@b_idx`. Previously, we would create a batch with 4 `int64` columns with only 1 (for column `b`) being actually populated and all others marked as "not needed" and set to all NULL values. This is suboptimal, and this commit refactors things so that a batch with only a single vector is created. This is achieved in two steps: - first, we populate the slice of column descriptors that are accessible from the index. These are all columns of the table for covering indexes. For non-covering secondary indexes we only keep columns present in the index - next, we examine the set of columns that need to be fetched and prune all not needed columns away. Note that we are always careful to update the post-processing spec accordingly so that the spec always refers to correct new ordinals. It's worth pointing out that since we're modifying the spec directly, we had to introduce some special logic to keep the original state of the `PostProcessSpec` in order for the flow diagrams to refer to the original columns. Release note: None
312927d to
8e40f1c
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Ok, I think I figured out a way to keep the flow diagrams unchanged.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)
rharding6373
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @cucaroach, @mgartner, and @rharding6373)
|
TFTR! bors r+ |
|
Build succeeded: |

colbuilder: reuse the SemaCtx
Also inline a single function in the flow diagram.
Release note: None
colfetcher: produce batches only of needed columns
This commit updates the cFetcher to operate only on the needed columns.
Previously, it would get all columns present in the whole table and
create
coldata.Batches for each of the columns, even when onlya subset of columns is needed (or even available if the index is not
covering).
For example, imagine we have a table like
and we have a query like
SELECT b FROM t@b_idx. Previously, we wouldcreate a batch with 4
int64columns with only 1 (for columnb) beingactually populated and all others marked as "not needed" and set to all
NULL values.
This is suboptimal, and this commit refactors things so that a batch
with only a single vector is created. This is achieved in two steps:
from the index. These are all columns of the table for covering indexes.
For non-covering secondary indexes we only keep columns present in the
index
all not needed columns away.
Note that we are always careful to update the post-processing spec
accordingly so that the spec always refers to correct new ordinals.
It's worth pointing out that since we're modifying the spec directly, we
had to introduce some special logic to keep the original state of the
PostProcessSpecin order for the flow diagrams to refer to theoriginal columns.
Release note: None