sql: don't scan unnecessary columns under JOINs#11736
sql: don't scan unnecessary columns under JOINs#11736RaduBerinde merged 5 commits intocockroachdb:masterfrom
Conversation
|
Regarding the first commit. I had to go and look at the source code because obviously EXPLAIN(VERBOSE) also lists a set of columns in the "result" part. There's something strange going on with scanNode, which is that it both has a set of columns in Now after reading this code, I seem to understand (please correct me if I'm wrong) that even though we do not return values for the other columns, we need them to be listed by Columns() because they may appear in the ordering information. The key concept here is, I believe, the idea that we can have columns that participate in the ordering but are not returned as values by the node. If my understanding is correct then:
LGTM for the 1st commit if my understanding is correct, otherwise I'd like to see your explanation. Reviewed 6 of 6 files at r1. Comments from Reviewable |
|
Regarding the 2nd commit. Please also add a field in Reviewed 26 of 26 files at r2. Comments from Reviewable |
|
Reviewed 1 of 1 files at r3. pkg/sql/select.go, line 408 at r3 (raw file):
Excellent. Truth be told I've been wanting this optimization for ages! 👍 Comments from Reviewable |
|
Excellent initiative. I miss the following two things:
Reviewed 3 of 3 files at r4. Comments from Reviewable |
|
I don't think it was driven by ordering, it was just easier to implement it this way. It is easier if the scanNode columns are always the table columns. It would be a pain if the I will add the I think the ghost sorting thing is not relevant for the current codebase - any columns we sort on will be marked as "needed" because they are added as renders by the sort node. We may want to do the same in the IR, at least for starters. But yeah it's definitely a thing to keep in mind. On recursing in the sub-node: the aim of this PR is to improve the common case of table joins; implementation of the new primitive for all nodes can be done later. In particular, it's not so easy to recurse through select nodes - you have to figure out which source columns are needed in whatever subset of renders is needed by the higher node. Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed. Comments from Reviewable |
|
Ok I agree that having the return value of Regarding the sort node the reason why I mentioned that ghost columns are useful is because this is the information that the sort node needs to decide whether to add the additional renders or not. If the data is already sorted on the required column, even if the data for that column is not present, then there is no additional render required. We already have this optimization in place by the way so if you have SELECT v FROM kv ORDER BY k and the ordering info on kv says the data is already ordered by k, the sort node will vanish. However its applicability is limited, due to the lack of proper "ghost column" annotations, for example in this case: Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed. Comments from Reviewable |
|
I agree. Maybe what we want is the ordering info to refer to columns not as indices in Anyway, about the case you mentioned, didn't we talk about this yesterday and concluded that we do add the Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed. Comments from Reviewable |
|
Yes currently it isn't but I wanted to share the thought anyway :) Looking forward to your extended commit message, the rest LGTM. |
|
Haha, it looks like the EXPLAIN (VERBOSE) output is non-deterministic Any ideas? How are we deciding which column shows up for the "equals", is there a |
|
Yes there is a map. It's the tableAlias map in the dataSourceInfo. If you can pinpoint where the problem is occurring I can perhaps suggest a better approach. |
|
Ah. I think it's |
|
No problem! (as long as the lookups still work fine...) |
ebfcfa9 to
ff4354d
Compare
|
Updated - added omitted field (first commit), and added a commit that changes the map for a slice. |
ff4354d to
402b58e
Compare
In some cases, results are not produced for all `Columns()` reported by a `planNode`. Specifically, scanNodes always present all table columns under `Columns()` even if not all of them are decoded. The reason for this is mostly for code simplicity - the needed columns are determined during planning and can change (e.g. if we are creating an indexJoin node); if the column schema would change, it would require remapping of indexed vars, ordering information, etc. In this change we add an `omitted` field to `ResultColumn` and show this information in `EXPLAIN (VERBOSE)`.
Generalizing the scanNode primitive that tells the node that only a subset of its `Columns()` are needed.
The sourceAliases map is causing nondeterministic `EXPLAIN (VERBOSE)` behavior. Replacing with a slice.
402b58e to
37d85c5
Compare
|
@knz - let me know if the last commit and the new version of the first commit look ok. |
|
Reviewed 8 of 37 files at r5, 1 of 1 files at r6, 23 of 26 files at r7, 1 of 1 files at r8, 1 of 3 files at r9, 4 of 4 files at r10. Comments from Reviewable |
|
TFTR! Comments from Reviewable |
Currently for any join, the scanNodes decode and return all the columns in the table. This is not ideal and will be even more of a problem with distSQL which uses the scanNodes for physical planning.
This set of commits plumbs the "val needed for column" information through joins.
CC @a6802739
This change is