sql,storage: add support for COL_BATCH_RESPONSE scan format#94438
sql,storage: add support for COL_BATCH_RESPONSE scan format#94438craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
186d49a to
a284dbb
Compare
07f8ff0 to
f2fc46d
Compare
d24e48c to
26940e1
Compare
3cb7225 to
c6a32ca
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @sumeerbhola)
pkg/roachpb/api.proto line 673 at r7 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit]
rowin->row inand same above.
Done.
pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
I think I still want to push on this a bit more, since it would be nice to decouple things. What if we added a method to the
NextKVerinterfaceSetFirstKVOfRowthat would be called by thecFetcherwhen it reachesstateDecodeFirstKVOfRow? I think that should givemvccScanFetchAdapterall the information it needs to implement my earlier suggestion, right?
Thanks for pushing on this. I implemented your suggestion in first WIP commit, but I didn't like one aspect of it - having to call SetFirstKeyOfRow on every row seen by the cFetcher, even when not using the direct columnar scans. I think it could have a noticeable performance impact, so I added another WIP commit which eliminates this function call on every SQL row and still decouples things as you suggested.
Overall, I probably like this change. My only hesitation is that now the trimming logic is split into two places - we rely on synchronizing singleResults returning the appropriate first key of the trimmed row and the cFetcher actually removing that row. However, the interfaces do seem cleaner, so I'm curious to hear what everyone thinks.
pkg/storage/col_mvcc.go line 110 at r7 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit]
stablenow just depends on the implementation and is always true or false, right? Consider either making it a separate method on the interface or making it a field on thecFetcheritself.
Good point, extracted into another method in NextKVer interface.
DrewKimball
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @michae2, @sumeerbhola, and @yuzefovich)
pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Thanks for pushing on this. I implemented your suggestion in first WIP commit, but I didn't like one aspect of it - having to call
SetFirstKeyOfRowon every row seen by thecFetcher, even when not using the direct columnar scans. I think it could have a noticeable performance impact, so I added another WIP commit which eliminates this function call on every SQL row and still decouples things as you suggested.Overall, I probably like this change. My only hesitation is that now the trimming logic is split into two places - we rely on synchronizing
singleResultsreturning the appropriate first key of the trimmed row and thecFetcheractually removing that row. However, the interfaces do seem cleaner, so I'm curious to hear what everyone thinks.
One last option - we could make the first key of the row an argument to NextKV.
Anyway, I think we've been pretty thorough at this point. I think I'd be happy with either of the WIPs you added. I'll do a final pass once you've finalized this bit but LGTM for now.
sumeerbhola
left a comment
There was a problem hiding this comment.
minor nits about comments. I am happy with the storage interfaces and code, so feel free to merge when @DrewKimball is satisfied.
Reviewed 1 of 3 files at r10.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @yuzefovich)
pkg/storage/col_mvcc.go line 246 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This part was very confusing - although the KV is unstable, in some cases the
cFetcherdoesn't need to hold onto it after the followingNextKVcall, so the copying is not needed (if there is only one column family in the table). The complication was that a copy needed to be performed for a very different reason in the old code path (in therow.KVFetcher). I refactored the code to clean things up, and now the implementation here should make more sense. In particular, we now just say that the returned KV is unstable, and thecFetcheritself decides whether to deep copy it or not.
Looks much cleaner now.
pkg/storage/col_mvcc.go line 124 at r10 (raw file):
// When (ok=false,partialRow=true) is returned, the caller is expected to // discard all KVs that were part of the last SQL row that was incomplete. // The scan will be resumed from the last key provided by SetFirstKeyOfRow
was this supposed to say FirstKeyOfRowGetter?
pkg/storage/pebble_mvcc_scanner.go line 76 at r6 (raw file):
To me saying "before calling put() with the same key" includes both of the cases - i.e. the current getOne call hasn't called put with this key, nor the previous getOne calls have - which leads to this correct assumption
getOne could call put with key a and then enquire about key b, even though it has no intention of call put with key b in this call. Seems something that is permitted by this contract.
Even if it was not permitted, anything like this that is not obvious is worth spelling out in a code comment so that future readers can understand this super quickly.
Can you add something here about both the contract and how it implies that nothing is buffered in the singleResults implementation of results. I don't mind comments here that refer to any of the implementations since it helps folks understand why these constraints are being defined. Abstraction is good in terms of having clean code but we often want to understand why this abstraction its various invariants.
c6a32ca to
00362eb
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)
pkg/sql/vars.go line 490 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
IIRC
ALTER ROLE ALLdoesn't affectrootuser, and for logic tests I think we do userootoften. The direct columnar scans need to be disabled for some types of logic tests, and it seems like we do need an explicit cluster setting to handle that. Perhaps we'd just need to do bothALTER ROLE ALLandALTER ROLE root. Happy to be taught here since the choice of using the cluster setting can be driven simply by me being used to this way.
My current understanding of the situation is that we no longer allow cluster settings with sql.defaults prefix and perhaps we stir away engineers from introducing public cluster settings that also have corresponding session variables. In this case, this is a private undocumented setting, and I think it's ok to keep it that way.
That said, I did ask in slack, will see what SQL Session team recommends.
pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
One last option - we could make the first key of the row an argument to
NextKV.Anyway, I think we've been pretty thorough at this point. I think I'd be happy with either of the WIPs you added. I'll do a final pass once you've finalized this bit but LGTM for now.
I like Getter method approach the most - passing a key when calling NextKV (which returns a key) seems wrong.
pkg/storage/col_mvcc.go line 124 at r10 (raw file):
Previously, sumeerbhola wrote…
was this supposed to say
FirstKeyOfRowGetter?
Yes, this was a leftover, fixed.
pkg/storage/pebble_mvcc_scanner.go line 76 at r6 (raw file):
Previously, sumeerbhola wrote…
To me saying "before calling put() with the same key" includes both of the cases - i.e. the current getOne call hasn't called put with this key, nor the previous getOne calls have - which leads to this correct assumption
getOnecould callputwith keyaand then enquire about keyb, even though it has no intention of callputwith keybin this call. Seems something that is permitted by this contract.
Even if it was not permitted, anything like this that is not obvious is worth spelling out in a code comment so that future readers can understand this super quickly.
Can you add something here about both the contract and how it implies that nothing is buffered in thesingleResultsimplementation ofresults. I don't mind comments here that refer to any of the implementations since it helps folks understand why these constraints are being defined. Abstraction is good in terms of having clean code but we often want to understand why this abstraction its various invariants.
Thanks, makes sense, added this information into the comments.
DrewKimball
left a comment
There was a problem hiding this comment.
Reviewed 8 of 22 files at r7, 10 of 10 files at r11, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2, @sumeerbhola, and @yuzefovich)
|
I've got some performance numbers to share. The microbenchmarks are as expected:
TPCH numbers though aren't as good: This is a bit disappointing given the original very good numbers some time ago in the prototype. I did a bisect last week and couldn't reproduce those results. I'll do another bisect tomorrow, but most likely I messed something up when obtaining those numbers (although I do remember collecting them twice). |
sumeerbhola
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @yuzefovich)
pkg/storage/pebble_mvcc_scanner.go line 80 at r11 (raw file):
// returned on the NextKVer.NextKV call). Therefore, the singleResults can // make a determination on its own whether the given key belongs to the // first SQL row.
Now that singleResults is making a determination on its own, this constraint is not needed, yes?
It copies the first row eagerly in put, so even if that KV is still buffered, it is ok.
00362eb to
cbb68d0
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)
pkg/storage/pebble_mvcc_scanner.go line 80 at r11 (raw file):
Previously, sumeerbhola wrote…
Now that
singleResultsis making a determination on its own, this constraint is not needed, yes?
It copies the first row eagerly input, so even if that KV is still buffered, it is ok.
Yes, you're right. I tightened the contract here a bit further since continuesFirstRow is only called if put has already been called at least once.
|
For reference, with the local fastpath for in-memory tenants (WIP in #95033) the benchmarks are here (the comparison is master vs this PR + #95033 with direct columnar scans always enabled). They show that for in-memory tenants (this is how UA will be achieved for self-hosted customers) the KV projection pushdown work already has benefits in some scenarios, and I hope with more targeted optimization work we can improve there noticeably. |
|
Finally, I was able to reproduce the "great" results I obtained on the prototype two months ago. In short, they were due to a subtle bug in In particular, as can be seen in #95793 (ignoring the last commit) the prototype had the following: after processing the Scan request via the direct columnar scans. As a result, after evaluating the Scan we didn't update limits of the Once this bug is fixed (i.e. the last commit in #95793 is applied), we get the results that show effectively no difference of the prototype on TPCH queries: (It's interesting to observe that on the current polished PR there are some noticeable improvements and noticeable regressions, I haven't explored this in-depth though, and will do so in the next milestone.) This begs the question whether we should increase these limits since it has such non-trivial impact on the performance (TBD whether this impact is more pronounced in the multi-tenant setups and whether the usage of the direct columnar scans play a role too). Perhaps, if we set Thus, I think there are no more blockers for this PR to be merged. It shows an expected improvement in microbenchmarks when only a subset of columns is needed in multi-tenant separate-process setups. I'll wait for Michael's approval. |
Maybe this isn't the best place to discuss, but I will anyway :) Some memory usage in queries is hard to throttle up and down, but it seems like batch sizing isn't one of those things - the |
This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans
and ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.
The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).
Here is the outline of all components involved:
┌────────────────────────────────────────────────┐
│ SQL │
│________________________________________________│
│ colfetcher.ColBatchDirectScan │
│ │ │
│ ▼ │
│ row.txnKVFetcher │
│ (behind the row.KVBatchFetcher interface) │
└────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ KV Client │
└────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ KV Server │
│________________________________________________│
│ colfetcher.cFetcherWrapper │
│ (behind the storage.CFetcherWrapper interface) │
│ │ │
│ ▼ │
│ colfetcher.cFetcher │
│ │ │
│ ▼ │
│ storage.mvccScanFetchAdapter ────────┐│
│ (behind the storage.NextKVer interface) ││
│ │ ││
│ ▼ ││
│ storage.pebbleMVCCScanner ││
│ (which put's KVs into storage.singleResults) <┘│
└────────────────────────────────────────────────┘
On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with
the `COL_BATCH_RESPONSE` format and returns the response (which contains the
columnar data) to the `colfetcher.ColBatchDirectScan`.
On the KV server side, we create a `storage.CFetcherWrapper` that asks the
`colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn,
fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
`mvccScanFetchAdapter` which - via the `singleResults` struct - exposes
access to the current KV that the `pebbleMVCCScanner` is pointing at.
Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
`storage.singleResults.maybeTrimPartialLastRow` must be in sync with the
`colfetcher.cFetcher` which is achieved by
- the `cFetcher` exposing access to the first key of the last incomplete SQL
row via the `FirstKeyOfRowGetter`,
- the `singleResults` using that key as the resume key for the response,
- and the `cFetcher` removing that last partial SQL row when `NextKV()`
returns `partialRow=true`.
This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the `storage.pebbleResults` via tracking
offsets into the `pebbleResults.repr`.)
This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the `pebbleMVCCScanner` operates and the
"pull" mode that the `NextKVer` exposes. The adaption between two different
modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control
returns to it) the current unstable KV pair from the `singleResults` struct
which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into.
Second, in order be able to use the unstable KV pair without performing a
copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the
control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to
the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when
the next KV pair is needed.
There are multiple scenarios which are currently not supported:
- SQL cannot issue Get requests (likely will support in 23.1)
- `TraceKV` option is not supported (likely will support in 23.1)
- user-defined types other than enums are not supported (will _not_
support in 23.1)
- non-default key locking strength as well as SKIP LOCKED wait policy
are not supported (will _not_ support in 23.1).
The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.
The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719)
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.
The TPCH numbers, though, don't show the expected speedup:
```
Q1: before: 11.47s after: 8.84s -22.89%
Q2: before: 0.41s after: 0.29s -27.71%
Q3: before: 7.89s after: 9.68s 22.63%
Q4: before: 4.48s after: 4.52s 0.86%
Q5: before: 10.39s after: 10.35s -0.29%
Q6: before: 33.57s after: 33.41s -0.48%
Q7: before: 23.82s after: 23.81s -0.02%
Q8: before: 3.78s after: 3.76s -0.68%
Q9: before: 28.15s after: 28.03s -0.42%
Q10: before: 5.00s after: 4.98s -0.42%
Q11: before: 2.44s after: 2.44s 0.22%
Q12: before: 34.78s after: 34.65s -0.37%
Q13: before: 3.20s after: 2.94s -8.28%
Q14: before: 3.13s after: 3.21s 2.43%
Q15: before: 16.80s after: 16.73s -0.38%
Q16: before: 1.60s after: 1.65s 2.96%
Q17: before: 0.85s after: 0.96s 13.04%
Q18: before: 16.39s after: 15.47s -5.61%
Q19: before: 13.76s after: 13.01s -5.45%
Q20: before: 55.33s after: 55.12s -0.38%
Q21: before: 24.31s after: 24.31s -0.00%
Q22: before: 1.28s after: 1.41s 10.26%
```
Note that much better numbers were obtained on a buggy prototype. Those
numbers are invalid, and more details can be found
[here](cockroachdb#94438 (comment)).
At the moment, `coldata.Batch` that is included into the response is
always serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.
A quick note on the TODOs sprinkled in this commit:
- `TODO(yuzefovich)` means that this will be left for 23.2 or later.
- `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1.
A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in `storage` package) seems rather annoying since we must
create keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
`fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems
ok to me given that the "meat" of the work there is still done by the
`pebbleMVCCScanner` which is exercised using the regular Scans.
End-to-end testing is well covered by all of our existing tests which
now runs randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.
Release note: None
cbb68d0 to
2ae0116
Compare
This sounds reasonable. However, setting reasonable value for Also, I just ran some quick benchmarks (using Still, we now know for sure how those "great" numbers from the prototype were obtained. |
michae2
left a comment
There was a problem hiding this comment.
Nice work! Thanks for waiting for me. Thank you for reworking those layer violations.
Interesting about the NumKeys / NumBytes omission causing the performance difference. Seems really promising, if we can figure out what else they affect besides BatchRequest limits.
Reviewed 2 of 55 files at r1, 2 of 15 files at r5, 11 of 22 files at r7, 7 of 10 files at r11, 2 of 3 files at r12, 10 of 11 files at r13, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @sumeerbhola, and @yuzefovich)
pkg/sql/vars.go line 490 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
My current understanding of the situation is that we no longer allow cluster settings with
sql.defaultsprefix and perhaps we stir away engineers from introducing public cluster settings that also have corresponding session variables. In this case, this is a private undocumented setting, and I think it's ok to keep it that way.That said, I did ask in slack, will see what SQL Session team recommends.
TIL about ALTER ROLE not working for root. Ok, this is fine.
pkg/storage/col_mvcc.go line 229 at r13 (raw file):
// The KV wasn't added for whatever reason (e.g. it could have been // skipped over due to having been deleted), so just move on. return f.NextKV(ctx, mvccDecodingStrategy)
nit: Using recursion here makes me a little nervous. I don't think Go has tail-call optimization, so the stack could grow deep if we have to skip over many KVs. Instead of recursion could we use either goto or a loop?
This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans
and ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.
The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).
Here is the outline of all components involved:
┌────────────────────────────────────────────────┐
│ SQL │
│________________________________________________│
│ colfetcher.ColBatchDirectScan │
│ │ │
│ ▼ │
│ row.txnKVFetcher │
│ (behind the row.KVBatchFetcher interface) │
└────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ KV Client │
└────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ KV Server │
│________________________________________________│
│ colfetcher.cFetcherWrapper │
│ (behind the storage.CFetcherWrapper interface) │
│ │ │
│ ▼ │
│ colfetcher.cFetcher │
│ │ │
│ ▼ │
│ storage.mvccScanFetchAdapter ────────┐│
│ (behind the storage.NextKVer interface) ││
│ │ ││
│ ▼ ││
│ storage.pebbleMVCCScanner ││
│ (which put's KVs into storage.singleResults) <┘│
└────────────────────────────────────────────────┘
On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with
the `COL_BATCH_RESPONSE` format and returns the response (which contains the
columnar data) to the `colfetcher.ColBatchDirectScan`.
On the KV server side, we create a `storage.CFetcherWrapper` that asks the
`colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn,
fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
`mvccScanFetchAdapter` which - via the `singleResults` struct - exposes
access to the current KV that the `pebbleMVCCScanner` is pointing at.
Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
`storage.singleResults.maybeTrimPartialLastRow` must be in sync with the
`colfetcher.cFetcher` which is achieved by
- the `cFetcher` exposing access to the first key of the last incomplete SQL
row via the `FirstKeyOfRowGetter`,
- the `singleResults` using that key as the resume key for the response,
- and the `cFetcher` removing that last partial SQL row when `NextKV()`
returns `partialRow=true`.
This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the `storage.pebbleResults` via tracking
offsets into the `pebbleResults.repr`.)
This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the `pebbleMVCCScanner` operates and the
"pull" mode that the `NextKVer` exposes. The adaption between two different
modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control
returns to it) the current unstable KV pair from the `singleResults` struct
which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into.
Second, in order be able to use the unstable KV pair without performing a
copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the
control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to
the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when
the next KV pair is needed.
There are multiple scenarios which are currently not supported:
- SQL cannot issue Get requests (likely will support in 23.1)
- `TraceKV` option is not supported (likely will support in 23.1)
- user-defined types other than enums are not supported (will _not_
support in 23.1)
- non-default key locking strength as well as SKIP LOCKED wait policy
are not supported (will _not_ support in 23.1).
The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.
The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719)
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.
The TPCH numbers, though, don't show the expected speedup:
```
Q1: before: 11.47s after: 8.84s -22.89%
Q2: before: 0.41s after: 0.29s -27.71%
Q3: before: 7.89s after: 9.68s 22.63%
Q4: before: 4.48s after: 4.52s 0.86%
Q5: before: 10.39s after: 10.35s -0.29%
Q6: before: 33.57s after: 33.41s -0.48%
Q7: before: 23.82s after: 23.81s -0.02%
Q8: before: 3.78s after: 3.76s -0.68%
Q9: before: 28.15s after: 28.03s -0.42%
Q10: before: 5.00s after: 4.98s -0.42%
Q11: before: 2.44s after: 2.44s 0.22%
Q12: before: 34.78s after: 34.65s -0.37%
Q13: before: 3.20s after: 2.94s -8.28%
Q14: before: 3.13s after: 3.21s 2.43%
Q15: before: 16.80s after: 16.73s -0.38%
Q16: before: 1.60s after: 1.65s 2.96%
Q17: before: 0.85s after: 0.96s 13.04%
Q18: before: 16.39s after: 15.47s -5.61%
Q19: before: 13.76s after: 13.01s -5.45%
Q20: before: 55.33s after: 55.12s -0.38%
Q21: before: 24.31s after: 24.31s -0.00%
Q22: before: 1.28s after: 1.41s 10.26%
```
Note that much better numbers were obtained on a buggy prototype. Those
numbers are invalid, and more details can be found
[here](cockroachdb#94438 (comment)).
At the moment, `coldata.Batch` that is included into the response is
always serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.
A quick note on the TODOs sprinkled in this commit:
- `TODO(yuzefovich)` means that this will be left for 23.2 or later.
- `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1.
A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in `storage` package) seems rather annoying since we must
create keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
`fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems
ok to me given that the "meat" of the work there is still done by the
`pebbleMVCCScanner` which is exercised using the regular Scans.
End-to-end testing is well covered by all of our existing tests which
now runs randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.
Release note: None
2ae0116 to
556cb98
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
I needed to adjust one of the upgrade tests which was incorrectly written.
Thanks for the reviews everyone!
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)
pkg/storage/col_mvcc.go line 229 at r13 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: Using recursion here makes me a little nervous. I don't think Go has tail-call optimization, so the stack could grow deep if we have to skip over many KVs. Instead of recursion could we use either
gotoor a loop?
I didn't see any usages of goto in our codebase, so I opted for the loop. However, I'm a bit concerned that the loop will have noticeable performance impact, so I left a TODO to explore this - if it does have a noticeable impact, then I'd rather keep the recursion since for the recursion to lead to a crash we'd need like 1 billion KVs to be skipped which I'm assuming should not be possible if the compaction are working as intended.
|
Build succeeded: |
This commit introduces a new
COL_BATCH_RESPONSEscan format for Scansand ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.
The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).
Here is the outline of all components involved:
On the KV client side,
row.txnKVFetcherissues Scans and ReverseScans withthe
COL_BATCH_RESPONSEformat and returns the response (which contains thecolumnar data) to the
colfetcher.ColBatchDirectScan.On the KV server side, we create a
storage.CFetcherWrapperthat asks thecolfetcher.cFetcherfor the nextcoldata.Batch. ThecFetcher, in turn,fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
mvccScanFetchAdapterwhich - via thesingleResultsstruct - exposesaccess to the current KV that the
pebbleMVCCScanneris pointing at.Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
storage.singleResults.maybeTrimPartialLastRowmust be in sync with thecolfetcher.cFetcherwhich is achieved bycFetcherexposing access to the first key of the last incomplete SQLrow via the
FirstKeyOfRowGetter,singleResultsusing that key as the resume key for the response,cFetcherremoving that last partial SQL row whenNextKV()returns
partialRow=true.This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the
storage.pebbleResultsvia trackingoffsets into the
pebbleResults.repr.)This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the
pebbleMVCCScanneroperates and the"pull" mode that the
NextKVerexposes. The adaption between two differentmodes is achieved via the
mvccScanFetcherAdaptergrabbing (when the controlreturns to it) the current unstable KV pair from the
singleResultsstructwhich serves as a one KV pair buffer that the
pebbleMVCCScannerputs into.Second, in order be able to use the unstable KV pair without performing a
copy, the
pebbleMVCCScannerstops at the current KV pair and returns thecontrol flow (which is exactly what
pebbleMVCCScanner.getOnedoes) back tothe
mvccScanFetcherAdapter, with the adapter advancing the scanner only whenthe next KV pair is needed.
There are multiple scenarios which are currently not supported:
TraceKVoption is not supported (likely will support in 23.1)support in 23.1)
are not supported (will not support in 23.1).
The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.
The microbenchmarks show
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.
The TPCH numbers, though, don't show the expected speedup:
At the moment,
coldata.Batchthat is included into the response isalways serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.
A quick note on the TODOs sprinkled in this commit:
TODO(yuzefovich)means that this will be left for 23.2 or later.TODO(yuzefovich, 23.1)means that it should be addressed in 23.1.A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in
storagepackage) seems rather annoying since we mustcreate keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
fetchpb.IndexFetchSpec. Not having unit tests in thestorageseemsok to me given that the "meat" of the work there is still done by the
pebbleMVCCScannerwhich is exercised using the regular Scans.End-to-end testing is well covered by all of our existing tests which
now runs randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.
Addresses: #82323.
Informs: #87610.
Epic: CRDB-14837
Release note: None