sql,storage: add support for COL_BATCH_RESPONSE scan format by yuzefovich · Pull Request #94438 · cockroachdb/cockroach

yuzefovich · 2022-12-29T21:28:07Z

This commit introduces a new COL_BATCH_RESPONSE scan format for Scans
and ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.

The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).

Here is the outline of all components involved:

     ┌────────────────────────────────────────────────┐
     │                       SQL                      │
     │________________________________________________│
     │          colfetcher.ColBatchDirectScan         │
     │                        │                       │
     │                        ▼                       │
     │                 row.txnKVFetcher               │
     │    (behind the row.KVBatchFetcher interface)   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Client                   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Server                   │
     │________________________________________________│
     │           colfetcher.cFetcherWrapper           │
     │ (behind the storage.CFetcherWrapper interface) │
     │                        │                       │
     │                        ▼                       │
     │              colfetcher.cFetcher               │
     │                        │                       │
     │                        ▼                       │
     │          storage.mvccScanFetchAdapter ────────┐│
     │    (behind the storage.NextKVer interface)    ││
     │                        │                      ││
     │                        ▼                      ││
     │           storage.pebbleMVCCScanner           ││
     │ (which put's KVs into storage.singleResults) <┘│
     └────────────────────────────────────────────────┘

On the KV client side, row.txnKVFetcher issues Scans and ReverseScans with
the COL_BATCH_RESPONSE format and returns the response (which contains the
columnar data) to the colfetcher.ColBatchDirectScan.

On the KV server side, we create a storage.CFetcherWrapper that asks the
colfetcher.cFetcher for the next coldata.Batch. The cFetcher, in turn,
fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
mvccScanFetchAdapter which - via the singleResults struct - exposes
access to the current KV that the pebbleMVCCScanner is pointing at.

Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
storage.singleResults.maybeTrimPartialLastRow must be in sync with the
colfetcher.cFetcher which is achieved by

the cFetcher exposing access to the first key of the last incomplete SQL
row via the FirstKeyOfRowGetter,
the singleResults using that key as the resume key for the response,
and the cFetcher removing that last partial SQL row when NextKV()
returns partialRow=true.
This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the storage.pebbleResults via tracking
offsets into the pebbleResults.repr.)

This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the pebbleMVCCScanner operates and the
"pull" mode that the NextKVer exposes. The adaption between two different
modes is achieved via the mvccScanFetcherAdapter grabbing (when the control
returns to it) the current unstable KV pair from the singleResults struct
which serves as a one KV pair buffer that the pebbleMVCCScanner puts into.
Second, in order be able to use the unstable KV pair without performing a
copy, the pebbleMVCCScanner stops at the current KV pair and returns the
control flow (which is exactly what pebbleMVCCScanner.getOne does) back to
the mvccScanFetcherAdapter, with the adapter advancing the scanner only when
the next KV pair is needed.

There are multiple scenarios which are currently not supported:

SQL cannot issue Get requests (likely will support in 23.1)
TraceKV option is not supported (likely will support in 23.1)
user-defined types other than enums are not supported (will not
support in 23.1)
non-default key locking strength as well as SKIP LOCKED wait policy
are not supported (will not support in 23.1).

The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.

The microbenchmarks show
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.

The TPCH numbers, though, don't show the expected speedup:

Q1:	before: 11.47s	after: 8.84s	 -22.89%
Q2:	before: 0.41s	after: 0.29s	 -27.71%
Q3:	before: 7.89s	after: 9.68s	 22.63%
Q4:	before: 4.48s	after: 4.52s	 0.86%
Q5:	before: 10.39s	after: 10.35s	 -0.29%
Q6:	before: 33.57s	after: 33.41s	 -0.48%
Q7:	before: 23.82s	after: 23.81s	 -0.02%
Q8:	before: 3.78s	after: 3.76s	 -0.68%
Q9:	before: 28.15s	after: 28.03s	 -0.42%
Q10:	before: 5.00s	after: 4.98s	 -0.42%
Q11:	before: 2.44s	after: 2.44s	 0.22%
Q12:	before: 34.78s	after: 34.65s	 -0.37%
Q13:	before: 3.20s	after: 2.94s	 -8.28%
Q14:	before: 3.13s	after: 3.21s	 2.43%
Q15:	before: 16.80s	after: 16.73s	 -0.38%
Q16:	before: 1.60s	after: 1.65s	 2.96%
Q17:	before: 0.85s	after: 0.96s	 13.04%
Q18:	before: 16.39s	after: 15.47s	 -5.61%
Q19:	before: 13.76s	after: 13.01s	 -5.45%
Q20:	before: 55.33s	after: 55.12s	 -0.38%
Q21:	before: 24.31s	after: 24.31s	 -0.00%
Q22:	before: 1.28s	after: 1.41s	 10.26%

At the moment, coldata.Batch that is included into the response is
always serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.

A quick note on the TODOs sprinkled in this commit:

TODO(yuzefovich) means that this will be left for 23.2 or later.
TODO(yuzefovich, 23.1) means that it should be addressed in 23.1.

A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in storage package) seems rather annoying since we must
create keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
fetchpb.IndexFetchSpec. Not having unit tests in the storage seems
ok to me given that the "meat" of the work there is still done by the
pebbleMVCCScanner which is exercised using the regular Scans.
End-to-end testing is well covered by all of our existing tests which
now runs randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.

Addresses: #82323.
Informs: #87610.

Epic: CRDB-14837

Release note: None

cockroach-teamcity · 2022-12-29T21:28:15Z

This change is

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @sumeerbhola)

pkg/roachpb/api.proto line 673 at r7 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] rowin -> row in and same above.

Done.

pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

I think I still want to push on this a bit more, since it would be nice to decouple things. What if we added a method to the NextKVer interface SetFirstKVOfRow that would be called by the cFetcher when it reaches stateDecodeFirstKVOfRow? I think that should give mvccScanFetchAdapter all the information it needs to implement my earlier suggestion, right?

Thanks for pushing on this. I implemented your suggestion in first WIP commit, but I didn't like one aspect of it - having to call SetFirstKeyOfRow on every row seen by the cFetcher, even when not using the direct columnar scans. I think it could have a noticeable performance impact, so I added another WIP commit which eliminates this function call on every SQL row and still decouples things as you suggested.

Overall, I probably like this change. My only hesitation is that now the trimming logic is split into two places - we rely on synchronizing singleResults returning the appropriate first key of the trimmed row and the cFetcher actually removing that row. However, the interfaces do seem cleaner, so I'm curious to hear what everyone thinks.

pkg/storage/col_mvcc.go line 110 at r7 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] stable now just depends on the implementation and is always true or false, right? Consider either making it a separate method on the interface or making it a field on the cFetcher itself.

Good point, extracted into another method in NextKVer interface.

DrewKimball

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @sumeerbhola, and @yuzefovich)

pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Thanks for pushing on this. I implemented your suggestion in first WIP commit, but I didn't like one aspect of it - having to call SetFirstKeyOfRow on every row seen by the cFetcher, even when not using the direct columnar scans. I think it could have a noticeable performance impact, so I added another WIP commit which eliminates this function call on every SQL row and still decouples things as you suggested.

Overall, I probably like this change. My only hesitation is that now the trimming logic is split into two places - we rely on synchronizing singleResults returning the appropriate first key of the trimmed row and the cFetcher actually removing that row. However, the interfaces do seem cleaner, so I'm curious to hear what everyone thinks.

One last option - we could make the first key of the row an argument to NextKV.

Anyway, I think we've been pretty thorough at this point. I think I'd be happy with either of the WIPs you added. I'll do a final pass once you've finalized this bit but LGTM for now.

sumeerbhola

minor nits about comments. I am happy with the storage interfaces and code, so feel free to merge when @DrewKimball is satisfied.

Reviewed 1 of 3 files at r10.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @yuzefovich)

pkg/storage/col_mvcc.go line 246 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This part was very confusing - although the KV is unstable, in some cases the cFetcher doesn't need to hold onto it after the following NextKV call, so the copying is not needed (if there is only one column family in the table). The complication was that a copy needed to be performed for a very different reason in the old code path (in the row.KVFetcher). I refactored the code to clean things up, and now the implementation here should make more sense. In particular, we now just say that the returned KV is unstable, and the cFetcher itself decides whether to deep copy it or not.

Looks much cleaner now.

pkg/storage/col_mvcc.go line 124 at r10 (raw file):

	// When (ok=false,partialRow=true) is returned, the caller is expected to
	// discard all KVs that were part of the last SQL row that was incomplete.
	// The scan will be resumed from the last key provided by SetFirstKeyOfRow

was this supposed to say FirstKeyOfRowGetter?

pkg/storage/pebble_mvcc_scanner.go line 76 at r6 (raw file):

To me saying "before calling put() with the same key" includes both of the cases - i.e. the current getOne call hasn't called put with this key, nor the previous getOne calls have - which leads to this correct assumption

getOne could call put with key a and then enquire about key b, even though it has no intention of call put with key b in this call. Seems something that is permitted by this contract.
Even if it was not permitted, anything like this that is not obvious is worth spelling out in a code comment so that future readers can understand this super quickly.
Can you add something here about both the contract and how it implies that nothing is buffered in the singleResults implementation of results. I don't mind comments here that refer to any of the implementations since it helps folks understand why these constraints are being defined. Abstraction is good in terms of having clean code but we often want to understand why this abstraction its various invariants.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)

pkg/sql/vars.go line 490 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

IIRC ALTER ROLE ALL doesn't affect root user, and for logic tests I think we do use root often. The direct columnar scans need to be disabled for some types of logic tests, and it seems like we do need an explicit cluster setting to handle that. Perhaps we'd just need to do both ALTER ROLE ALL and ALTER ROLE root. Happy to be taught here since the choice of using the cluster setting can be driven simply by me being used to this way.

My current understanding of the situation is that we no longer allow cluster settings with sql.defaults prefix and perhaps we stir away engineers from introducing public cluster settings that also have corresponding session variables. In this case, this is a private undocumented setting, and I think it's ok to keep it that way.

That said, I did ask in slack, will see what SQL Session team recommends.

pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

One last option - we could make the first key of the row an argument to NextKV.

Anyway, I think we've been pretty thorough at this point. I think I'd be happy with either of the WIPs you added. I'll do a final pass once you've finalized this bit but LGTM for now.

I like Getter method approach the most - passing a key when calling NextKV (which returns a key) seems wrong.

pkg/storage/col_mvcc.go line 124 at r10 (raw file):

Previously, sumeerbhola wrote…

was this supposed to say FirstKeyOfRowGetter?

Yes, this was a leftover, fixed.

pkg/storage/pebble_mvcc_scanner.go line 76 at r6 (raw file):

Previously, sumeerbhola wrote…

To me saying "before calling put() with the same key" includes both of the cases - i.e. the current getOne call hasn't called put with this key, nor the previous getOne calls have - which leads to this correct assumption

getOne could call put with key a and then enquire about key b, even though it has no intention of call put with key b in this call. Seems something that is permitted by this contract.
Even if it was not permitted, anything like this that is not obvious is worth spelling out in a code comment so that future readers can understand this super quickly.
Can you add something here about both the contract and how it implies that nothing is buffered in the singleResults implementation of results. I don't mind comments here that refer to any of the implementations since it helps folks understand why these constraints are being defined. Abstraction is good in terms of having clean code but we often want to understand why this abstraction its various invariants.

Thanks, makes sense, added this information into the comments.

DrewKimball

Nicely done!

Reviewed 8 of 22 files at r7, 10 of 10 files at r11, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2, @sumeerbhola, and @yuzefovich)

yuzefovich · 2023-01-24T04:42:41Z

I've got some performance numbers to share.

The microbenchmarks are as expected:

Count, SkipScan, and WideTableIgnoreColumns see a speed up for separate-process tenants. These benchmarks select only a subset of needed columns, so we're able to significantly reduce the amount of data to be transferred across process boundaries.
Scan and ScanFilter select all of the columns in the table, so we see a minor performance hit.
in-process tenants also see a minor performance hit. This will be improved in storage,colfetcher: implement local fast-path for COL_BATCH_RESPONSE #95033 where we will introduce a local fastpath that will avoid serialization of coldata.Batches
ScanFilter/limit=50 has the largest regression probably because we don't (yet) propagate the estimated row count. This should be fixed in 23.1.

TPCH numbers though aren't as good:

Q1:	before: 11.47s	after: 8.84s	 -22.89%
Q2:	before: 0.41s	after: 0.29s	 -27.71%
Q3:	before: 7.89s	after: 9.68s	 22.63%
Q4:	before: 4.48s	after: 4.52s	 0.86%
Q5:	before: 10.39s	after: 10.35s	 -0.29%
Q6:	before: 33.57s	after: 33.41s	 -0.48%
Q7:	before: 23.82s	after: 23.81s	 -0.02%
Q8:	before: 3.78s	after: 3.76s	 -0.68%
Q9:	before: 28.15s	after: 28.03s	 -0.42%
Q10:	before: 5.00s	after: 4.98s	 -0.42%
Q11:	before: 2.44s	after: 2.44s	 0.22%
Q12:	before: 34.78s	after: 34.65s	 -0.37%
Q13:	before: 3.20s	after: 2.94s	 -8.28%
Q14:	before: 3.13s	after: 3.21s	 2.43%
Q15:	before: 16.80s	after: 16.73s	 -0.38%
Q16:	before: 1.60s	after: 1.65s	 2.96%
Q17:	before: 0.85s	after: 0.96s	 13.04%
Q18:	before: 16.39s	after: 15.47s	 -5.61%
Q19:	before: 13.76s	after: 13.01s	 -5.45%
Q20:	before: 55.33s	after: 55.12s	 -0.38%
Q21:	before: 24.31s	after: 24.31s	 -0.00%
Q22:	before: 1.28s	after: 1.41s	 10.26%

This is a bit disappointing given the original very good numbers some time ago in the prototype. I did a bisect last week and couldn't reproduce those results. I'll do another bisect tomorrow, but most likely I messed something up when obtaining those numbers (although I do remember collecting them twice).

sumeerbhola

Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @yuzefovich)

pkg/storage/pebble_mvcc_scanner.go line 80 at r11 (raw file):

	// returned on the NextKVer.NextKV call). Therefore, the singleResults can
	// make a determination on its own whether the given key belongs to the
	// first SQL row.

Now that singleResults is making a determination on its own, this constraint is not needed, yes?
It copies the first row eagerly in put, so even if that KV is still buffered, it is ok.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)

pkg/storage/pebble_mvcc_scanner.go line 80 at r11 (raw file):

Previously, sumeerbhola wrote…

Now that singleResults is making a determination on its own, this constraint is not needed, yes?
It copies the first row eagerly in put, so even if that KV is still buffered, it is ok.

Yes, you're right. I tightened the contract here a bit further since continuesFirstRow is only called if put has already been called at least once.

yuzefovich · 2023-01-24T22:16:42Z

For reference, with the local fastpath for in-memory tenants (WIP in #95033) the benchmarks are here (the comparison is master vs this PR + #95033 with direct columnar scans always enabled). They show that for in-memory tenants (this is how UA will be achieved for self-hosted customers) the KV projection pushdown work already has benefits in some scenarios, and I hope with more targeted optimization work we can improve there noticeably.

yuzefovich · 2023-01-25T01:08:04Z

Finally, I was able to reproduce the "great" results I obtained on the prototype two months ago. In short, they were due to a subtle bug in col_mvcc.go which didn't impact the correctness.

In particular, as can be seen in #95793 (ignoring the last commit) the prototype had the following:

	//res.NumKeys = mvccScanner.results.count
	//res.NumBytes = mvccScanner.results.bytes

after processing the Scan request via the direct columnar scans. As a result, after evaluating the Scan we didn't update limits of the BatchRequest (namely MaxSpanRequestKeys and TargetBytes) in DistSender.divideAndSendBatchToRanges which meant that these limits were applied on a per-ScanRequest basis rather than per-BatchRequest basis. This meant that we had to do fewer KV calls from the SQL land which improved the performance. The downside though is that the stability suffered since we could now receive a BatchResponse that exceeded TargetBytes significantly. This wasn't caught since the query results were still correct, we were effectively issuing BatchRequests with larger limits. I didn't run any "stability" tests on the prototype that would catch this.

Once this bug is fixed (i.e. the last commit in #95793 is applied), we get the results that show effectively no difference of the prototype on TPCH queries:

Q1:	before: 8.57s	after: 8.41s	 -1.84%
Q2:	before: 0.29s	after: 0.29s	 1.39%
Q3:	before: 9.55s	after: 9.68s	 1.35%
Q4:	before: 4.45s	after: 4.49s	 0.92%
Q5:	before: 10.39s	after: 10.35s	 -0.32%
Q6:	before: 33.58s	after: 33.46s	 -0.34%
Q7:	before: 23.82s	after: 23.82s	 0.02%
Q8:	before: 3.79s	after: 3.76s	 -0.74%
Q9:	before: 28.13s	after: 28.01s	 -0.45%
Q10:	before: 4.99s	after: 4.98s	 -0.32%
Q11:	before: 2.44s	after: 2.45s	 0.29%
Q12:	before: 34.75s	after: 34.63s	 -0.35%
Q13:	before: 3.15s	after: 2.93s	 -6.93%
Q14:	before: 3.27s	after: 3.24s	 -1.10%
Q15:	before: 16.82s	after: 16.73s	 -0.52%
Q16:	before: 1.80s	after: 1.71s	 -5.00%
Q17:	before: 1.01s	after: 0.94s	 -6.55%
Q18:	before: 30.14s	after: 28.46s	 -5.56%
Q19:	before: 13.77s	after: 13.01s	 -5.51%
Q20:	before: 55.32s	after: 55.12s	 -0.36%
Q21:	before: 24.32s	after: 24.33s	 0.05%
Q22:	before: 1.48s	after: 1.46s	 -1.69%

(It's interesting to observe that on the current polished PR there are some noticeable improvements and noticeable regressions, I haven't explored this in-depth though, and will do so in the next milestone.)

This begs the question whether we should increase these limits since it has such non-trivial impact on the performance (TBD whether this impact is more pronounced in the multi-tenant setups and whether the usage of the direct columnar scans play a role too). Perhaps, if we set GOMEMLIMIT on default, increasing these limits (I'm thinking mostly about the default value of TargetBytes which is currently 10MiB) will not significantly impact the stability. However, this is an orthogonal question.

Thus, I think there are no more blockers for this PR to be merged. It shows an expected improvement in microbenchmarks when only a subset of columns is needed in multi-tenant separate-process setups. I'll wait for Michael's approval.

DrewKimball · 2023-01-25T01:17:15Z

This begs the question whether we should increase these limits since it has such non-trivial impact on the performance (TBD whether this impact is more pronounced in the multi-tenant setups and whether the usage of the direct columnar scans play a role too). Perhaps, if we set GOMEMLIMIT on default, increasing these limits (I'm thinking mostly about the default value of TargetBytes which is currently 10MiB) will not significantly impact the stability. However, this is an orthogonal question.

Maybe this isn't the best place to discuss, but I will anyway :) Some memory usage in queries is hard to throttle up and down, but it seems like batch sizing isn't one of those things - the ResetMaybeReallocate stuff we already have does this pretty naturally. Maybe we could have some visibility at the operator level of how much memory is being used at a higher level, e.g. by all SQL queries, or maybe just by the current query. We could increase the actual limit from the default when overall memory usage is low, and decrease again when resource use is high (like would be the case for many concurrent queries).

This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans and ReverseScans which results only in needed columns to be returned from the KV server. In other words, this commit introduces the ability to perform the KV projection pushdown. The main idea of this feature is to use the injected decoding logic from SQL in order to process each KV and keep only the needed parts (i.e. necessary SQL columns). Those needed parts are then propagated back to the KV client as coldata.Batch'es (serialized in the Apache Arrow format). Here is the outline of all components involved: ┌────────────────────────────────────────────────┐ │ SQL │ │________________________________________________│ │ colfetcher.ColBatchDirectScan │ │ │ │ │ ▼ │ │ row.txnKVFetcher │ │ (behind the row.KVBatchFetcher interface) │ └────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────┐ │ KV Client │ └────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────┐ │ KV Server │ │________________________________________________│ │ colfetcher.cFetcherWrapper │ │ (behind the storage.CFetcherWrapper interface) │ │ │ │ │ ▼ │ │ colfetcher.cFetcher │ │ │ │ │ ▼ │ │ storage.mvccScanFetchAdapter ────────┐│ │ (behind the storage.NextKVer interface) ││ │ │ ││ │ ▼ ││ │ storage.pebbleMVCCScanner ││ │ (which put's KVs into storage.singleResults) <┘│ └────────────────────────────────────────────────┘ On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with the `COL_BATCH_RESPONSE` format and returns the response (which contains the columnar data) to the `colfetcher.ColBatchDirectScan`. On the KV server side, we create a `storage.CFetcherWrapper` that asks the `colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn, fetches the next KV, decodes it, and keeps only values for the needed SQL columns, discarding the rest of the KV. The KV is emitted by the `mvccScanFetchAdapter` which - via the `singleResults` struct - exposes access to the current KV that the `pebbleMVCCScanner` is pointing at. Note that there is an additional "implicit synchronization" between components that is not shown on this diagram. In particular, `storage.singleResults.maybeTrimPartialLastRow` must be in sync with the `colfetcher.cFetcher` which is achieved by - the `cFetcher` exposing access to the first key of the last incomplete SQL row via the `FirstKeyOfRowGetter`, - the `singleResults` using that key as the resume key for the response, - and the `cFetcher` removing that last partial SQL row when `NextKV()` returns `partialRow=true`. This "upstream" link (although breaking the layering a bit) allows us to avoid a performance penalty for handling the case with multiple column families. (This case is handled by the `storage.pebbleResults` via tracking offsets into the `pebbleResults.repr`.) This code structure deserves some elaboration. First, there is a mismatch between the "push" mode in which the `pebbleMVCCScanner` operates and the "pull" mode that the `NextKVer` exposes. The adaption between two different modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control returns to it) the current unstable KV pair from the `singleResults` struct which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into. Second, in order be able to use the unstable KV pair without performing a copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when the next KV pair is needed. There are multiple scenarios which are currently not supported: - SQL cannot issue Get requests (likely will support in 23.1) - `TraceKV` option is not supported (likely will support in 23.1) - user-defined types other than enums are not supported (will _not_ support in 23.1) - non-default key locking strength as well as SKIP LOCKED wait policy are not supported (will _not_ support in 23.1). The usage of this feature is currently disabled by default, but I intend to enable it by default for multi-tenant setups. The rationale is that currently there is a large performance hit when enabling it for single-tenant deployments whereas it offers significant speed up in the multi-tenant world. The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719) the expected improvement in multi-tenant setups when the tenant runs in a separate process whenever we don't need to decode all of the columns from the table. The TPCH numbers, though, don't show the expected speedup: ``` Q1: before: 11.47s after: 8.84s -22.89% Q2: before: 0.41s after: 0.29s -27.71% Q3: before: 7.89s after: 9.68s 22.63% Q4: before: 4.48s after: 4.52s 0.86% Q5: before: 10.39s after: 10.35s -0.29% Q6: before: 33.57s after: 33.41s -0.48% Q7: before: 23.82s after: 23.81s -0.02% Q8: before: 3.78s after: 3.76s -0.68% Q9: before: 28.15s after: 28.03s -0.42% Q10: before: 5.00s after: 4.98s -0.42% Q11: before: 2.44s after: 2.44s 0.22% Q12: before: 34.78s after: 34.65s -0.37% Q13: before: 3.20s after: 2.94s -8.28% Q14: before: 3.13s after: 3.21s 2.43% Q15: before: 16.80s after: 16.73s -0.38% Q16: before: 1.60s after: 1.65s 2.96% Q17: before: 0.85s after: 0.96s 13.04% Q18: before: 16.39s after: 15.47s -5.61% Q19: before: 13.76s after: 13.01s -5.45% Q20: before: 55.33s after: 55.12s -0.38% Q21: before: 24.31s after: 24.31s -0.00% Q22: before: 1.28s after: 1.41s 10.26% ``` Note that much better numbers were obtained on a buggy prototype. Those numbers are invalid, and more details can be found [here](cockroachdb#94438 (comment)). At the moment, `coldata.Batch` that is included into the response is always serialized into the Arrow format, but I intend to introduce the local fastpath to avoid that serialization. That work will be done in a follow-up and should be able to reduce the perf hit for single-tenant deployments. A quick note on the TODOs sprinkled in this commit: - `TODO(yuzefovich)` means that this will be left for 23.2 or later. - `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1. A quick note on testing: this commit randomizes the fact whether the new infrastructure is used in almost all test builds. Introducing some unit testing (say, in `storage` package) seems rather annoying since we must create keys that are valid SQL keys (i.e. have TableID / Index ID prefix) and need to come with the corresponding `fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems ok to me given that the "meat" of the work there is still done by the `pebbleMVCCScanner` which is exercised using the regular Scans. End-to-end testing is well covered by all of our existing tests which now runs randomly. I did run the CI multiple times with the new feature enabled by default with no failure, so I hope that it shouldn't become flaky. Release note: None

yuzefovich · 2023-01-25T04:02:38Z

Some memory usage in queries is hard to throttle up and down, but it seems like batch sizing isn't one of those things - the ResetMaybeReallocate stuff we already have does this pretty naturally. Maybe we could have some visibility at the operator level of how much memory is being used at a higher level, e.g. by all SQL queries, or maybe just by the current query. We could increase the actual limit from the default when overall memory usage is low, and decrease again when resource use is high (like would be the case for many concurrent queries).

This sounds reasonable. However, setting reasonable value for GOMEMLIMIT by default seems like a must-have prerequisite to me for doing any of this work, so I don't think we'll be able to do anything here in 23.1.

Also, I just ran some quick benchmarks (using tpchvec/bench and multitenant/tpch) while varying the default value for rowinfra.defaultBatchBytesLimitProductionValue (which is where 10MiB default comes from), and they didn't show significant improvement when bumping it up to 20MiB or 100MiB, neither in single-tenant nor in multi-tenant world. Perhaps my explanation in #94438 (comment) is incomplete. This means that further investigation is needed to understand how exactly not setting NumKeys and/or NumBytes leads to the perf speedup.

Still, we now know for sure how those "great" numbers from the prototype were obtained.

michae2

Nice work! Thanks for waiting for me. Thank you for reworking those layer violations.

Interesting about the NumKeys / NumBytes omission causing the performance difference. Seems really promising, if we can figure out what else they affect besides BatchRequest limits.

Reviewed 2 of 55 files at r1, 2 of 15 files at r5, 11 of 22 files at r7, 7 of 10 files at r11, 2 of 3 files at r12, 10 of 11 files at r13, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @sumeerbhola, and @yuzefovich)

pkg/sql/vars.go line 490 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

My current understanding of the situation is that we no longer allow cluster settings with sql.defaults prefix and perhaps we stir away engineers from introducing public cluster settings that also have corresponding session variables. In this case, this is a private undocumented setting, and I think it's ok to keep it that way.

That said, I did ask in slack, will see what SQL Session team recommends.

TIL about ALTER ROLE not working for root. Ok, this is fine.

pkg/storage/col_mvcc.go line 229 at r13 (raw file):

		// The KV wasn't added for whatever reason (e.g. it could have been
		// skipped over due to having been deleted), so just move on.
		return f.NextKV(ctx, mvccDecodingStrategy)

nit: Using recursion here makes me a little nervous. I don't think Go has tail-call optimization, so the stack could grow deep if we have to skip over many KVs. Instead of recursion could we use either goto or a loop?

This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans and ReverseScans which results only in needed columns to be returned from the KV server. In other words, this commit introduces the ability to perform the KV projection pushdown. The main idea of this feature is to use the injected decoding logic from SQL in order to process each KV and keep only the needed parts (i.e. necessary SQL columns). Those needed parts are then propagated back to the KV client as coldata.Batch'es (serialized in the Apache Arrow format). Here is the outline of all components involved: ┌────────────────────────────────────────────────┐ │ SQL │ │________________________________________________│ │ colfetcher.ColBatchDirectScan │ │ │ │ │ ▼ │ │ row.txnKVFetcher │ │ (behind the row.KVBatchFetcher interface) │ └────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────┐ │ KV Client │ └────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────┐ │ KV Server │ │________________________________________________│ │ colfetcher.cFetcherWrapper │ │ (behind the storage.CFetcherWrapper interface) │ │ │ │ │ ▼ │ │ colfetcher.cFetcher │ │ │ │ │ ▼ │ │ storage.mvccScanFetchAdapter ────────┐│ │ (behind the storage.NextKVer interface) ││ │ │ ││ │ ▼ ││ │ storage.pebbleMVCCScanner ││ │ (which put's KVs into storage.singleResults) <┘│ └────────────────────────────────────────────────┘ On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with the `COL_BATCH_RESPONSE` format and returns the response (which contains the columnar data) to the `colfetcher.ColBatchDirectScan`. On the KV server side, we create a `storage.CFetcherWrapper` that asks the `colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn, fetches the next KV, decodes it, and keeps only values for the needed SQL columns, discarding the rest of the KV. The KV is emitted by the `mvccScanFetchAdapter` which - via the `singleResults` struct - exposes access to the current KV that the `pebbleMVCCScanner` is pointing at. Note that there is an additional "implicit synchronization" between components that is not shown on this diagram. In particular, `storage.singleResults.maybeTrimPartialLastRow` must be in sync with the `colfetcher.cFetcher` which is achieved by - the `cFetcher` exposing access to the first key of the last incomplete SQL row via the `FirstKeyOfRowGetter`, - the `singleResults` using that key as the resume key for the response, - and the `cFetcher` removing that last partial SQL row when `NextKV()` returns `partialRow=true`. This "upstream" link (although breaking the layering a bit) allows us to avoid a performance penalty for handling the case with multiple column families. (This case is handled by the `storage.pebbleResults` via tracking offsets into the `pebbleResults.repr`.) This code structure deserves some elaboration. First, there is a mismatch between the "push" mode in which the `pebbleMVCCScanner` operates and the "pull" mode that the `NextKVer` exposes. The adaption between two different modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control returns to it) the current unstable KV pair from the `singleResults` struct which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into. Second, in order be able to use the unstable KV pair without performing a copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when the next KV pair is needed. There are multiple scenarios which are currently not supported: - SQL cannot issue Get requests (likely will support in 23.1) - `TraceKV` option is not supported (likely will support in 23.1) - user-defined types other than enums are not supported (will _not_ support in 23.1) - non-default key locking strength as well as SKIP LOCKED wait policy are not supported (will _not_ support in 23.1). The usage of this feature is currently disabled by default, but I intend to enable it by default for multi-tenant setups. The rationale is that currently there is a large performance hit when enabling it for single-tenant deployments whereas it offers significant speed up in the multi-tenant world. The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719) the expected improvement in multi-tenant setups when the tenant runs in a separate process whenever we don't need to decode all of the columns from the table. The TPCH numbers, though, don't show the expected speedup: ``` Q1: before: 11.47s after: 8.84s -22.89% Q2: before: 0.41s after: 0.29s -27.71% Q3: before: 7.89s after: 9.68s 22.63% Q4: before: 4.48s after: 4.52s 0.86% Q5: before: 10.39s after: 10.35s -0.29% Q6: before: 33.57s after: 33.41s -0.48% Q7: before: 23.82s after: 23.81s -0.02% Q8: before: 3.78s after: 3.76s -0.68% Q9: before: 28.15s after: 28.03s -0.42% Q10: before: 5.00s after: 4.98s -0.42% Q11: before: 2.44s after: 2.44s 0.22% Q12: before: 34.78s after: 34.65s -0.37% Q13: before: 3.20s after: 2.94s -8.28% Q14: before: 3.13s after: 3.21s 2.43% Q15: before: 16.80s after: 16.73s -0.38% Q16: before: 1.60s after: 1.65s 2.96% Q17: before: 0.85s after: 0.96s 13.04% Q18: before: 16.39s after: 15.47s -5.61% Q19: before: 13.76s after: 13.01s -5.45% Q20: before: 55.33s after: 55.12s -0.38% Q21: before: 24.31s after: 24.31s -0.00% Q22: before: 1.28s after: 1.41s 10.26% ``` Note that much better numbers were obtained on a buggy prototype. Those numbers are invalid, and more details can be found [here](cockroachdb#94438 (comment)). At the moment, `coldata.Batch` that is included into the response is always serialized into the Arrow format, but I intend to introduce the local fastpath to avoid that serialization. That work will be done in a follow-up and should be able to reduce the perf hit for single-tenant deployments. A quick note on the TODOs sprinkled in this commit: - `TODO(yuzefovich)` means that this will be left for 23.2 or later. - `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1. A quick note on testing: this commit randomizes the fact whether the new infrastructure is used in almost all test builds. Introducing some unit testing (say, in `storage` package) seems rather annoying since we must create keys that are valid SQL keys (i.e. have TableID / Index ID prefix) and need to come with the corresponding `fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems ok to me given that the "meat" of the work there is still done by the `pebbleMVCCScanner` which is exercised using the regular Scans. End-to-end testing is well covered by all of our existing tests which now runs randomly. I did run the CI multiple times with the new feature enabled by default with no failure, so I hope that it shouldn't become flaky. Release note: None

Sumeer gave LGTM

yuzefovich

I needed to adjust one of the upgrade tests which was incorrectly written.

Thanks for the reviews everyone!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)

pkg/storage/col_mvcc.go line 229 at r13 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Using recursion here makes me a little nervous. I don't think Go has tail-call optimization, so the stack could grow deep if we have to skip over many KVs. Instead of recursion could we use either goto or a loop?

I didn't see any usages of goto in our codebase, so I opted for the loop. However, I'm a bit concerned that the loop will have noticeable performance impact, so I left a TODO to explore this - if it does have a noticeable impact, then I'd rather keep the recursion since for the recursion to lead to a crash we'd need like 1 billion KVs to be skipped which I'm assuming should not be possible if the compaction are working as intended.

craig · 2023-01-25T20:33:27Z

Build succeeded:

Bazel Essential CI (Cockroach)

yuzefovich force-pushed the kv-pushdown-all branch 17 times, most recently from 186d49a to a284dbb Compare January 5, 2023 19:27

yuzefovich force-pushed the kv-pushdown-all branch 7 times, most recently from 07f8ff0 to f2fc46d Compare January 6, 2023 20:37

yuzefovich mentioned this pull request Jan 6, 2023

server: separate config defaults for serverless vs dedicated/SH secondary tenants #94856

Closed

yuzefovich force-pushed the kv-pushdown-all branch 4 times, most recently from d24e48c to 26940e1 Compare January 6, 2023 22:33

yuzefovich force-pushed the kv-pushdown-all branch from 3cb7225 to c6a32ca Compare January 21, 2023 05:22

yuzefovich commented Jan 21, 2023

View reviewed changes

DrewKimball reviewed Jan 21, 2023

View reviewed changes

sumeerbhola reviewed Jan 23, 2023

View reviewed changes

yuzefovich force-pushed the kv-pushdown-all branch from c6a32ca to 00362eb Compare January 23, 2023 20:02

yuzefovich commented Jan 23, 2023

View reviewed changes

DrewKimball approved these changes Jan 23, 2023

View reviewed changes

sumeerbhola reviewed Jan 24, 2023

View reviewed changes

yuzefovich force-pushed the kv-pushdown-all branch from 00362eb to cbb68d0 Compare January 24, 2023 19:56

yuzefovich commented Jan 24, 2023

View reviewed changes

This was referenced Jan 25, 2023

[DNM] KV pushdown bisect #95793

Closed

WIP on KV projection pushdown #92294

Closed

yuzefovich force-pushed the kv-pushdown-all branch from cbb68d0 to 2ae0116 Compare January 25, 2023 03:57

michae2 approved these changes Jan 25, 2023

View reviewed changes

yuzefovich force-pushed the kv-pushdown-all branch from 2ae0116 to 556cb98 Compare January 25, 2023 18:01

yuzefovich requested a review from a team January 25, 2023 18:01

yuzefovich commented Jan 25, 2023

View reviewed changes

craig bot merged commit 2105cd4 into cockroachdb:master Jan 25, 2023

yuzefovich deleted the kv-pushdown-all branch January 25, 2023 20:34

renatolabs mentioned this pull request Jan 26, 2023

sql: data race introduced in #94438 #95937

Closed

yuzefovich mentioned this pull request Jan 26, 2023

[DNM] sql,storage: add support for COL_BATCH_RESPONSE scan format #96015

Closed

Conversation

yuzefovich commented Dec 29, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Dec 29, 2022

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

DrewKimball left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

DrewKimball left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich commented Jan 24, 2023

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich commented Jan 24, 2023

Uh oh!

yuzefovich commented Jan 25, 2023

Uh oh!

DrewKimball commented Jan 25, 2023

Uh oh!

yuzefovich commented Jan 25, 2023

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

craig bot commented Jan 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

yuzefovich commented Dec 29, 2022 •

edited

Loading