
kv: speed up the split trigger, don't hold latches while computing stats #22348

@tbg

Description


We want to support (and possibly default to) ranges much larger than our current default of 64 MiB. Part of this is improving snapshots, but another problem is that splitting a large range is O(range size) due to the linear scan needed to recompute the MVCC stats of one half. Even worse, the split blocks pretty much all read/write access to the range while it runs, which is not ideal today but will get a lot worse when more data (rows) is affected by the blockage and the trigger takes longer at the same time.

This is mostly a semi-organized brain dump. I think there are two things we should do: first, make the trigger a lot faster, and second (and less important once the first is done), make the split allow more concurrent commands.

Take the stats computation out of the split trigger.

We could accept that the stats are not accurate. Then we could recompute the stats before running the split trigger, simply pass MVCCStats in, and mark the halves as needing a recomputation (we have this machinery already). The upside is that this is easy; the downside is that we rely on MVCCStats' accuracy for rebalancing and potentially other decisions.
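To make the "accept inaccuracy" option concrete, here is a toy sketch (hypothetical names; not the real storage API): the split receives a precomputed, possibly stale total, divides it without scanning, and flags both halves so a background job recomputes them later.

```go
package main

import "fmt"

// MVCCStats is a toy stand-in for the real stats struct; the
// ContainsEstimates flag marks stats that need background recomputation.
type MVCCStats struct {
	LiveBytes         int64
	ContainsEstimates bool
}

// splitStatsEstimated hands a precomputed (possibly stale) total to the
// split and marks both halves as estimates, instead of running an
// O(range size) scan inside the split trigger.
func splitStatsEstimated(total MVCCStats) (lhs, rhs MVCCStats) {
	// Naive halving as a placeholder; accuracy is restored later by the
	// asynchronous recomputation machinery.
	lhs = MVCCStats{LiveBytes: total.LiveBytes / 2, ContainsEstimates: true}
	rhs = MVCCStats{LiveBytes: total.LiveBytes - lhs.LiveBytes, ContainsEstimates: true}
	return lhs, rhs
}

func main() {
	l, r := splitStatsEstimated(MVCCStats{LiveBytes: 101})
	fmt.Println(l.LiveBytes, r.LiveBytes, l.ContainsEstimates)
}
```

The trade-off shows up in anything that reads the stats (rebalancing heuristics, zone-size checks) between the split and the recomputation.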

Or we could do more work and make the LHS/RHS stats available in the split trigger. The basic idea is to trigger a stats recomputation for the LHS with a custom Raft command, and maintain the MVCCStats deltas of updates to the LHS and RHS separately. This can be done without declaring any keys (i.e. without making anything wait). Then, once the base computation is done, commit the split (supplying it with the base computation, and letting it access the accumulated LHS/RHS deltas from the Range state).
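The delta-tracking idea can be sketched as follows (again with hypothetical names, not the real storage API): once the background checkpoint has been kicked off, each subsequent write's stats contribution is binned by split key and accumulated on the appropriate side, with no key declarations needed.

```go
package main

import "fmt"

// statsDelta is a toy accumulator for the MVCC stats contribution of
// writes evaluated after the checkpoint was taken.
type statsDelta struct{ bytes int64 }

// splitAccumulator bins post-checkpoint write deltas into LHS/RHS; the
// split trigger later combines checkpoint + delta per side.
type splitAccumulator struct {
	splitKey string
	lhsDelta statsDelta
	rhsDelta statsDelta
}

// record attributes a write's stats delta to the correct side of the
// pending split. This bookkeeping declares no keys, so it makes nothing wait.
func (a *splitAccumulator) record(key string, bytes int64) {
	if key < a.splitKey {
		a.lhsDelta.bytes += bytes
	} else {
		a.rhsDelta.bytes += bytes
	}
}

func main() {
	a := &splitAccumulator{splitKey: "m"}
	a.record("a", 10) // lands left of the split key
	a.record("z", 5)  // lands right of it
	fmt.Println(a.lhsDelta.bytes, a.rhsDelta.bytes)
}
```

The sketch assumes every command can be binned to one side, which is exactly the complication the next paragraph addresses.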

The main complication with this is that commands can overlap the split key (and so can't be binned into LHS/RHS). I think there's something here where we can make splits three-phase:

  1. Update the meta records and insert the records that the split would, but with a flag that indicates that this is just an "accounting split". The idea is to make our routing layer not send requests that overlap the split point (I think adding two booleans provisional_{left,right} is enough). This also updates the range descriptor with the split key so that the range henceforth rejects commands overlapping the split. The downside is that various components that read meta records and make decisions based on them must know about the flag. This is hopefully not too onerous if RangeLookup is made aware of this complication.

  2. Run the stats checkpoint.

  3. Run the final split transaction, which updates the meta writes (making them non-provisional) and runs the split trigger (which now uses the checkpoint).

As an alternative to 1) we could also instruct the range to keep a RHS delta as it evaluates commands. This might be easier overall but requires some DistSender-like functionality to be made available to the storage layer.
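The three-phase flow above can be sketched with a toy descriptor (hypothetical fields, not the real meta-record schema): phase 1 writes provisional descriptors so routing stops sending straddling requests, phase 2 runs the stats checkpoint, and phase 3 commits the split and clears the provisional flags.

```go
package main

import "fmt"

// rangeDesc is a toy range descriptor; the provisional flag plays the
// role of provisional_{left,right} in the meta records.
type rangeDesc struct {
	start, end  string
	provisional bool
}

// accountingSplit is phase 1: write provisional descriptors so that
// requests overlapping splitKey are rejected or re-routed from here on.
func accountingSplit(d rangeDesc, splitKey string) (lhs, rhs rangeDesc) {
	lhs = rangeDesc{start: d.start, end: splitKey, provisional: true}
	rhs = rangeDesc{start: splitKey, end: d.end, provisional: true}
	return lhs, rhs
}

// commitSplit is phase 3: the final split txn makes the meta writes
// non-provisional and runs the trigger using the phase-2 checkpoint.
func commitSplit(lhs, rhs rangeDesc) (rangeDesc, rangeDesc) {
	lhs.provisional, rhs.provisional = false, false
	return lhs, rhs
}

func main() {
	l, r := accountingSplit(rangeDesc{start: "a", end: "z"}, "m")
	fmt.Println(l.provisional, r.provisional)
	l, r = commitSplit(l, r)
	fmt.Println(l.end, r.start, l.provisional)
}
```

Phase 2 (the checkpoint) is elided here; the point is only that between phases 1 and 3 no command can overlap the split key, so the delta bookkeeping stays well-defined.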

Declare fewer keys.

At the time of writing, the split declares a lot of keys:

  • RW all user keys in both RHS and LHS. For the LHS, I don't understand why that is (@bdarnell?); it seems that declaring a read should be enough to block concurrent writes (mod making sure there's proper locking as the tsCache gets copied over). And pulling on that thread, I think once we have the stats checkpoint we don't need to block writes to the LHS at all. For the RHS, the same applies, though after the split trigger applies they'll catch a mismatch error, making blocking perhaps a better option (or we're willing to add some extra complexity that re-routes those requests to the newly created RHS replica instead).
  • RW on LHS+RHS local keys (txn records, etc). The same logic as for the user keys should apply (I don't see a difference in how splits handle these).
  • RW on LHS rangeid-bound keys (RHS too, but nobody accesses that prior to the split, so we don't care to improve it). I can understand declaring the read as we copy a lot of information into the RHS (abort cache, GC threshold, etc) but it's not clear to me why we declare a write.
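To illustrate why downgrading RW to read-only declarations helps, here is a toy model of span declaration (hypothetical API, not the real spanset/latching code): overlapping spans must serialize unless both are declared read-only, so a read declaration on the LHS user keys would still block concurrent writers while admitting concurrent reads.

```go
package main

import "fmt"

// access models the declared access level of a command's key span.
type access int

const (
	readOnly access = iota
	readWrite
)

// span is a half-open key interval with a declared access level.
type span struct {
	start, end string
	acc        access
}

// conflicts reports whether two declared spans must serialize:
// overlapping spans conflict unless both are read-only.
func conflicts(a, b span) bool {
	overlap := a.start < b.end && b.start < a.end
	return overlap && (a.acc == readWrite || b.acc == readWrite)
}

func main() {
	splitDecl := span{"a", "m", readOnly} // split declares LHS user keys read-only
	read := span{"c", "d", readOnly}
	write := span{"c", "d", readWrite}
	fmt.Println(conflicts(splitDecl, read))  // concurrent reads proceed
	fmt.Println(conflicts(splitDecl, write)) // concurrent writes still blocked
}
```

With the stats checkpoint in place, even the write-blocking on the LHS may become unnecessary, per the first bullet above.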

PS: I think the comment on the above in declareKeysEndTransaction is inaccurate. @bdarnell I'm sure I'm missing some subtleties here.

Epic CRDB-34215
