storage: optimize the GC queue #20052
Description
Suggested work (in order)
- incremental updates of txn records' intents
- use Scans to "resolve" txn records' intents
- introduce push/resolve pipeline
- introduce key removal pipeline
Overall goal: make sure the GC queue performs "OK" in the presence of many
GC'able keys, very long keys (1MB++), many intents, long intent keys, many
transaction records, and transaction records with many contained intents.
See, for example, #19216 (comment), #18199 (comment).
Work chunking
Keys
The GC queue currently scans through all (versions of all) key-value pairs in
the Range in an attempt to identify those ripe for removal. The problem with
this is that the work done in assembling these keys may be in vain should the
context expire prematurely.
We should batch up and send out the requests as we go. The right metric to
decide whether a batch is worth sending is whether the "key bytes" (i.e. the sum
of lengths of (MVCC) keys which are to be deleted) exceeds a threshold (256kb
seems reasonable).
Intents
While scanning through the data, the process also attempts to resolve the
(non-recent) intents it encounters, which involves pushing the respective
transaction (with high priority) and then removing the intents. Currently, this
is similarly naive: it populates a map from transaction ID to list of intents,
and, after having iterated through the whole range, pushes the transactions in
parallel, and then passes all resolvable intents to the intentResolver (which
internally chunks and executes in parallel).
The push-resolve phase, though parallelized, blocks the queue from making any
other progress in the meantime, and in the worst case, the whole range is held
in memory (if the range is 100% old intents).
Here, similarly, transactions should be pushed once first encountered, and
intents staged for resolution, to be handled (in batches, this time limited by
max(threshold_num, threshold_bytes)) as their respective transaction's outcome
is known.
Transaction record cleanup
Problem: picture a transaction that wrote 10000 intents. The corresponding
transaction record will have 10000 key spans contained within. Assume for
drama that each of the key spans is actually a whole range (64mb worth of
data). A DELETE FROM x WHERE true might do such a thing if implemented
poorly.
- if the record is ABORTED, we can just remove it (though it would be nice to
  also remove some of the intents)
- if it is COMMITTED, we have to guarantee that the intents are removed
  before removing the txn record.
The latter case has the problem that resolving 10k intents can take a long time,
and that the duration of the queue run is constrained by a context (at the time
of writing, to one minute). We currently do the naive thing: try to resolve all
of the intents, then stage the record for deletion. The result will be that on
every GC run, we send some of the intent removals, but fail to remove the record
due to context cancellation. On the next GC run, we try again, and fail again,
etc. Even worse, due to that failure, we won't actually GC any of the keys
staged for deletion. Correspondingly, the GC queue score remains high, and we
loop (though the "Keys" section addresses this).
To address this:
- it is generally expected that most of the intents aren't there any more, so
  instead of ResolveIntent{,Range} we can just issue a Get or Scan, which is
  much cheaper due to not requiring Raft or the command queue in the case in
  which there is no intent.
- If there are intents, the intent-local intentResolver will resolve them for
  us.
- For aborted txns: can pass the offending txn record key to the "key deletion
  pipeline" (see "Keys" section above) right away. Resolving intents is
  optional and may even be a pessimization.
- Instead of sending a large batch for all (say) 10k entries, chunk them into
  (say) scans of at most 100 each (or 256kb in size), in sorted order (they
  may be sorted already; have to check).
- For committed txns: after each batch, update the transaction record with the
  intents removed (which could mean removing the entry altogether).
- we don't care about the results, so we could augment Scan with an optional
  flag to discard the results. This seems relatively important since wide key
  ranges (as in this example) would otherwise batch up gigabytes of data.
Out of scope
Triggering txn span cleanup
- we don't track any metrics on the size of the transaction span, so in theory
  there is potential for transactions to accumulate here that "never" get
  GC'ed, making the range large
- in practice, this seems rather difficult to achieve, so it doesn't seem like
  a priority and we can stick with cleaning out old transactions as a side
  effect of key/intent GC
- to do better, we would start tracking the number of transaction records in
  MVCCStats and make a new queue that only handles the transaction records
- actually would likely want to track sum(abort span, txn span)
Dealing with time constraints
The queues' processing of a single replica is currently time limited via context
cancellation. This doesn't allow introspection of the time remaining, which is
unfortunate. Instead, all queues' process method could accept an additional
timeout parameter (and preferably an unconstrained context). Most implementations
will just put the timeout on the context and work as usual, but the gc queue may
want to be smarter and leave sufficient time to "wrap up" before cancellation.
This may not be necessary if the GC queue stages all of its work in small chunks.