storage: optimize the GC queue #20052

@tbg

Description

Suggested work (in order)

  • incremental updates of txn records' intents
  • use Scans to "resolve" txn records' intents
  • introduce push/resolve pipeline
  • introduce key removal pipeline

Overall goal: make sure the GC queue performs "OK" in the presence of many
GC'able keys, very long keys (1MB++), many intents, long intent keys, many
transaction records, and transaction records with many contained intents.

See, for example, #19216 (comment), #18199 (comment).

Work chunking

Keys

The GC queue currently scans through all (versions of all) key-value pairs in
the Range in an attempt to identify those ripe for removal. The problem is
that the queue's processing is subject to context cancellation, so all of the
work done in assembling these keys is in vain should the context expire before
any requests are sent.

We should batch up and send out the requests as we go. The right metric to
decide whether a batch is worth sending is whether the "key bytes" (i.e. the sum
of lengths of (MVCC) keys which are to be deleted) exceeds a threshold (256kb
seems reasonable).
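The batching loop could look roughly like the following Go sketch. `batchKeys`, `flush`, and `gcBatchThreshold` are illustrative stand-ins (not the actual CockroachDB API); `flush` represents sending a GCRequest for the staged keys.

```go
package main

// gcBatchThreshold caps the staged key bytes per batch; the issue
// suggests 256kb as reasonable.
const gcBatchThreshold = 256 << 10

// batchKeys stages GC'able MVCC keys and flushes a batch whenever the
// accumulated key bytes exceed the threshold, so that partial work
// survives an early context cancellation.
func batchKeys(keys [][]byte, flush func(batch [][]byte)) {
	var batch [][]byte
	batchBytes := 0
	for _, k := range keys {
		batch = append(batch, k)
		batchBytes += len(k)
		if batchBytes >= gcBatchThreshold {
			flush(batch)
			batch, batchBytes = nil, 0
		}
	}
	if len(batch) > 0 {
		flush(batch) // final partial batch
	}
}
```

With this, a prematurely cancelled run still gets credit for every batch it managed to send, rather than losing the whole scan.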


Intents

While scanning through the data, the process also attempts to resolve the
(non-recent) intents it encounters, which involves pushing the respective
transaction (with high priority) and then removing the intents. Currently, this
is similarly naive: it populates a map from transaction ID to list of intents,
and, after having iterated through the whole range, pushes the transactions in
parallel, and then passes all resolvable intents to the intentResolver (which
internally chunks and executes in parallel).

The push-resolve phase, though parallelized, blocks the queue from making any
other progress in the meantime, and in the worst case the whole range is held
in memory (if the range is 100% old intents).

Here, similarly, transactions should be pushed once first encountered, and
intents staged for resolution, to be handled (in batches, this time limited by
max(threshold_num, threshold_bytes)) as their respective transaction's outcome
is known.
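A sketch of such a pipeline, assuming hypothetical `push` and `resolve` hooks standing in for PushTxn and the intentResolver (the thresholds and all names are illustrative):

```go
package main

// Illustrative limits; the issue suggests capping staged intents by
// max(threshold_num, threshold_bytes) rather than buffering the range.
const (
	maxStagedIntents = 100
	maxStagedBytes   = 256 << 10
)

type intent struct {
	txnID string
	key   []byte
}

// intentPipeline pushes each transaction the first time it is seen and
// stages its intents for resolution, flushing a batch whenever either
// limit is hit, instead of holding everything until the scan finishes.
type intentPipeline struct {
	pushed  map[string]bool
	staged  []intent
	bytes   int
	push    func(txnID string)
	resolve func(batch []intent)
}

func (p *intentPipeline) add(in intent) {
	if !p.pushed[in.txnID] {
		p.push(in.txnID) // push once, on first encounter
		p.pushed[in.txnID] = true
	}
	p.staged = append(p.staged, in)
	p.bytes += len(in.key)
	if len(p.staged) >= maxStagedIntents || p.bytes >= maxStagedBytes {
		p.flush()
	}
}

func (p *intentPipeline) flush() {
	if len(p.staged) > 0 {
		p.resolve(p.staged)
		p.staged, p.bytes = nil, 0
	}
}
```

The byte limit matters for the long-intent-key case: a handful of 1MB+ keys should trigger a flush just as surely as a hundred small ones.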

Transaction record cleanup

Problem: picture a transaction that wrote 10000 intents. The corresponding
transaction record will have 10000 key spans contained within. Assume for
drama that each of the key spans is actually a whole range (64mb worth of
data). A DELETE FROM x WHERE true might do such a thing if implemented
poorly.

  • if the record is ABORTED, we can just remove it (though it would be nice to
    also remove some of the intents)
  • if it is COMMITTED, we have to guarantee that the intents are removed
    before removing the txn record.

The latter case has the problem that resolving 10k intents can take a long time,
and that the duration of the queue run is constrained by a context (at the time
of writing, to one minute). We currently do the naive thing: try to resolve all
of the intents, then stage the record for deletion. The result will be that on
every GC run, we send some of the intent removals, but fail to remove the record
due to context cancellation. On the next GC run, we try again, and fail again,
etc. Even worse, due to that failure, we won't actually GC any of the keys
staged for deletion. Correspondingly, the GC queue score remains high, and we
loop (though the "Keys" section addresses this).

To address this:

  • it is generally expected that most of the intents aren't there any more, so instead
    of ResolveIntent{,Range} we can just issue a Get or Scan, which is much cheaper
    when there is no intent, since it requires neither Raft nor the command queue.
  • If there are intents, the intent-local intentResolver will resolve them for
    us.
  • For aborted txns: can pass the offending txn record key to the "key deletion pipeline"
    (see "Keys" section above) right away. Resolving intents is optional and may even be a
    pessimization.
  • Instead of sending a large batch for all (say) 10k entries, chunk them
    into (say) scans of at most 100 each (or 256kb in size), in sorted order (they
    may be sorted already; have to check).
  • For committed txns: after each batch, update the transaction record with the
    intents removed (which could mean removing the entry altogether).
  • we don't care about the results, so we could augment Scan with an optional flag to
    discard the results. This seems relatively important since wide key ranges (as
    in this example) would otherwise batch up gigabytes of data.
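The chunked cleanup for a committed record might be sketched as below. `cleanupCommitted`, `scan`, and `updateRecord` are hypothetical stand-ins: `scan` represents the cheap Scan (which resolves any leftover intents it hits), and `updateRecord` rewrites the txn record with the handled spans removed.

```go
package main

import "sort"

// maxSpansPerChunk caps how many of a committed transaction's intent
// spans are handled per batch (the issue suggests ~100, or 256kb).
const maxSpansPerChunk = 100

type span struct{ key string }

// cleanupCommitted sorts the record's spans and processes them in
// chunks: each chunk is scanned (resolving leftover intents), and the
// record is then trimmed so that a later cancellation only retries the
// remainder. Once zero spans remain, the record itself can be deleted.
func cleanupCommitted(spans []span, scan func([]span), updateRecord func(remaining []span)) {
	sort.Slice(spans, func(i, j int) bool { return spans[i].key < spans[j].key })
	for len(spans) > 0 {
		n := maxSpansPerChunk
		if len(spans) < n {
			n = len(spans)
		}
		scan(spans[:n])
		spans = spans[n:]
		updateRecord(spans)
	}
}
```

Unlike the current all-or-nothing approach, the 10k-intent record from the example above shrinks monotonically across GC runs even when each run is cut short.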

Out of scope

Triggering txn span cleanup

  • we don't track any metrics on the size of the transaction span, so in theory
    there is potential for transactions to accumulate here that "never" get GC'ed,
    making the range large
  • in practice, this seems rather difficult to achieve, so it doesn't seem like a
    priority and we can stick with cleaning out old transactions as a side
    effect of key/intent GC
  • to do better, we would start tracking the number of transaction records in
    MVCCStats and make a new queue that only handles the transaction records
  • actually would likely want to track sum(abort span, txn span)

Dealing with time constraints

The queues' processing of a single replica is currently time limited via context
cancellation. This doesn't allow introspection of the time remaining, which is
unfortunate. Instead, all queues' process method could accept an additional
timeout parameter (and preferably an unconstrained context). Most implementations
will just put the timeout on the context and work as usual, but the gc queue may
want to be smarter and leave sufficient time to "wrap up" before cancellation.

This may not be necessary if the GC queue stages all of its work in small chunks.

Metadata

Labels

S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting