storage: optimize the GC queue #20052
Description
Suggested work (in order)
- incremental updates of txn records' intents
- use Scans to "resolve" txn records' intents
- introduce push/resolve pipeline
- introduce key removal pipeline
Overall goal: make sure the GC queue performs "OK" in the presence of many
GC'able keys, very long keys (1MB++), many intents, long intent keys, many
transaction records, and transaction records with many contained intents.
See, for example, #19216 (comment), #18199 (comment).
Work chunking
Keys
The GC queue currently scans through all (versions of all) key-value pairs in
the Range in an attempt to identify those ripe for removal. The problem with
this is that the work done in assembling these keys may be in vain should the
context expire prematurely.
We should batch up and send out the requests as we go. The right metric to
decide whether a batch is worth sending is whether the "key bytes" (i.e. the sum
of lengths of (MVCC) keys which are to be deleted) exceeds a threshold (256kb
seems reasonable).
Intents
While scanning through the data, the process also attempts to resolve the
(non-recent) intents it encounters, which involves pushing the respective
transaction (with high priority) and then removing the intents. Currently, this
is similarly naive: it populates a map from transaction ID to list of intents,
and, after having iterated through the whole range, pushes the transactions in
parallel, and then passes all resolvable intents to the intentResolver (which
internally chunks and executes in parallel).
The push-resolve phase, though parallelized, blocks the queue from making any
other progress in the meantime, and in the worst case, the whole range is held
in memory (if the range is 100% old intents).
Here, similarly, transactions should be pushed once first encountered, and
intents staged for resolution, to be handled (in batches, this time limited by
max(threshold_num, threshold_bytes)) as their respective transaction's outcome
is known.
Transaction record cleanup
Problem: picture a transaction that wrote 10000 intents. The corresponding
transaction record will have 10000 key spans contained within. Assume for
drama that each of the key spans is actually a whole range (64mb worth of
data). A DELETE FROM x WHERE true might do such a thing if implemented
poorly.
- if the record is ABORTED, we can just remove it (though it would be nice to
  also remove some of the intents)
- if it is COMMITTED, we have to guarantee that the intents are removed
  before removing the txn record.
The latter case has the problem that resolving 10k intents can take a long time,
and that the duration of the queue run is constrained by a context (at the time
of writing, to one minute). We currently do the naive thing: try to resolve all
of the intents, then stage the record for deletion. The result will be that on
every GC run, we send some of the intent removals, but fail to remove the record
due to context cancellation. On the next GC run, we try again, and fail again,
etc. Even worse, due to that failure, we won't actually GC any of the keys
staged for deletion. Correspondingly, the GC queue score remains high, and we
loop (though the "Keys" section addresses this).
To address this:
- it is generally expected that most of the intents aren't there any more, so
  instead of ResolveIntent{,Range} we can just issue a Get or Scan, which is
  much cheaper due to not requiring Raft or the command queue in the case in
  which there is no intent.
- If there are intents, the intent-local intentResolver will resolve them for
  us.
- For aborted txns: can pass the offending txn record key to the "key deletion
  pipeline" (see "Keys" section above) right away. Resolving intents is
  optional and may even be a pessimization.
- Instead of sending a large batch for all (say) 10k entries, chunk them into
  (say) scans of at most 100 each (or 256kb in size), in sorted order (they
  may be sorted already; have to check).
- For committed txns: after each batch, update the transaction record with the
  intents removed (which could mean removing the entry altogether).
- we don't care about the results, so we could augment Scan with an optional
  flag to discard the results. This seems relatively important since wide key
  ranges (as in this example) would otherwise batch up gigabytes of data.
Out of scope
Triggering txn span cleanup
- we don't track any metrics on the size of the transaction span, so in theory
  there is potential for transactions to accumulate here that "never" get
  GC'ed, making the range large
- in practice, this seems rather difficult to achieve, so it doesn't seem like
  a priority and we can stick with cleaning out old transactions as a side
  effect of key/intent GC
- to do better, we would start tracking the number of transaction records in
  MVCCStats and make a new queue that only handles the transaction records
- actually would likely want to track sum(abort span, txn span)
Dealing with time constraints
The queues' processing of a single replica is currently time limited via context
cancellation. This doesn't allow introspection of the time remaining, which is
unfortunate. Instead, all queues' process method could accept an additional
timeout parameter (and preferably an unconstrained context). Most implementations
will just put the timeout on the context and work as usual, but the gc queue may
want to be smarter and leave sufficient time to "wrap up" before cancellation.
This may not be necessary if the GC queue stages all of its work in small chunks.