
kv: closed timestamp can starve read-write txns that take longer than 600ms #51294

@nvb

Description


Now that we've dropped kv.closed_timestamp.target_duration down to 3s, we've seen an increase in read-write mutations that get starved by the closed timestamp. The earlier understanding was that these transactions would only be bumped by the closed timestamp if they took over 3s to commit, and that, if bumped, they would then have another 3s to refresh and complete their final batch. The first part is true, but the second part was a misunderstanding.

If a transaction takes more than 3s it will be bumped up to the current closed timestamp's value. It will then hit a retry error on its EndTxn batch. It will refresh at the old closed timestamp value and then try to commit again. If this commit is again bumped by the closed timestamp, we can enter a refresh loop that will terminate after 5 attempts and then be kicked back to the client. Since we're only ever refreshing up to the current closed timestamp, the transaction actually only has kv.closed_timestamp.target_duration * kv.closed_timestamp.close_fraction = 600ms to refresh and commit.

The easiest way to reproduce this is to perform a large DELETE statement in an implicit txn. If the DELETE takes long enough, it can get stuck in a starvation loop. This looks something like:

1. read all rows: Batch{Scan, Scan, ...}
2. try to delete all rows and commit: Batch{Delete, Delete, ..., EndTxn}
3. first two steps take over 3s so batch is pushed to `now - 3s` and hits a retry error when evaluating its EndTxn
4. refresh @ now - 3s
5. retry Batch{Delete, Delete, ..., EndTxn}
6. steps 4 and 5 took over 600ms so batch is pushed to `now - 3s`
7. retry up to 5 times before giving up and either returning the error to the client or retrying in conn_executor
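The loop above can be sketched as a toy model. Everything here is illustrative (the retry limit, durations, and names are taken from the description above, not from CRDB source):

```go
package main

import "fmt"

func main() {
	const maxRefreshAttempts = 5 // the refresh loop gives up after 5 attempts
	refreshWindow := 600         // ms between closed timestamp advances
	batchDuration := 900         // ms the retried Delete batch takes

	for attempt := 1; attempt <= maxRefreshAttempts; attempt++ {
		// The refresh at the old closed timestamp value succeeds, and the
		// batch would commit if it re-evaluated within the window...
		if batchDuration <= refreshWindow {
			fmt.Println("committed on attempt", attempt)
			return
		}
		// ...but it again takes longer than the window, so it is pushed by
		// the (now newer) closed timestamp and hits another retry error.
		fmt.Println("attempt", attempt, "pushed by closed timestamp again")
	}
	fmt.Println("giving up: return error to client or retry in conn_executor")
}
```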

This starvation is clearly undesirable, but it doesn't seem fundamental. There seem to be a few different things we could do here to improve the situation:

  1. if the closed timestamp bumped victims up to the present time instead of its current value, requests would be given more time to refresh and complete: 3s instead of 600ms. This would avoid a large class of these issues, in exchange for increasing the chance that refreshes fail under contention and increasing the contention footprint of transactions that get bumped by the closed timestamp.
  2. we noticed in storage: adaptive closed timestamp target duration #36478 that we don't need to respect the closed timestamp when re-writing intents. This means that we could avoid getting bumped by the closed timestamp in step 5 when retrying the batch of DELETEs. OR, we could revive an optimization we had previously removed and avoid issuing that prefix of the batch entirely after the refresh since it had already succeeded. We probably don't want to do this second part because it gets confusing and may not be as useful in the context of parallel commits.
  3. we also noticed in storage: adaptive closed timestamp target duration #36478 that EndTxn requests don't need to respect the closed timestamp. This means that we could ignore the closed timestamp entirely when issuing an EndTxn.

So in an optimal world, step 5 would be able to ignore the closed timestamp entirely as long as its intents were previously written in step 2.

Labels

A-kv-transactions: Relating to MVCC and the transactional model.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
