storage: BeginTxnRequest can be delayed long enough to expire #23945

@nvb

One theory explaining what was going wrong in #20448 led me to realize that it is possible for a BeginTransactionRequest to create a transaction record that is immediately aborted after it is written. This is because transaction records can be aborted if they are found to be expired. The expiration check looks at the transaction record's LastActive timestamp. This timestamp is the max of the transaction's OrigTimestamp and its LastHeartbeat timestamp.

The problem arises when a batch containing a BeginTxnRequest is delayed. This can happen for any number of reasons, but one especially interesting case is when it is delayed because a later write in the batch hits a WriteIntentError. In this case, the BeginTxnRequest may end up waiting in the txnwait queue for other transactions to finish. In a highly contended scenario, it may end up waiting in this queue multiple times for multiple other transactions. If it waits for too long, then by the time it finally succeeds in writing its transaction record, the record may already look like it's expired, because its LastActive timestamp is still the transaction's OrigTimestamp. This opens up a window for other transactions to abort it before the response can reach its transaction coordinator and the coordinator can begin its heartbeat loop. In a highly contended workload, this could theoretically spiral out of control and block all forward progress.
