-
Notifications
You must be signed in to change notification settings - Fork 4.1k
storage: allow multiple intents #5861
Description
From an email thread with Andrew Kimball.
Andrew:
I've been taking a look at the Yabandeh paper and thinking about how it might affect the Cockroach algorithm. The paper does a good job explaining why WSI (write snapshot-isolation) is a useful concept, and why it guarantees serializable semantics. However, the paper argues that WSI can be as efficient as SI for real-world workloads. I feel skeptical about this claim, because SI only needs to abort transactions on write-write conflicts, whereas WSI needs to abort on read-write conflicts. Since read-write conflicts are more common, WSI is going to abort more.
I confess that I only skimmed the section in the paper about their implementation, as they use a centralized "status oracle". I don't think that's a good idea in a system like Cockroach, which should avoid single points of failure. Even if the status oracle consists of several "failure-resistant" machines linked by consensus, there are still delays during re-configuration that would affect all ongoing transactions. A large-scale distributed system should be built to gracefully degrade on failures rather than going immediately from 100% => 0%, as in that case. Also, in the case of nodes distributed over a wide geographic area (i.e. the globe), a central machine is going to greatly increase the overall latency of the system, as many nodes will be hundreds of milliseconds away from it. If the ranges I'm committing to are nearby, and if I find no conflicts, then I should be able to commit immediately, with no coordination necessary with machines elsewhere in the world.
I therefore conclude that this paper does not undermine the algorithm used by Cockroach for SI. I think it's still very much worthwhile to support SI, as there should be many fewer aborts in important real-world workloads. However, I do think that Cockroach's serializable algorithm could be improved, based on the concepts from the paper. Hopefully my knowledge of Cockroach's serializable algorithm is not too far out-of-date, otherwise this might not make much sense.
In particular, there is no need to check for write-write conflicts in serializable mode. So, if a txn is trying to lay down a write intent, it could simply lay it down right beside another write intent, rather than aborting the other txn. This is true as long as the candidate commit time of the txn is > the last time that row may have been read. For example, using the notation from the paper:
r1[x] w2[x] w1[x] c2 c1
In this case, since no conflict was registered (since write-write conflicts are ignored), txn1's commit timestamp would not be pushed forward in logical time, and so would simply equal the starting timestamp: Ts(txn1) = Tc(txn1). The same situation would be true for txn2. The equivalent serial history would be (assuming Ts(txn1) < Ts(txn2)):
r1[x] w1[x] c1 w2[x] c2
Happily, Cockroach's current serializable rules already prevent read-write conflicts. When it comes time to commit a txn, Cockroach already checks if the txn's commit time is different than its start time. If it is different, then it's possible that it may have read an older version of some value that another txn later modified. I believe Cockroach simply aborts the txn in that case. However, another possibility is remembering the read-set of the txn and then re-checking all previously read values to ensure they haven't been changed in the interval [Ts(txn), Tc(txn)]. If no changes have occurred, then it's safe to go forward with the commit.
One other interesting point. Consider History 6 in the paper, which will force an abort due to the rules of WSI:
H6: r1[x] r2[z] w2[x] w1[y] c2 c1
This is actually a serializable history, and does not need to be aborted:
H7: r1[x] w1[y] c1 r2[z] w2[x] c2
The rules that Cockroach use would not abort this. Here is the sequence:
- txn1 reads x at Ts(txn1).
- txn2 reads z at Ts(txn2), where we'll assume that Ts(txn2) > Ts(txn1).
- txn2 writes x at Ts(txn2). Because the timestamp of the write intent is > the timestamp of the last read of that row, no conflict occurs.
- txn1 writes y at Ts(txn1).
- txn2 tries to commit. Because Tc(txn2) = Ts(txn2), it commits at logical time Ts(txn2).
- txn1 tries to commit. Because Tc(txn1) = Ts(txn1), it commits at logical time Ts(txn1).
Thus we have arrived at the serializable history H7 with no abort. So while the Cockroach algorithm aborts cases that Yabandeh would not, there are cases where the reverse is true. Furthermore, Cockroach could avoid the other aborts if it remembered a transaction's read-set, as I suggest above as a possibility.
~Andy
P.S. I sometimes like to use an alternate history representation which factors in logical time as well as absolute (wall-clock) time:
t0: r1[x] w1[y] c1
t1: r2[z] w2[x] c2
In this notation, absolute time reads from left to right, and logical time reads from top to bottom. This makes it more clear that txn1 commits after txn2 in absolute time, but before it in logical time.
Jira issue: CRDB-6193