kv: avoid thrashing and livelocking on multiple concurrent Range merge txns #59026
Conversation
Force-pushed 48d6ac3 to 646f0d2
Force-pushed 646f0d2 to cc05807
aayushshah15
left a comment
thanks for updating the tech-note along with the change! I know this is blocked on the other PR, I'll get to that ASAP.
Reviewed 1 of 3 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
docs/tech-notes/range-merges.md, line 252 at r3 (raw file):
As of v21.1, the merge transaction uses a locking read on the LHS range's transaction record as its first course of action. This eliminates thrashing and livelock in the face of concurrent merge attempts. It also has the effect of
Do you think it's worth pointing out that we only expect these concurrent merge attempts in testing scenarios (right?)
pkg/kv/kvserver/replica_command.go, line 2273 at r3 (raw file):
// The method allows callers to specify whether a locking read should be used or
// not. A locking read can be used to manage contention and avoid transaction
// restarts in a read-modify-write operation (which all users of this method
Do you think the part in parentheses could easily rot? If not, we could consider making this method always issue locking gets and remove the forUpdate option.
Force-pushed cc05807 to c3f328e
nvb
left a comment
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15)
docs/tech-notes/range-merges.md, line 252 at r3 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Do you think it's worth pointing out that we only expect these concurrent merge attempts in testing scenarios (right?)
Done.
pkg/kv/kvserver/replica_command.go, line 2273 at r3 (raw file):
Do you think the part in parentheses could easily rot?
This is a pretty specialized function, so I don't imagine it being used for other things.
If not, we could consider making this method always issue locking gets and remove the forUpdate option.
That's a good question. I actually had this written without the forUpdate option at first, but I decided not to change the behavior of all transactions that use it at the same time, especially when we're in the migration period where the recipient might not even respect the locking field in the GetRequest. I do think we will eventually want to switch over to that, though. I added a TODO.
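The forUpdate option discussed here can be sketched as follows. This is a hypothetical, self-contained shape (the interface and helper names are illustrative, not the real `conditionalGetDescValueFromDB` signature): callers opt in to a locking read, and the TODO marks the eventual migration to always locking.

```go
package main

import "fmt"

// txnGetter is a hypothetical stand-in for the slice of the kv.Txn API
// used here: GetForUpdate issues a locking read, Get an ordinary one.
type txnGetter interface {
	Get(key string) (string, error)
	GetForUpdate(key string) (string, error)
}

// getDescValue mirrors the shape under discussion: callers pass forUpdate
// to request a locking read during a read-modify-write.
// TODO: once all nodes respect the locking field in the request, always
// issue locking gets and drop the forUpdate parameter.
func getDescValue(txn txnGetter, key string, forUpdate bool) (string, error) {
	if forUpdate {
		return txn.GetForUpdate(key)
	}
	return txn.Get(key)
}

// mapTxn is a trivial in-memory implementation for demonstration.
type mapTxn map[string]string

func (m mapTxn) Get(k string) (string, error)          { return m[k], nil }
func (m mapTxn) GetForUpdate(k string) (string, error) { return m[k], nil }

func main() {
	txn := mapTxn{"/meta/range-desc": "lhs-desc"}
	v, _ := getDescValue(txn, "/meta/range-desc", true)
	fmt.Println(v) // lhs-desc
}
```

Keeping the flag during the migration window, as the author notes, avoids changing the behavior of every caller while older recipients may still ignore the locking field.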
bors r+

Build succeeded
Fixes #46639.
This commit updates the Range merge transaction to perform a locking read on the left-hand side's range descriptor. Locking the descriptor early (i.e. on the read instead of the write of the read-modify-write operation) helps prevent multiple concurrent Range merges from thrashing. Thrashing is especially detrimental for Range merges because any restart results in an abort, so thrashing can result in indefinite livelock.
Before this change, I was easily able to reproduce the livelock discussed in #46639 by stressing `TestKVNemesisMultiNode` with the merge probability increased by 10x. When stressing on a 20-node roachprod cluster and using a 60s timeout, the test would typically fail after about 6,000 iterations. With this change and under the same setup, I have yet to see the failure over 125,000 iterations.