-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kv: readOnlyCmdMu not acquired by executeWriteBatch #46329
Description
#46234 noticed that unlike on the read-path (executeReadOnlyBatch), we don't synchronize with r.readOnlyCmdMu here. Is that ok? What if the replica is destroyed concurrently with a write? We won't be able to successfully propose as the lease will presumably have changed, but what if we hit an error during evaluation (e.g. a ConditionFailedError)?
@tbg:
I don't have a good answer here. If everything went through Raft, I think this code:
cockroach/pkg/kv/kvserver/replica_destroy.go
Lines 248 to 255 in 6c7b80e
for _, p := range r.mu.proposals { r.cleanupFailedProposalLocked(p) // NB: each proposal needs its own version of the error (i.e. don't try to // share the error across proposals). p.finishApplication(ctx, proposalResult{ Err: roachpb.NewError(roachpb.NewAmbiguousResultError("removing replica")), }) } would do it (assuming we check the destroy status under the same lock as we put new proposals in the map), but the failed CPut would skip the read. I think there's probably at least one bug here.
It's also striking how destroying the replica avoids the latch manager. You would assume that destroying the replica is something like
- set destroy status
- cancel all open proposals
- grab an "everything r/w" latch
- delete things
but this isn't how it works.
I look forward to when we unify the rw and ro paths, at which point we'll be forced to confront these head on.
It's also striking how destroying the replica avoids the latch manager.
Yeah, I was wondering about this too. It would be very nice to flush out all requests using the latch manager, but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches. Maybe that's just FUD.
@tbg:
but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches.
I think that's a legitimate concern, but we just have to handle it. Roughly, step 1 above sets a destroy status that also prevents handleRaftReady from carrying out any more cycles. step 2 cancels all proposals including the raft-owned ones (which now makes sense since we know the state machine won't transition any more). This starts looking a lot like the current code, but by grabbing the latches in addition before deleting we serialize cleanly with inflight reads, so we address exactly the open questions we have here.