storage: check leader not leaseholder condition on all commands #12323
spencerkimball merged 1 commit into master from
Conversation
|
Reviewed 4 of 4 files at r1.

pkg/storage/client_raft_test.go, line 2917 at r1 (raw file):
while you're here

pkg/storage/client_raft_test.go, line 2921 at r1 (raw file):
this is no longer testing the mechanism by which raft leadership transfers happen, right? seems odd to see this code here now that transfers are triggered by all commands. on second thought, maybe it is just that this comment is outdated.

pkg/storage/client_raft_test.go, line 2943 at r1 (raw file):
nit:

pkg/storage/client_raft_test.go, line 2963 at r1 (raw file):
nit:

pkg/storage/client_raft_test.go, line 2964 at r1 (raw file):
nit: this

pkg/storage/client_raft_test.go, line 2982 at r1 (raw file):
nit:

pkg/storage/helpers_test.go, line 264 at r1 (raw file):
this function is very similar to

pkg/storage/replica.go, line 3473 at r1 (raw file):
why the nil check?

pkg/storage/replica_proposal.go, line 430 at r1 (raw file):
while you're here, log the whole target replica?

Comments from Reviewable
|
In the past when this came up, there was concern about attempting too many of these raft transfers in case they're constantly failing. I don't understand how these transfers work or under which conditions they might fail, but maybe some throttling is in order? Btw, @ben, @petermattis, is it time to give https://reviewable.io/reviews/cockroachdb/cockroach/8837 another look? It was trying to address #8816 by transferring raft leadership before transferring leases.

Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.
|
Review status: all files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.

pkg/storage/client_raft_test.go, line 2917 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Right:

pkg/storage/replica_proposal.go, line 429 at r1 (raw file):
Can we do the lease check first?
|
The concern about leadership transfers failing was partially FUD, and I'm less concerned about it now since this appears to have been working reliably so far. Transfers can fail if network messages get reordered, but that shouldn't happen with our raft transport. We should have a metric for how often these transfers are attempted, but I don't think we need to build throttling unless we see a problem. (If anything, throttling could mask a problem by allowing a range to limp along with limited availability; without throttling the problem will be more visible.)

The combination of this change and #8837 would be bad: the node that receives the leadership transfer under #8837 would be likely to transfer it right back to the former leader under this change (and I think that's more of an argument against #8837 than against this change). I'll take a look at #8837 and #8816.

The commit message should be updated to say "on all write commands"; read-only traffic won't trigger this check. Should it? It doesn't matter in practice since a stable range with read-only traffic doesn't really care whether the lease holder and raft leader are the same, but it does mean that the leader-not-leaseholder metric wouldn't reliably fall to zero. (For that matter, the range could be completely idle. Should we have a scanner queue to perform this transfer as a time-based last resort?)

Reviewed 4 of 4 files at r1.

pkg/storage/replica_proposal.go, line 431 at r1 (raw file):
We should increment a metric here.
|
I've changed this to instead tee these up during metrics computation, when we realize the condition exists. Otherwise, I keep seeing small numbers of long-running replicas with the leader-not-leaseholder condition.

Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.

pkg/storage/client_raft_test.go, line 2917 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done.

pkg/storage/client_raft_test.go, line 2921 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I've mostly restored this unit test since I decided to make the transfers happen during metrics computation instead.

pkg/storage/client_raft_test.go, line 2943 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/client_raft_test.go, line 2963 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/client_raft_test.go, line 2964 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/client_raft_test.go, line 2982 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/helpers_test.go, line 264 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
It's simplified now.

pkg/storage/replica.go, line 3473 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
This is gone now. I was avoiding making this check after Raft leadership transfers, which always send a nil request. It made testing impossible.

pkg/storage/replica_proposal.go, line 429 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done.

pkg/storage/replica_proposal.go, line 430 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I've reverted to old code here.

pkg/storage/replica_proposal.go, line 431 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
We already have a
55df017 to 854703f (force-pushed)
|
Review status: 0 of 4 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

pkg/storage/replica.go, line 4472 at r2 (raw file):
It's very strange for a method named
|
Reviewed 4 of 4 files at r2.

pkg/storage/client_raft_test.go, line 2966 at r2 (raw file):
this "1" is still hardcoded
|
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.

pkg/storage/client_raft_test.go, line 2966 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/replica.go, line 4472 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done.
854703f to 9e4e348 (force-pushed)
|
This PR causes the cluster to lock up. Need to investigate why, but my suspicion is there's something happening in
|
Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.
|
Reviewed 4 of 4 files at r3.

pkg/storage/client_raft_test.go, line 2944 at r3 (raw file):
bunch more hardcoded IDs here, but not a blocker.
a8d9e31 to ca91356 (force-pushed)
|
Review status: 2 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.

pkg/storage/client_raft_test.go, line 2944 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
OK
|
Reviewed 2 of 2 files at r4.

pkg/storage/client_raft_test.go, line 2930 at r4 (raw file):
this "1" (and on the next line) is rd0.ReplicaID. should be easy now that you've moved the definition of

pkg/storage/replica_proposal.go, line 439 at r4 (raw file):
did this solve the cluster lockup problem?
|
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.

pkg/storage/replica_proposal.go, line 440 at r4 (raw file):
Unless I'm misreading, the check in the raft code is
|
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

pkg/storage/replica_proposal.go, line 439 at r4 (raw file):
s/leaseholder/target/ - only the caller knows that the target is the leaseholder
|
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

pkg/storage/client_raft_test.go, line 2930 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/replica_proposal.go, line 439 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
No. Still looking. This PR will continue to change.

pkg/storage/replica_proposal.go, line 439 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/replica_proposal.go, line 440 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
You're misreading. The
|
My guess is that quiescence is preventing a leadership transfer from being canceled in https://github.com/coreos/etcd/blob/d8513adf1d2058047929b1ed7c038fbe023ebe3a/raft/raft.go#L548 (but heartbeats are still flowing and preventing anyone but our designated target from campaigning). This could occur whenever we transfer leadership to a node that is going away, but this was unlikely when we only transferred to a node that successfully proposed and acquired a lease. Here, it's possible that we're transferring leadership to a node just as its replica is being removed. I think we need to prevent quiescence when a leader transfer is in progress.

Reviewed 2 of 4 files at r3, 2 of 2 files at r4.

pkg/storage/replica_proposal.go, line 431 at r1 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
|
@bdarnell I'm not following how your scenario could occur. The leaseholder is the one which performs replica rebalancing for a range, and it will never remove itself from the range. Instead, it first transfers its lease somewhere else and the recipient will perform the removal. I suppose it is possible for this lease transfer and removal to happen fast enough that the raft leadership transfer overlaps, but that seems unlikely. That said, I'll play around with some scenarios here as clearly we're seeing something on

Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.
|
Maybe it's racing with an addition instead of a removal, then: We see that the local node is overloaded while another node is underloaded. We rebalance a replica onto it, then as soon as the ChangeReplicas commits and we've processed it locally (but the target node hasn't), we transfer the lease, and then the raft leadership, all before that node has applied the EndTransaction that adds it (so it drops the raft MsgTimeoutNow message). The timing on that is really tight, but it could be possible, and that would have the same quiescence problem. There might be some quirk of preemptive snapshot processing (and its aftermath) that makes the timing more likely.
|
Damn, the problem disappeared after the run where I changed
|
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.

pkg/storage/replica_proposal.go, line 431 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
OK, added a metric.
ca91356 to e1c7af6 (force-pushed)
|
Reviewed 3 of 3 files at r5.

pkg/storage/client_raft_test.go, line 2930 at r4 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
?

pkg/storage/replica_proposal.go, line 439 at r4 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
?
|
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.

pkg/storage/client_raft_test.go, line 2930 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Damn, my changes here got reverted accidentally.

pkg/storage/replica_proposal.go, line 439 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.
58e9908 to 0f200dd (force-pushed)
|
Reviewed 2 of 2 files at r6.

pkg/storage/client_raft_test.go, line 2930 at r4 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
the error message below still says "1", but lg otherwise
0f200dd to 288b8a0 (force-pushed)
|
@petermattis take a look at the most recent change. I've decided to check for the condition at
|
Review status: 1 of 5 files reviewed at latest revision, 2 unresolved discussions.

pkg/storage/client_raft_test.go, line 2930 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.
Previously, we only checked for leader-not-leaseholder when completing a lease operation. This relied on the constant renewals of expiration-based leases to transfer Raft leadership reliably. We now do this while checking whether a range can be quiesced. This is convenient because we also want to avoid quiescing if we notice this problem.

Also, only attempt to transfer Raft leadership to the leaseholder if the leaseholder (the transferee) has fully applied the Raft log. Otherwise, the leader will send the presumably unapplied log suffix to the transferee again and again.
288b8a0 to 647008c (force-pushed)
|
Reviewed 1 of 3 files at r5, 5 of 5 files at r7.