storage: reduce commit index on quiescence heartbeat, don't drop them#31112
storage: reduce commit index on quiescence heartbeat, don't drop them#31112craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
Informs cockroachdb#30064. 53dc851 introduced logic to drop quiescence heartbeats to Replicas that are not caught up to the leader. This turns out to be an issue in cases where: - a Range is quiescent - a Replica needs a heartbeat to join a Raft group - that Replica's node is unable to update its liveness record because doing so depends on the Replica joining the Raft group - so the Replica never gets any heartbeats and never joins the group This change adjusts the logic. Instead of dropping these heartbeats, the leader Replica now sends them, but with a reduced commit index and with Quiesce equal to false. Release note: None
Release note: None
|
I don't like that this doesn't have a test, but from what I can tell, we don't have any tests around replica quiescence at all and a one-off test for this specific issue is going to be very hard to write. At a minimum, this seems to improve the situation with #30064. Any suggestions are welcome. |
bdarnell
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 4886 at r2 (raw file):
if !r.maybeCoalesceHeartbeat(ctx, msg, toReplica, fromReplica, quiesce) { log.Fatalf(ctx, "failed to coalesce known heartbeat: %v", msg)
What does "known" mean here? Why is this now a fatal instead of returning false?
nvb
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 4886 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
What does "known" mean here? Why is this now a fatal instead of returning false?
maybeCoalesceHeartbeat only returns false if it is given a non-heartbeat request. This is so that it's easy to use in Replica.sendRaftMessage. In this case, we know we're giving it a heartbeat request, so it should never return false.
bdarnell
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
|
bors r+ |
31112: storage: reduce commit index on quiescence heartbeat, don't drop them r=nvanbenschoten a=nvanbenschoten Informs #30064. 53dc851 introduced logic to drop quiescence heartbeats to Replicas that are not caught up to the leader. This turns out to be an issue in cases where: - a Range is quiescent - a Replica needs a heartbeat to join a Raft group - that Replica's node is unable to update its liveness record because doing so depends on the Replica joining the Raft group - so the Replica never gets any heartbeats and never joins the group This change adjusts the logic. Instead of dropping these heartbeats, the leader Replica now sends them, but with a reduced commit index and with Quiesce equal to false. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded |
Informs #30064.
53dc851 introduced logic to drop quiescence heartbeats to Replicas that
are not caught up to the leader. This turns out to be an issue in cases
where:
doing so depends on the Replica joining the Raft group
This change adjusts the logic. Instead of dropping these heartbeats, the
leader Replica now sends them, but with a reduced commit index and with
Quiesce equal to false.