docs: RFC loss of quorum recovery #72527
Conversation
Force-pushed 76cdc3c to 21473d5
bdarnell
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 31 at r1 (raw file):
* **(AllLive)** Any Replica found on any store is live (i.e. not waiting for replicaGC)
* **(Covering)** There is a replica for each range that needs recovery, i.e. we can cover the entire keyspace.
* **(DeterministicRangeDescriptor)** All Replicas of a given Range (across multiple stores) have the same view of the range descriptor.
s/Deterministic/Consistent/g?
I believe we've used this tool in cases where this assumption and RaftLog were violated. We had to manually determine that the stale range descriptor would be harmless once modified. I can't remember the details, though.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
We pointed out above that in the status quo, lots of assumptions are made about the state being consistent across the various stores, which allowed for local decision-making. This assumption is obviously not going to hold in general, and the possible problems arising from this make us anxious about proliferation of usage of the tool. We propose to add an explicit distributed data collection step that allows picking the “best” designated survivor and that also catches classes of situations in which the tool could cause undesired behavior. When such a situation is detected, the tool would refuse to work, and an override would have to be provided; one would be in truly uncharted territory in this case, with no guarantees given whatsoever. Over time, we can then evolve the tool to handle more corner cases.

The distributed data collection step is initially an offline command, i.e. a command `./cockroach debug recover dump-info <store>` is run directly against the store directories. Much of the current code in `unsafe-remove-dead-replicas` is reused: an iteration over all range descriptors takes place, but now information about all of the replicas is written to a file.
We should have the k8s operator team review this process to make sure it's feasible in that environment.
Can this command be run concurrently with the server process or must it be stopped first? If the server process is stopped (or has crashed on its own), is there any difficulty in accessing the right persistent volume? This tool would need access to the data keys if the store is encrypted.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 69 at r1 (raw file):
Note that this step also eliminates duplicate survivors whose range descriptor does not agree. For example, if we have a replica r1=[a,z)@gen5 and r1=[a,z)@gen7, we’ll eliminate the first. Note that this is an example of a general class of scenarios in which a split happened that the survivors don’t know about; in this case the right-hand side might exist in the cluster (and may not even need recovery). We thus detect this and will fail below, due to a hole in the keyspace. This step ensures that *(AllLive)* and *(DeterministicRangeDescriptor)* are true for the remaining set of replicas.

* For all ranges that have multiple replicas surviving, pick the one with the highest RaftCommittedIndex. (See “Follow-up Work” for a comment on this)
s/the one/one/
In the event of a tied RaftCommittedIndex, I assume we'd pick the highest StoreID, same as today.
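To make the selection rule concrete, here is a minimal sketch of picking a designated survivor per range, including the tie-break by highest StoreID assumed above (the `ReplicaInfo` type and its field names are hypothetical stand-ins, not the tool's actual data structures):

```go
package main

import "fmt"

// ReplicaInfo is a hypothetical summary of one surviving replica as gathered
// by the collection step; the field names here are illustrative only.
type ReplicaInfo struct {
	StoreID            int
	RaftCommittedIndex uint64
}

// pickSurvivor picks the replica with the highest RaftCommittedIndex,
// breaking ties by the highest StoreID (assumed to mirror today's behavior).
func pickSurvivor(replicas []ReplicaInfo) ReplicaInfo {
	best := replicas[0]
	for _, r := range replicas[1:] {
		if r.RaftCommittedIndex > best.RaftCommittedIndex ||
			(r.RaftCommittedIndex == best.RaftCommittedIndex && r.StoreID > best.StoreID) {
			best = r
		}
	}
	return best
}

func main() {
	survivor := pickSurvivor([]ReplicaInfo{
		{StoreID: 2, RaftCommittedIndex: 110},
		{StoreID: 5, RaftCommittedIndex: 110}, // tied index, higher StoreID wins
		{StoreID: 1, RaftCommittedIndex: 90},
	})
	fmt.Println(survivor.StoreID) // prints 5
}
```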
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 76 at r1 (raw file):
Note that choosing the highest CommittedIndex isn’t necessarily the best option. If we are to determine with full accuracy the most up-to-date replica, we need to either perform an offline version of Raft log reconciliation (which requires O(len-unapplied-raft-log) of space in the data collection in the worst case, i.e. a non-starter) or we need to resuscitate multiple survivors per range if available (but this raises questions about nondeterminism imparted by the in-place rewriting of the RangeDescriptor). Conceptually it would be desirable to replace the rewrite step with an “append” to the RaftLog. (This isn’t always the best option, there may be another replica that has more entries in its log, but doesn’t know that they are committed, but it’s a good
This paragraph seems redundant with the previous one (or if it's making a different point, the difference is subtle and I'm missing it).
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 103 at r1 (raw file):
Once the unsafe recovery tool is readily available, operators could start using it even when it is not appropriate, if it proves successful. Data inconsistency will not always occur, and that would give a false sense of safety.

## Follow-up work
We should produce a "pretty-printed" report of what data is potentially inconsistent after the recovery (i.e. "Table foo.bar, with keys between (X, Y)"). When we performed inconsistent recovery on a secondary index but not the corresponding primary, we could recommend dropping that index and recreating it (or even do so automatically).
erikgrinaker
left a comment
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @bdarnell, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
We should have the k8s operator team review this process to make sure it's feasible in that environment.
Can this command be run concurrently with the server process or must it be stopped first? If the server process is stopped (or has crashed on its own), is there any difficulty in accessing the right persistent volume? This tool would need access to the data keys if the store is encrypted.
Is this something we would even want to automate via the K8s operator? I think I'd wait for the half-online approach, which is less involved, and fall back to manual recovery in the meanwhile. But yeah, would welcome their input on this.
Manually running this would probably involve stopping the statefulset and starting new pods using the same PVCs as the statefulset, running a one-off command that dumps the recovery info via stdout. This could get fiddly, so we could consider including a small script that fetches the statefulset manifest from the K8s API and generates appropriate pod manifests.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 105 at r1 (raw file):
## Follow-up work

As a second step, we can make the tool fully online, i.e. automate the data collection and recovery steps behind a cli/API endpoint, thus providing a "panic button" that will try to get the cluster back to an available (albeit possibly userdata-inconsistent) state from which a new cluster can be seeded. This must not use most KV primitives (to avoid relying on any parts of the system that are unavailable); instead we need to operate at the level of Gossip node membership (for routing) and special RPC endpoints. The range report (ranges endpoint) should be sufficient for the data collection phase with some hardening and possibly inclusion of some additional pieces of information. The active recovery step can be carried out by a store-addressed recovery RPC that instructs the store to eject the in-memory incarnations of the designated-survivor replicas from the Store, replacing them with placeholders owned by the recovery RPC, which will thus have free rein over the portion of the engines owned by these replicas (and can thus carry out the same operations unsafe-remove-dead-replicas would).
nit: "As a second step" isn't accurate, since we also have "Step 2" above. Similarly, we refer to "stage four" below. I'd just drop the numeration.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 117 at r1 (raw file):
start; going by length of log alone could also pick the “wrong” survivor because of term changes; doing the right thing requires comparing the terms of the logs which is too much data to collect).

### TODO
Consider giving this a more descriptive heading.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 122 at r1 (raw file):
## Rationale and Alternatives
Are we planning to add something here? If not, let's just remove the section.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
# Unresolved questions

What is the best way to collect information about replicas in the half-online and online approaches: should the cli fan out to all nodes as instructed, or should it delegate to a single node to perform the fan-out operations? The tradeoff is that cli fan-out requires the operator to provide all node addresses to guarantee collection, but it would work in all cases. Maybe we can do both approaches, with cli fan-out being a fallback if we can't collect from all nodes.
I think we'll need multiple fallbacks here, so it may make sense for the CLI to do the actual fanout to avoid having to duplicate this logic both in the CLI and on the nodes. This then becomes a matter of node discovery, which would be:
- If the KV liveness range is available, collect nodes from there.
- Otherwise, crawl gossip by collecting gossip liveness entries from the connected node, and then connect to those nodes and do the same recursively.
- Otherwise, allow the operator to explicitly list all nodes via the CLI.
We should probably show a summary of the discovered nodes and ask the operator for confirmation before proceeding, at least in the case of 2.
udnay
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @chrisseto, @erikgrinaker, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is this something we would even want to automate via the K8s operator? I think I'd wait for the half-online approach, which is less involved, and fall back to manual recovery in the meanwhile. But yeah, would welcome their input on this.
Manually running this would probably involve stopping the statefulset and starting new pods using the same PVCs as the statefulset, running a one-off command that dumps the recovery info via stdout. This could get fiddly, so we could consider including a small script that fetches the statefulset manifest from the K8s API and generates appropriate pod manifests.
@chrisseto can you answer the questions above? I'm not sure if a pod can take over a bound PVC. Maybe we could keep the existing pods but change the container's entry point to either the proposed command or some sort of busybox workload so that we can run the command. Keeping the STS around and managing the pods seems like a good idea if possible but I'm not sure.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think we'll need multiple fallbacks here, so it may make sense for the CLI to do the actual fanout to avoid having to duplicate this logic both in the CLI and on the nodes. This then becomes a matter of node discovery, which would be:
- If the KV liveness range is available, collect nodes from there.
- Otherwise, crawl gossip by collecting gossip liveness entries from the connected node, and then connect to those nodes and do the same recursively.
- Otherwise, allow the operator to explicitly list all nodes via the CLI.
We should probably show a summary of the discovered nodes and ask the operator for confirmation before proceeding, at least in the case of 2.
For the k8s operator, what signal can we use to determine that we've irrecoverably lost quorum? The steps described seem involved, and it's probably best not to automate them right away, but we should be able to suspend a cluster or take it offline if we detect that it is in the state described here. How can we prevent false positives so we don't take down a healthy cluster?
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @chrisseto, @mwang1026, @tbg, and @udnay)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
Previously, udnay (Yandu Oppacher) wrote…
@chrisseto can you answer the questions above? I'm not sure if a pod can take over a bound PVC. Maybe we could keep the existing pods but change the container's entry point to either the proposed command or some sort of busybox workload so that we can run the command. Keeping the STS around and managing the pods seems like a good idea if possible but I'm not sure.
Right, couldn't remember if the pods were kept around when stopped. You're right that it'd be much better to just reuse the existing stopped pods and start them with an explicit command/entrypoint.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
Previously, udnay (Yandu Oppacher) wrote…
For the k8s operator, what signal can we use to determine that we've irrecoverably lost quorum? The steps described seem involved, and it's probably best not to automate them right away, but we should be able to suspend a cluster or take it offline if we detect that it is in the state described here. How can we prevent false positives so we don't take down a healthy cluster?
I'd be very wary of automatically triggering this. It can result in irrecoverable data loss and broken invariants, and I think a human needs to make a judgement call as to whether restoring from backup or attempting loss of quorum recovery is the better option. It's also going to be hard to detect whether the quorum loss is temporary or permanent -- e.g. in the case of a network outage, if a human knows that the network is coming back online within an hour, it's probably better to wait for that rather than triggering LoQ recovery. Furthermore, automation could end up doing LoQ recovery independently on multiple sides of a network partition, and then you have a classic split brain scenario.
This tooling should really be a last resort, not something we rely on for normal operation.
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @chrisseto, @mwang1026, @tbg, and @udnay)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
we should be able to suspend a cluster or take it offline if we detect that it is in the state described here. How can we prevent false positives so we don't take down a healthy cluster?
To answer your actual question (🙂), I don't think we currently have a good way to detect it, but see #61118. We're currently working on some related functionality (circuitbreakers for unavailable ranges) that needs this detection and should allow us to expose a metric for it pretty easily. But we can't know whether it's irrecoverable or not, because we can't know why the unavailable nodes are unavailable (are they just offline, or did they catch fire?).
However, even if we lose quorum on some ranges, would we want to shut down the cluster? In >3 node clusters there would presumably still be ranges that have quorum; wouldn't it be better to keep those online and serve whatever traffic we can?
aliher1911
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @chrisseto, @erikgrinaker, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
we should be able to suspend a cluster or take it offline if we detect that it is in the state described here. How can we prevent false positives so we don't take down a healthy cluster?
To answer your actual question (🙂), I don't think we currently have a good way to detect it, but see #61118. We're currently working on some related functionality (circuitbreakers for unavailable ranges) that needs this detection and should allow us to expose a metric for it pretty easily. But we can't know whether it's irrecoverable or not, because we can't know why the unavailable nodes are unavailable (are they just offline, or did they catch fire?).
However, even if we lose quorum on some ranges, would we want to shut down the cluster? In >3 node clusters there would presumably still be ranges that have quorum; wouldn't it be better to keep those online and serve whatever traffic we can?
We're definitely not looking at automating failure detection now, and even if we could, invoking recovery should be a judgement call: one accepts the data loss and possible data inconsistency and proceeds. If we lost quorum, we don't know whether the remaining replica(s) have up-to-date information, and in the case of active writes to that range the last operations might be lost.
bdarnell
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @chrisseto, @erikgrinaker, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Right, couldn't remember if the pods were kept around when stopped. You're right that it'd be much better to just reuse the existing stopped pods and start them with an explicit command/entrypoint.
We definitely wouldn't want to automate this (in the sense of the operator deciding to take this risky step on its own). My question was basically about the feasibility of running this on an operator-managed cluster, regardless of how much the operator itself is involved. (I think unless the half-online mode is imminent we'd still want to put some of the scripting in the operator to run everything on the right volumes and collect results).
udnay
left a comment
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @chrisseto, @erikgrinaker, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
The operator would need to know that the cluster is in a broken state and must be managed differently, but it certainly can be made to work here. I think this is true whether or not the operator is involved, given that Kubernetes resources try to converge on a stable state.
We definitely wouldn't want to automate this (in the sense of the operator deciding to take this risky step on its own).
I agree, but we also need to make sure that CRDB can give the operator a signal not to run any tasks or try to fix the cluster by restarting pods or nodes, i.e. CRDB signals that it is in a bad state and any pending or executing workflows should be paused/canceled until it's been resolved.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
Previously, aliher1911 (Oleg) wrote…
We're definitely not looking at automating failure detection now, and even if we could, invoking recovery should be a judgement call: one accepts the data loss and possible data inconsistency and proceeds. If we lost quorum, we don't know whether the remaining replica(s) have up-to-date information, and in the case of active writes to that range the last operations might be lost.
Right, this had come up before: we can have under-replicated ranges because of a node restart, and the operator needs a way of determining that CRDB is in a state where the workflow can move on to restarting the next node. In CC we just wait a certain period of time before moving to the next node (3min after the readiness probe succeeds); in the self-hosted operator we check each node/pod for under-replicated ranges, both as an extra guard rail and to short-circuit the timeout for large clusters.
Ideally CRDB would expose a single endpoint that would know how to query all of the nodes for us vs the operator needing to do that bit of business logic.
chrisseto
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @bdarnell, @erikgrinaker, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
Previously, udnay (Yandu Oppacher) wrote…
The operator would need to know that the cluster is in a broken state and must be managed differently, but it certainly can be made to work here. I think this is true whether or not the operator is involved, given that Kubernetes resources try to converge on a stable state.
We definitely wouldn't want to automate this (in the sense of the operator deciding to take this risky step on its own).
I agree, but we also need to make sure that CRDB can give the operator a signal not to run any tasks or try to fix the cluster by restarting pods or nodes, i.e. CRDB signals that it is in a bad state and any pending or executing workflows should be paused/canceled until it's been resolved.
Any pod can take over a PV{,C} if it's not currently bound to something. Provided that one is recovering a cluster in a single region, it would be fairly easy to provide a PodSpec that mounts all stores under a shared volume. Executing in a multi-region/multi-kubernetes environment would be a bit more difficult as the user would need to extract the results of dump-info to a centralized location before proceeding but we could still produce PodSpecs per region and take all the nodes offline, if that's desirable.
It might be a desirable feature in general to have a "recovery" mode that would take specific nodes offline and mount their volumes to a pod that has the cockroach binary but has an entrypoint of sleep 999999. This would allow human operators to exec in and perform any operations that might be needed, be it this, manual compaction, or other strange recovery techniques that we have.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 87 at r1 (raw file):
The stop-the-world approach can be unnecessarily intrusive. The cluster doesn't have to be stopped for the full duration of all the recovery stages, It only needs to be stopped to perform apply a change to the store once unavailable replicas are identified and recovery ops planned. Moreover, since ranges that lost quorum can't progress anyway, update could be performed in a rolling fashion.
nit: s/to perform apply/to apply/?
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @bdarnell, @mwang1026, @tbg, and @udnay)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
It might be a desirable feature in general to have a "recovery" mode that would take specific nodes offline and mount their volumes to a pod that has the cockroach binary but has an entrypoint of sleep 999999. This would allow human operators to exec in and perform any operations that might be needed, be it this, manual compaction, or other strange recovery techniques that we have.
This seems like a good idea.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
Previously, udnay (Yandu Oppacher) wrote…
Right, this had come up before: we can have under-replicated ranges because of a node restart, and the operator needs a way of determining that CRDB is in a state where the workflow can move on to restarting the next node. In CC we just wait a certain period of time before moving to the next node (3min after the readiness probe succeeds); in the self-hosted operator we check each node/pod for under-replicated ranges, both as an extra guard rail and to short-circuit the timeout for large clusters.
Ideally CRDB would expose a single endpoint that would know how to query all of the nodes for us vs the operator needing to do that bit of business logic.
I see. Well, I think we can provide a signal for unavailable ranges (generally due to loss of quorum) at least, with #61118. Integrating that into a single endpoint seems like a broader discussion that the server team should own, but it definitely seems doable.
aliher1911
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @erikgrinaker, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 31 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/Deterministic/Consistent/g?
I believe we've used this tool in cases where this assumption and RaftLog were violated. We had to manually determine that the stale range descriptor would be harmless once modified. I can't remember the details, though.
Done.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 52 at r1 (raw file):
The operator would need to know that the cluster is in a broken state and must be managed differently, but it certainly can be made to work here. I think this is true whether or not the operator is involved, given that Kubernetes resources try to converge on a stable state.
We could theoretically add some "property" to the cluster that indicates that we decided to proceed with recovery. I'm not sure if we currently have any state like that. Otherwise we don't want to stop restarting nodes, as that could make even more data unavailable as a result of losing some of the ranges.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 69 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/the one/one/
In the event of a tied RaftCommittedIndex, I assume we'd pick the highest StoreID, same as today.
Done.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 76 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This paragraph seems redundant with the previous one (or if it's making a different point, the difference is subtle and I'm missing it).
Looks like this one leaked from the follow-up work section, which discusses what could and shouldn't be done in some detail.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 103 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
We should produce a "pretty-printed" report of what data is potentially inconsistent after the recovery (i.e. "Table foo.bar, with keys between (X, Y)"). When we performed inconsistent recovery on a secondary index but not the corresponding primary, we could recommend dropping that index and recreating it (or even do so automatically).
Done.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 117 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Consider giving this a more descriptive heading.
Done.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 122 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Are we planning to add something here? If not, let's just remove the section.
Done.
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I see. Well, I think we can provide a signal for unavailable ranges (generally due to loss of quorum) at least, with #61118. Integrating that into a single endpoint seems like a broader discussion that the server team should own, but it definitely seems doable.
I was thinking about it, and the case looks slightly different. We don't want to stop restarting nodes if we have unavailable ranges in the normal case. We only want to block restarts if we've started fixing the store and want a safeguard against data corruption.
Force-pushed 21473d5 to 040bdba
udnay
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @erikgrinaker, @mwang1026, and @tbg)
docs/RFCS/20211103_loss_of_quorum_recovery.md, line 134 at r1 (raw file):
Previously, aliher1911 (Oleg) wrote…
I was thinking about it, and the case looks slightly different. We don't want to stop restarting nodes if we have unavailable ranges in the normal case. We only want to block restarts if we've started fixing the store and want a safeguard against data corruption.
I think the operator doing anything once quorum is lost won't be safe and is out of scope for 22.1. As @erikgrinaker said, this is most likely a larger discussion, and the endpoint would be checked during normal operations like cert renewal or upgrades, not in this recovery case.
@aliher1911 should we merge this?
Force-pushed 040bdba to 78855f8
Had a final go-through of this to ensure the current impl doesn't contradict the RFC. It looks accurate with regard to what is done and what we want to do further. We should merge this.
tbg
left a comment
Change status from draft before merging :-)
Force-pushed 78855f8 to 4179a8f
bors r+
Build succeeded:
RFC describes proposed approach to recovering data ranges that
lost consensus in corresponding raft groups because of permanent
loss of too many replicas.
Release note: None
Jira issue: CRDB-14698