TQ: Add protocol support for LRTQ upgrade#9065
Conversation
0e4ea0a to
f9e3eea
Compare
This PR completes the first version of the sans-io trust quorum protocol implementation. LRTQ upgrade can now be started via `Node::coordinate_upgrade_from_lrtq`. This triggers the coordinating node to start collecting the LRTQ key shares so that they can be used to construct the LRTQ rack secret via the bootstore code. After this occurs, a Prepare message is sent out with this old rack secret encrypted in a manner identical to a normal reconfiguration. The prepare and commit paths remain the same. The cluster proptest was updated to sometimes start out with an existing LRTQ configuration and then to upgrade from there. Like normal reconfigurations it allows aborting and pre-empting of the LRTQ upgrade with a new attempt at a higher epoch. In production this is how we "retry" if the coordinating node crashes prior to commit, or more accurately, if nexus can't talk to the coordinating node for some period of time and just moves on. After the LRTQ upgrade commits, normal reconfigurations are run. We also remove unnecessary config related messages in this commit. Since a `Configuration` does not contain sensitive information it can be retrieved when Nexus polls the coordinator before it commits. Then Nexus can save this info and send it in `PrepareAndCommit` messages rather than having the receiving node try to find a live peer with the config prior to collecting shares. This is a nice optimization that reduces protocol complexity a bit. This removal allowed removing the TODO in the message `match` statement in `Node::handle` and completing the protocol.
f9e3eea to
f2540b9
Compare
plotnick
left a comment
There was a problem hiding this comment.
This looks correct to me, within my limited knowledge of the protocol details. Very exciting to be able to upgrade out of LRTQ!
One general thing I did notice (not just in new code) was a number of places where errors are logged but not propagated up. In these (admittedly unlikely, most of them) cases, how would an operator know something is wrong or where to look? Maybe not worth addressing, but definitely gave me pause.
| error!( | ||
| self.log, | ||
| "Failed to reconstruct LRTQ rack secret"; | ||
| "err" => err.to_string() | ||
| ); | ||
| return; |
There was a problem hiding this comment.
How would we know something's gone wrong here aside from checking the log? Shouldn't we propagate errors like this up somehow?
|
Thanks for taking a look @plotnick!
This is a great question. The short answer is that the operator wouldn't know the specific error unless they looked at the logs/support bundle. However, they would be able to see which nodes have not acked or committed when Nexus is up and participating, because the plan is to have an RPW that runs to poll status from the nodes and commit reconfigurations. So during this "online" part of the protocol, errors can be recorded, etc... and then the logs can be looked at via support bundler or whatever. The reason I didn't propagate the errors up at this level is that in most cases there is nowhere for them to go. The majority of errors take place after receiving a peer message on the bootstrap network. The error could be propagated up to the caller, but that caller is an async task wrapper for the handled message. It could then be passed somewhere else, but it will almost certainly just end up logged because there's no upstream system that made a request that triggered the error. Neither sled-agent, nor nexus triggered the synchronous callback that caused the error and so there is nothing to return. To clarify this point about not having nexus receive errors: when a reconfiguration or LRTQ upgrade is started, it is a single call that results in a bunch of messages, but Nexus doesn't block waiting for completion. Instead it has an RPW that polls the coordinator for status while the coordinator gathers shares and sends prepare messages, etc... and other nodes respond. This keeps nexus responsive, while simplifying the protocol because now the protocol code doesn't have to match requests from nexus to specific responses or deal with timeouts itself. All callbacks are synchronous. But it does mean that there's nobody synchronously waiting when an error occurs. But Nexus is polling, and as part of Trust quorum recording who has prepared and when it has seen enough prepare acks to commit. Nexus is recording this information as part of its RPW and can alert when nodes haven't responded for a given amount of time, leading to more debugging via logs. There is also the problem when the control plane isn't up at all (such as rack cold boot) and nothing can be done besides log the errors. All that was the design strategy here, but not really laid out explicitly in docs. It's also harder to see because I haven't implemented the Nexus parts or any of the async sled-agent code yet. I think it will work, but I wouldn't be surprised if I go back and add more diagnostics later. One thing I was thinking would help is to keep track of stats for certain errors, similar to how alarms are tracked. These stats could then be polled from Nexus, omdb, etc... I may end up doing that, but I'm not convinced yet that the effort is worth it when the logs will have to be accessed for details anyway. |
|
That all makes sense, thanks for the detailed explanation! |
This PR completes the first version of the sans-io trust quorum protocol implementation. LRTQ upgrade can now be started via `Node::coordinate_upgrade_from_lrtq`. This triggers the coordinating node to start collecting the LRTQ key shares so that they can be used to construct the LRTQ rack secret via the bootstore code. After this occurs, a Prepare message is sent out with this old rack secret encrypted in a manner identical to a normal reconfiguration. The prepare and commit paths remain the same. The cluster proptest was updated to sometimes start out with an existing LRTQ configuration and then to upgrade from there. Like normal reconfigurations it allows aborting and pre-empting of the LRTQ upgrade with a new attempt at a higher epoch. In production this is how we "retry" if the coordinating node crashes prior to commit, or more accurately, if nexus can't talk to the coordinating node for some period of time and just moves on. After the LRTQ upgrade commits, normal reconfigurations are run. We also remove unnecessary config related messages in this commit. Since a `Configuration` does not contain sensitive information it can be retrieved when Nexus polls the coordinator before it commits. Then Nexus can save this info and send it in `PrepareAndCommit` messages rather than having the receiving node try to find a live peer with the config prior to collecting shares. This is a nice optimization that reduces protocol complexity a bit. This removal allowed removing the TODO in the message `match` statement in `Node::handle` and completing the protocol.
This PR completes the first version of the sans-io trust quorum protocol
implementation.
LRTQ upgrade can now be started via
Node::coordinate_upgrade_from_lrtq.This triggers the coordinating node to start collecting the LRTQ key
shares so that they can be used to construct the LRTQ rack secret via
the bootstore code. After this occurs, a Prepare message is sent out
with this old rack secret encrypted in a manner identical to a normal
reconfiguration. The prepare and commit paths remain the same.
The cluster proptest was updated to sometimes start out with an
existing LRTQ configuration and then to upgrade from there. Like normal
reconfigurations it allows aborting and pre-empting of the LRTQ upgrade
with a new attempt at a higher epoch. In production this is how we "retry"
if the coordinating node crashes prior to commit, or more accurately, if
nexus can't talk to the coordinating node for some period of time and
just moves on. After the LRTQ upgrade commits, normal reconfigurations
are run.
We also remove unnecessary config related messages in this commit.
Since a
Configurationdoes not contain sensitive information it can beretrieved when Nexus polls the coordinator before it commits. Then Nexus
can save this info and send it in
PrepareAndCommitmessages ratherthan having the receiving node try to find a live peer with the config
prior to collecting shares. This is a nice optimization that reduces
protocol complexity a bit. This removal allowed removing the TODO in the
message
matchstatement inNode::handleand completing the protocol.