TQ: Add protocol support for LRTQ upgrade by andrewjstone · Pull Request #9065 · oxidecomputer/omicron

andrewjstone · 2025-09-23T19:51:51Z

This PR completes the first version of the sans-io trust quorum protocol
implementation.

LRTQ upgrade can now be started via Node::coordinate_upgrade_from_lrtq.
This triggers the coordinating node to start collecting the LRTQ key
shares so that they can be used to construct the LRTQ rack secret via
the bootstore code. After this occurs, a Prepare message is sent out
with this old rack secret encrypted in a manner identical to a normal
reconfiguration. The prepare and commit paths remain the same.

The cluster proptest was updated to sometimes start out with an
existing LRTQ configuration and then to upgrade from there. Like normal
reconfigurations it allows aborting and pre-empting of the LRTQ upgrade
with a new attempt at a higher epoch. In production this is how we "retry"
if the coordinating node crashes prior to commit, or more accurately, if
nexus can't talk to the coordinating node for some period of time and
just moves on. After the LRTQ upgrade commits, normal reconfigurations
are run.

We also remove unnecessary config related messages in this commit.
Since a Configuration does not contain sensitive information it can be
retrieved when Nexus polls the coordinator before it commits. Then Nexus
can save this info and send it in PrepareAndCommit messages rather
than having the receiving node try to find a live peer with the config
prior to collecting shares. This is a nice optimization that reduces
protocol complexity a bit. This removal allowed removing the TODO in the
message match statement in Node::handle and completing the protocol.

This PR completes the first version of the sans-io trust quorum protocol implementation. LRTQ upgrade can now be started via `Node::coordinate_upgrade_from_lrtq`. This triggers the coordinating node to start collecting the LRTQ key shares so that they can be used to construct the LRTQ rack secret via the bootstore code. After this occurs, a Prepare message is sent out with this old rack secret encrypted in a manner identical to a normal reconfiguration. The prepare and commit paths remain the same. The cluster proptest was updated to sometimes start out with an existing LRTQ configuration and then to upgrade from there. Like normal reconfigurations it allows aborting and pre-empting of the LRTQ upgrade with a new attempt at a higher epoch. In production this is how we "retry" if the coordinating node crashes prior to commit, or more accurately, if nexus can't talk to the coordinating node for some period of time and just moves on. After the LRTQ upgrade commits, normal reconfigurations are run. We also remove unnecessary config related messages in this commit. Since a `Configuration` does not contain sensitive information it can be retrieved when Nexus polls the coordinator before it commits. Then Nexus can save this info and send it in `PrepareAndCommit` messages rather than having the receiving node try to find a live peer with the config prior to collecting shares. This is a nice optimization that reduces protocol complexity a bit. This removal allowed removing the TODO in the message `match` statement in `Node::handle` and completing the protocol.

plotnick

This looks correct to me, within my limited knowledge of the protocol details. Very exciting to be able to upgrade out of LRTQ!

One general thing I did notice (not just in new code) was a number of places where errors are logged but not propagated up. In these (admittedly unlikely, most of them) cases, how would an operator know something is wrong or where to look? Maybe not worth addressing, but definitely gave me pause.

plotnick · 2025-09-23T22:30:06Z

+                            error!(
+                                self.log,
+                                "Failed to reconstruct LRTQ rack secret";
+                                "err" => err.to_string()
+                            );
+                            return;


How would we know something's gone wrong here aside from checking the log? Shouldn't we propagate errors like this up somehow?

andrewjstone · 2025-09-23T22:59:39Z

Thanks for taking a look @plotnick!

One general thing I did notice (not just in new code) was a number of places where errors are logged but not propagated up. In these (admittedly unlikely, most of them) cases, how would an operator know something is wrong or where to look?

This is a great question. The short answer is that the operator wouldn't know the specific error unless they looked at the logs/support bundle. However, they would be able to see which nodes have not acked or committed when Nexus is up and participating, because the plan is to have an RPW that runs to poll status from the nodes and commit reconfigurations. So during this "online" part of the protocol, errors can be recorded, etc... and then the logs can be looked at via support bundler or whatever.

The reason I didn't propagate the errors up at this level is that in most cases there is nowhere for them to go. The majority of errors take place after receiving a peer message on the bootstrap network. The error could be propagated up to the caller, but that caller is an async task wrapper for the handled message. It could then be passed somewhere else, but it will almost certainly just end up logged because there's no upstream system that made a request that triggered the error. Neither sled-agent, nor nexus triggered the synchronous callback that caused the error and so there is nothing to return.

To clarify this point about not having nexus receive errors: when a reconfiguration or LRTQ upgrade is started, it is a single call that results in a bunch of messages, but Nexus doesn't block waiting for completion. Instead it has an RPW that polls the coordinator for status while the coordinator gathers shares and sends prepare messages, etc... and other nodes respond. This keeps nexus responsive, while simplifying the protocol because now the protocol code doesn't have to match requests from nexus to specific responses or deal with timeouts itself. All callbacks are synchronous. But it does mean that there's nobody synchronously waiting when an error occurs. But Nexus is polling, and as part of Trust quorum recording who has prepared and when it has seen enough prepare acks to commit. Nexus is recording this information as part of its RPW and can alert when nodes haven't responded for a given amount of time, leading to more debugging via logs.

There is also the problem when the control plane isn't up at all (such as rack cold boot) and nothing can be done besides log the errors.

All that was the design strategy here, but not really laid out explicitly in docs. It's also harder to see because I haven't implemented the Nexus parts or any of the async sled-agent code yet. I think it will work, but I wouldn't be surprised if I go back and add more diagnostics later. One thing I was thinking would help is to keep track of stats for certain errors, similar to how alarms are tracked. These stats could then be polled from Nexus, omdb, etc... I may end up doing that, but I'm not convinced yet that the effort is worth it when the logs will have to be accessed for details anyway.

plotnick · 2025-09-24T14:05:58Z

That all makes sense, thanks for the detailed explanation!

This PR completes the first version of the sans-io trust quorum protocol implementation. LRTQ upgrade can now be started via `Node::coordinate_upgrade_from_lrtq`. This triggers the coordinating node to start collecting the LRTQ key shares so that they can be used to construct the LRTQ rack secret via the bootstore code. After this occurs, a Prepare message is sent out with this old rack secret encrypted in a manner identical to a normal reconfiguration. The prepare and commit paths remain the same. The cluster proptest was updated to sometimes start out with an existing LRTQ configuration and then to upgrade from there. Like normal reconfigurations it allows aborting and pre-empting of the LRTQ upgrade with a new attempt at a higher epoch. In production this is how we "retry" if the coordinating node crashes prior to commit, or more accurately, if nexus can't talk to the coordinating node for some period of time and just moves on. After the LRTQ upgrade commits, normal reconfigurations are run. We also remove unnecessary config related messages in this commit. Since a `Configuration` does not contain sensitive information it can be retrieved when Nexus polls the coordinator before it commits. Then Nexus can save this info and send it in `PrepareAndCommit` messages rather than having the receiving node try to find a live peer with the config prior to collecting shares. This is a nice optimization that reduces protocol complexity a bit. This removal allowed removing the TODO in the message `match` statement in `Node::handle` and completing the protocol.

andrewjstone force-pushed the tq-lrtq-protocol branch from 0e4ea0a to f9e3eea Compare September 23, 2025 19:55

andrewjstone force-pushed the tq-lrtq-protocol branch from f9e3eea to f2540b9 Compare September 23, 2025 19:58

andrewjstone mentioned this pull request Sep 18, 2025

Trust Quorum Protocol Implementation Tracking #8965

Closed

27 tasks

plotnick approved these changes Sep 23, 2025

View reviewed changes

andrewjstone merged commit 842226d into main Sep 24, 2025
17 checks passed

andrewjstone deleted the tq-lrtq-protocol branch September 24, 2025 17:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TQ: Add protocol support for LRTQ upgrade#9065

TQ: Add protocol support for LRTQ upgrade#9065
andrewjstone merged 1 commit into
mainfrom
tq-lrtq-protocol

andrewjstone commented Sep 23, 2025 •

edited

Loading

Uh oh!

plotnick left a comment

Uh oh!

plotnick Sep 23, 2025

Uh oh!

andrewjstone commented Sep 23, 2025

Uh oh!

plotnick commented Sep 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andrewjstone commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

plotnick left a comment

Choose a reason for hiding this comment

Uh oh!

plotnick Sep 23, 2025

Choose a reason for hiding this comment

Uh oh!

andrewjstone commented Sep 23, 2025

Uh oh!

plotnick commented Sep 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

andrewjstone commented Sep 23, 2025 •

edited

Loading