Skip to content

TQ: Add protocol support for LRTQ upgrade#9065

Merged
andrewjstone merged 1 commit into
mainfrom
tq-lrtq-protocol
Sep 24, 2025
Merged

TQ: Add protocol support for LRTQ upgrade#9065
andrewjstone merged 1 commit into
mainfrom
tq-lrtq-protocol

Conversation

@andrewjstone

@andrewjstone andrewjstone commented Sep 23, 2025

Copy link
Copy Markdown
Contributor

This PR completes the first version of the sans-io trust quorum protocol
implementation.

LRTQ upgrade can now be started via Node::coordinate_upgrade_from_lrtq.
This triggers the coordinating node to start collecting the LRTQ key
shares so that they can be used to construct the LRTQ rack secret via
the bootstore code. After this occurs, a Prepare message is sent out
with this old rack secret encrypted in a manner identical to a normal
reconfiguration. The prepare and commit paths remain the same.

The cluster proptest was updated to sometimes start out with an
existing LRTQ configuration and then to upgrade from there. Like normal
reconfigurations it allows aborting and pre-empting of the LRTQ upgrade
with a new attempt at a higher epoch. In production this is how we "retry"
if the coordinating node crashes prior to commit, or more accurately, if
nexus can't talk to the coordinating node for some period of time and
just moves on. After the LRTQ upgrade commits, normal reconfigurations
are run.

We also remove unnecessary config related messages in this commit.
Since a Configuration does not contain sensitive information it can be
retrieved when Nexus polls the coordinator before it commits. Then Nexus
can save this info and send it in PrepareAndCommit messages rather
than having the receiving node try to find a live peer with the config
prior to collecting shares. This is a nice optimization that reduces
protocol complexity a bit. This removal allowed removing the TODO in the
message match statement in Node::handle and completing the protocol.

This PR completes the first version of the sans-io trust quorum protocol
implementation.

LRTQ upgrade can now be started via `Node::coordinate_upgrade_from_lrtq`.
This triggers the coordinating node to start collecting the LRTQ key
shares so that they can be used to construct the LRTQ rack secret via
the bootstore code. After this occurs, a Prepare message is sent out
with this old rack secret encrypted in a manner identical to a normal
reconfiguration. The prepare and commit paths remain the same.

The cluster proptest was updated to sometimes start out with an
existing LRTQ configuration and then to upgrade from there. Like normal
reconfigurations it allows aborting and pre-empting of the LRTQ upgrade
with a new attempt at a higher epoch. In production this is how we "retry"
if the coordinating node crashes prior to commit, or more accurately, if
nexus can't talk to the coordinating node for some period of time and
just moves on. After the LRTQ upgrade commits, normal reconfigurations
are run.

We also remove unnecessary config related messages in this commit.
Since a `Configuration` does not contain sensitive information it can be
retrieved when Nexus polls the coordinator before it commits. Then Nexus
can save this info and send it in `PrepareAndCommit` messages rather
than having the receiving node try to find a live peer with the config
prior to collecting shares. This is a nice optimization that reduces
protocol complexity a bit. This removal allowed removing the TODO in the
message `match` statement in `Node::handle` and completing the protocol.

@plotnick plotnick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me, within my limited knowledge of the protocol details. Very exciting to be able to upgrade out of LRTQ!

One general thing I did notice (not just in new code) was a number of places where errors are logged but not propagated up. In these (admittedly unlikely, most of them) cases, how would an operator know something is wrong or where to look? Maybe not worth addressing, but definitely gave me pause.

Comment on lines +587 to +592
error!(
self.log,
"Failed to reconstruct LRTQ rack secret";
"err" => err.to_string()
);
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we know something's gone wrong here aside from checking the log? Shouldn't we propagate errors like this up somehow?

@andrewjstone

Copy link
Copy Markdown
Contributor Author

Thanks for taking a look @plotnick!

One general thing I did notice (not just in new code) was a number of places where errors are logged but not propagated up. In these (admittedly unlikely, most of them) cases, how would an operator know something is wrong or where to look?

This is a great question. The short answer is that the operator wouldn't know the specific error unless they looked at the logs/support bundle. However, they would be able to see which nodes have not acked or committed when Nexus is up and participating, because the plan is to have an RPW that runs to poll status from the nodes and commit reconfigurations. So during this "online" part of the protocol, errors can be recorded, etc... and then the logs can be looked at via support bundler or whatever.

The reason I didn't propagate the errors up at this level is that in most cases there is nowhere for them to go. The majority of errors take place after receiving a peer message on the bootstrap network. The error could be propagated up to the caller, but that caller is an async task wrapper for the handled message. It could then be passed somewhere else, but it will almost certainly just end up logged because there's no upstream system that made a request that triggered the error. Neither sled-agent, nor nexus triggered the synchronous callback that caused the error and so there is nothing to return.

To clarify this point about not having nexus receive errors: when a reconfiguration or LRTQ upgrade is started, it is a single call that results in a bunch of messages, but Nexus doesn't block waiting for completion. Instead it has an RPW that polls the coordinator for status while the coordinator gathers shares and sends prepare messages, etc... and other nodes respond. This keeps nexus responsive, while simplifying the protocol because now the protocol code doesn't have to match requests from nexus to specific responses or deal with timeouts itself. All callbacks are synchronous. But it does mean that there's nobody synchronously waiting when an error occurs. But Nexus is polling, and as part of Trust quorum recording who has prepared and when it has seen enough prepare acks to commit. Nexus is recording this information as part of its RPW and can alert when nodes haven't responded for a given amount of time, leading to more debugging via logs.

There is also the problem when the control plane isn't up at all (such as rack cold boot) and nothing can be done besides log the errors.

All that was the design strategy here, but not really laid out explicitly in docs. It's also harder to see because I haven't implemented the Nexus parts or any of the async sled-agent code yet. I think it will work, but I wouldn't be surprised if I go back and add more diagnostics later. One thing I was thinking would help is to keep track of stats for certain errors, similar to how alarms are tracked. These stats could then be polled from Nexus, omdb, etc... I may end up doing that, but I'm not convinced yet that the effort is worth it when the logs will have to be accessed for details anyway.

@plotnick

Copy link
Copy Markdown
Contributor

That all makes sense, thanks for the detailed explanation!

@andrewjstone andrewjstone merged commit 842226d into main Sep 24, 2025
17 checks passed
@andrewjstone andrewjstone deleted the tq-lrtq-protocol branch September 24, 2025 17:48
leftwo pushed a commit that referenced this pull request Sep 26, 2025
This PR completes the first version of the sans-io trust quorum protocol
implementation.

LRTQ upgrade can now be started via
`Node::coordinate_upgrade_from_lrtq`.
This triggers the coordinating node to start collecting the LRTQ key
shares so that they can be used to construct the LRTQ rack secret via
the bootstore code. After this occurs, a Prepare message is sent out
with this old rack secret encrypted in a manner identical to a normal
reconfiguration. The prepare and commit paths remain the same.

The cluster proptest was updated to sometimes start out with an
existing LRTQ configuration and then to upgrade from there. Like normal
reconfigurations it allows aborting and pre-empting of the LRTQ upgrade
with a new attempt at a higher epoch. In production this is how we
"retry"
if the coordinating node crashes prior to commit, or more accurately, if
nexus can't talk to the coordinating node for some period of time and
just moves on. After the LRTQ upgrade commits, normal reconfigurations
are run.

We also remove unnecessary config related messages in this commit.
Since a `Configuration` does not contain sensitive information it can be
retrieved when Nexus polls the coordinator before it commits. Then Nexus
can save this info and send it in `PrepareAndCommit` messages rather
than having the receiving node try to find a live peer with the config
prior to collecting shares. This is a nice optimization that reduces
protocol complexity a bit. This removal allowed removing the TODO in the
message `match` statement in `Node::handle` and completing the protocol.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants