
Partial Failure in RSS setup can cause hang due to sled agent setup #1879

@smklein

Description

  • Suppose RSS gets partway through setup and then fails, for whatever reason. Suppose also that every sled has a bootstrap agent that already knows how to start its sled agent. This information is recorded durably, so from that point onwards each sled agent boots on its own after a reboot, without waiting for RSS.
  • On reboot, RSS should attempt to re-inject the requests, using the same rss-plan.toml file.
  • If it does this, RSS will be stuck here:

// Forward the sled initialization requests to our sled-agent.
local_bootstrap_agent
    .initialize_sleds(
        plan.iter()
            .map(move |(bootstrap_addr, allocation)| {
                (
                    *bootstrap_addr,
                    allocation.initialization_request.clone(),
                    maybe_rack_secret_shares
                        .as_mut()
                        .map(|shares| shares.next().unwrap()),
                )
            })
            .collect(),
    )
    .await
    .map_err(SetupServiceError::SledInitialization)?;

  • This call to initialize_sleds ends up blocked on a request to bootstrap agents, for them to initialize their sled agents:

let sled_agent_initialize = || async {
    client
        .start_sled(request, trust_quorum_share.clone())
        .await
        .map_err(BackoffError::transient)?;
    Ok::<(), BackoffError<bootstrap_agent_client::Error>>(())
};
let log_failure = |error, _| {
    warn!(log, "failed to start sled agent"; "error" => ?error);
};
retry_notify(internal_service_policy(), sled_agent_initialize, log_failure)
    .await?;

  • The logs show that this request fails with the following error, returned by the bootstrap agent itself:

Request::SledAgentRequest(request, _trust_quorum_share) => {
    warn!(
        log, "Received sled agent request after we're initialized";
        "request" => ?request,
    );
    Err("Sled agent already initialized".to_string())
}

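I believe this is a bug: the bootstrap agent returns an error instead of Ok(()), and because RSS maps every start_sled error to BackoffError::transient and retries, RSS will be stuck behind a sled agent that booted by itself.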
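
To illustrate why the retry never gets past this error, here is a minimal sketch of a retry loop that distinguishes transient from permanent errors. Everything in it is a simplified stand-in: `BackoffError`, `retry`, and `start_sled_already_initialized` are hypothetical names written for this example, not the real backoff helpers or bootstrap agent client, and `max_attempts` exists only so the sketch terminates.

```rust
// Simplified stand-in for a backoff error type; not the real one used by RSS.
#[derive(Debug, PartialEq)]
enum BackoffError<E> {
    Transient(E),
    Permanent(E),
}

// Hypothetical retry helper: retries transient errors, stops immediately on
// permanent ones. `max_attempts` is an assumption added so the sketch ends;
// returns the number of attempts made alongside the final result.
fn retry<E>(
    max_attempts: usize,
    mut op: impl FnMut() -> Result<(), BackoffError<E>>,
) -> (usize, Result<(), E>) {
    let mut attempts = 0;
    loop {
        attempts += 1;
        match op() {
            Ok(()) => return (attempts, Ok(())),
            Err(BackoffError::Permanent(e)) => return (attempts, Err(e)),
            Err(BackoffError::Transient(e)) => {
                if attempts >= max_attempts {
                    return (attempts, Err(e));
                }
                // A real implementation would sleep with backoff here.
            }
        }
    }
}

// Stand-in for a bootstrap agent whose sled agent is already running:
// it answers every request with the same error.
fn start_sled_already_initialized() -> Result<(), String> {
    Err("Sled agent already initialized".to_string())
}
```

With the error classified as transient, the loop uses up every attempt it is given, because the bootstrap agent's answer never changes. Classifying it as permanent would at least surface the failure after one attempt, but RSS would still fail setup, which is why returning Ok from the bootstrap agent looks like the more useful fix.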

@jgallagher, WDYT - would it be okay to just idempotently return Ok if the sled agent already exists?
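
A rough sketch of what the idempotent behaviour could look like, under stated assumptions: `BootstrapAgent`, `SledAgentRequest`, and this `start_sled` are hypothetical stand-ins, not the actual bootstrap agent code. Comparing the replayed request against the stored one is an extra assumption of this sketch, not something proposed above; the simplest variant would return Ok for any request once the sled agent exists.

```rust
// Hypothetical stand-in for the sled agent request; the real type differs.
#[derive(Clone, Debug, PartialEq)]
struct SledAgentRequest {
    subnet: String,
}

// Hypothetical stand-in for the bootstrap agent's state.
#[derive(Default)]
struct BootstrapAgent {
    // Set once a sled agent has been started. In the real system this
    // information is recorded durably; here it only lives in memory.
    initialized_with: Option<SledAgentRequest>,
}

impl BootstrapAgent {
    fn start_sled(&mut self, request: SledAgentRequest) -> Result<(), String> {
        match &self.initialized_with {
            // First request: start the sled agent and remember the request.
            None => {
                self.initialized_with = Some(request);
                Ok(())
            }
            // Replay of the same request: nothing to do, report success.
            Some(existing) if *existing == request => Ok(()),
            // A different request is a real conflict and stays an error.
            Some(existing) => Err(format!(
                "Sled agent already initialized with {:?}, got {:?}",
                existing, request
            )),
        }
    }
}
```

With this shape, RSS re-injecting the same plan after a partial failure would see Ok from sleds that already booted their sled agents, while a mismatched request would still be reported as an error.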

Labels

Sled Agent (Related to the Per-Sled Configuration and Management)
bootstrap services (For those occasions where you want the rack to turn on)
