- Suppose RSS gets partway through setup and then fails, for whatever reason. Suppose also that, by then, every sled's bootstrap agent knows how to start its sled agent. That information is recorded durably, so from this point onwards the sled agents boot on their own, without waiting for RSS.
- On reboot, RSS should attempt to re-inject the requests, using the same `rss-plan.toml` file.
- If it does this, RSS will be stuck here:

      // Forward the sled initialization requests to our sled-agent.
      local_bootstrap_agent
          .initialize_sleds(
              plan.iter()
                  .map(move |(bootstrap_addr, allocation)| {
                      (
                          *bootstrap_addr,
                          allocation.initialization_request.clone(),
                          maybe_rack_secret_shares
                              .as_mut()
                              .map(|shares| shares.next().unwrap()),
                      )
                  })
                  .collect(),
          )
          .await
          .map_err(SetupServiceError::SledInitialization)?;

- This call to `initialize_sleds` ends up blocked on a request to the bootstrap agents, asking them to initialize their sled agents:

      let sled_agent_initialize = || async {
          client
              .start_sled(request, trust_quorum_share.clone())
              .await
              .map_err(BackoffError::transient)?;

          Ok::<(), BackoffError<bootstrap_agent_client::Error>>(())
      };

      let log_failure = |error, _| {
          warn!(log, "failed to start sled agent"; "error" => ?error);
      };
      retry_notify(internal_service_policy(), sled_agent_initialize, log_failure)
          .await?;

- The logs show that this request fails with the following error, returned by the bootstrap agent itself:

      Request::SledAgentRequest(request, _trust_quorum_share) => {
          warn!(
              log, "Received sled agent request after we're initialized";
              "request" => ?request,
          );
          Err("Sled agent already initialized".to_string())
      }

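I believe this is a bug: the bootstrap agent returns an error instead of `Ok(())`, and the caller maps every `start_sled` error to `BackoffError::transient`, so RSS keeps retrying and stays stuck behind a sled agent that already booted by itself.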
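To make the stuck state concrete, here is a minimal, self-contained sketch of a retry loop that treats errors the way the snippet above does. It is not the real `retry_notify` or `internal_service_policy`: `RetryError`, `retry`, and `already_initialized` are hypothetical stand-ins, and the attempt cap exists only so the sketch terminates.

```rust
/// Hypothetical stand-in for a backoff error: transient errors are retried,
/// permanent errors end the loop immediately.
#[derive(Debug)]
enum RetryError<E> {
    Transient(E),
    Permanent(E),
}

/// Retry `op` until it succeeds, fails permanently, or `max_attempts` is
/// reached. Returns how many attempts were made, plus the final outcome.
/// (Assumption: the cap is only here to keep the demonstration finite.)
fn retry<E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<(), RetryError<E>>,
) -> (u32, Result<(), E>) {
    let mut last = None;
    for attempt in 1..=max_attempts {
        match op() {
            Ok(()) => return (attempt, Ok(())),
            Err(RetryError::Permanent(e)) => return (attempt, Err(e)),
            // A transient error is remembered and the loop simply tries again.
            Err(RetryError::Transient(e)) => last = Some(e),
        }
    }
    (max_attempts, Err(last.expect("max_attempts must be at least 1")))
}

/// Models a bootstrap agent whose sled agent has already started on its own:
/// every initialization request is rejected with the same error.
fn already_initialized() -> Result<(), String> {
    Err("Sled agent already initialized".to_string())
}
```

Calling `retry(5, || already_initialized().map_err(RetryError::Transient))` uses up all five attempts and still ends in the same error, because nothing about the agent's state changes between attempts. Mapping the error to `RetryError::Permanent` would stop after one attempt, but RSS would still fail rather than proceed.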
@jgallagher, WDYT - would it be okay to make this idempotent and just return `Ok` if the sled agent already exists?
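As a rough sketch of what the idempotent behavior could look like, here is a simplified model rather than the actual bootstrap agent code. `BootstrapAgent`, `SledAgentRequest`, and the `sled_id` field are hypothetical. Comparing the incoming request against the one the sled agent was started with is an extra assumption beyond what is proposed above; it keeps a mismatched request from being silently accepted.

```rust
/// Hypothetical, simplified initialization request.
#[derive(Debug, Clone, PartialEq)]
struct SledAgentRequest {
    sled_id: u32, // illustrative field only
}

/// Hypothetical model of the bootstrap agent's view of its sled agent.
#[derive(Debug, Default)]
struct BootstrapAgent {
    /// The request the running sled agent was started with, if any.
    initialized_with: Option<SledAgentRequest>,
}

impl BootstrapAgent {
    /// Start the sled agent, treating a repeat of the same request as success.
    fn start_sled(&mut self, request: SledAgentRequest) -> Result<(), String> {
        if let Some(existing) = &self.initialized_with {
            return if *existing == request {
                // Already initialized with this exact request: nothing to do.
                Ok(())
            } else {
                // Assumption: a different request is still an error.
                Err(format!(
                    "Sled agent already initialized with {:?}, got {:?}",
                    existing, request
                ))
            };
        }
        // First request: record it and report success.
        self.initialized_with = Some(request);
        Ok(())
    }
}
```

With this shape, re-injecting the same plan after an RSS restart gets `Ok(())` from sleds that are already up, so the retry loop can finish, while a conflicting plan still surfaces as an error.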
Code referenced above, in order:
- omicron/sled-agent/src/rack_setup/service.rs, lines 560 to 576 in b4366f1
- omicron/sled-agent/src/bootstrap/rss_handle.rs, lines 92 to 105 in f8f076a
- omicron/sled-agent/src/bootstrap/server.rs, lines 362 to 368 in b4366f1