[sled-agent] Stop using /var/oxide (on ramdisk), start using config dataset (on M.2) #2999
…zones, zone_name -> zone_type, config -> ledger
## Before this PR

Running on rack2 and calling `omicron-package uninstall` would involve a fatal termination of the connection, as it would delete the `cxgbe0/ll` and `cxgbe1/ll` IP addresses necessary for contacting the sled.

## After this PR

Those addresses are left alone. This is pretty useful for development, as it allows us to run `uninstall` to cleanly wipe a Gimlet, preparing it for future "clean installs".
    ///
    /// NOTE: Be careful when modifying this path - the installation tools will
    /// **remove the entire directory** to re-install/uninstall the system.
    pub const OMICRON_CONFIG_PATH: &'static str = "/var/oxide";
I also grepped for it, and didn't find any references to /var/oxide that exist within the global zone.
    // Wait for at least the M.2 we booted from to show up.
    //
    // This gives the bootstrap agent a chance to read locally-stored
    // configs if any exist.
    loop {
        match agent.storage_resources.boot_disk().await {
            Some(disk) => {
                info!(agent.log, "Found boot disk M.2: {disk:?}");
                break;
            }
            None => {
                info!(agent.log, "Waiting for boot disk M.2...");
                tokio::time::sleep(core::time::Duration::from_millis(250))
                    .await;
            }
        }
    }
This is new. Without it, we might fail to catch our own boot disk, because the storage manager can lag behind.
I could rework this into something less retry-based, but it seems to work.
This makes sense to me. Seems fine to retry, since we need M.2s to boot.
    let config_dirs = self
        .storage_resources
        .all_m2_mountpoints(sled_hardware::disk::CONFIG_DATASET)
This is basically the same code it was before, but we're removing everything within the dataset, rather than deleting the dataset itself.
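The difference between wiping a dataset's contents and deleting the dataset can be illustrated with a minimal sketch. This is not the code in this PR: `clear_dir_contents` is a hypothetical helper, and it treats the config dataset's mountpoint as an ordinary directory, using only the standard library.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not from the PR): removes every entry inside `dir`
/// but leaves `dir` itself in place, so a mountpoint would survive.
fn clear_dir_contents(dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        // `file_type()` does not follow symlinks, so a symlink to a
        // directory is unlinked rather than traversed.
        if entry.file_type()?.is_dir() {
            fs::remove_dir_all(&path)?;
        } else {
            fs::remove_file(&path)?;
        }
    }
    Ok(())
}
```

Applied to each path returned for the config dataset, a helper like this would empty the directory while the dataset stays mounted and ready for the next install.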
        }
    }

    self.storage
This was a bug! Without this, the bootstrap agent might miss "seeing" new disks that appear after hardware tracking starts, but before we call .monitor.
By adding this call, the "full hardware scan" re-emits those updates.
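The race described here can be shown with a toy model. Everything in it is hypothetical: `HardwareTracker`, `disk_added`, `monitor`, and `full_scan` are illustrative names built on standard-library channels, not the sled-agent's actual types.

```rust
use std::collections::BTreeSet;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Toy stand-in for hardware tracking (an assumption, not the real API).
struct HardwareTracker {
    disks: BTreeSet<String>,
    subscribers: Vec<Sender<String>>,
}

impl HardwareTracker {
    fn new() -> Self {
        HardwareTracker { disks: BTreeSet::new(), subscribers: Vec::new() }
    }

    /// Records a disk and notifies whoever is subscribed right now.
    fn disk_added(&mut self, name: &str) {
        self.disks.insert(name.to_string());
        for tx in &self.subscribers {
            let _ = tx.send(name.to_string());
        }
    }

    /// Starts monitoring. Disks added before this call were never sent here.
    fn monitor(&mut self) -> Receiver<String> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Re-emits every known disk, closing the gap left by a late `monitor`.
    fn full_scan(&self) {
        for disk in &self.disks {
            for tx in &self.subscribers {
                let _ = tx.send(disk.clone());
            }
        }
    }
}
```

In this model, a disk added between `new()` and `monitor()` is invisible to the subscriber until `full_scan()` runs, which mirrors the window the added call is meant to cover.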
    let storage_manager = StorageManager::new(&log).await;

    // If our configuration asks for synthetic zpools, insert them now.
Now that the bootstrap agent has a dependency on M.2-based storage, we rely on these synthetic zpools being inserted earlier.
    }

    impl<T: Ledgerable> Ledger<T> {
        /// Creates a ledger with a new initial value, ready to be written to
I sorta needed to change these APIs for the RSS markers to work okay. Basically, if there was no file, I wanted to know about it, rather than getting a "default" value back.
This PR removes the constraint that Ledgerable also implement Default, and updates these constructors to cope.
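A rough sketch of what dropping the `Default` bound buys, under stated assumptions: the trait methods `to_text`/`from_text`, the constructors `new_with` and `load`, and the `Generation` type below are all invented for illustration and are not the signatures in this PR. The point is only that a missing file surfaces as `None` instead of a silently defaulted value.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `Ledgerable`. Note: no `Default` bound.
trait Ledgerable: Sized {
    fn to_text(&self) -> String;
    fn from_text(text: &str) -> Option<Self>;
}

struct Ledger<T: Ledgerable> {
    path: PathBuf,
    value: T,
}

impl<T: Ledgerable> Ledger<T> {
    /// Creates a ledger from an explicit initial value; nothing is read.
    fn new_with(path: &Path, value: T) -> Self {
        Ledger { path: path.to_path_buf(), value }
    }

    /// Loads an existing ledger, or returns `None` if the file is missing
    /// or unparseable, rather than inventing a default.
    fn load(path: &Path) -> Option<Self> {
        let text = fs::read_to_string(path).ok()?;
        let value = T::from_text(&text)?;
        Some(Ledger { path: path.to_path_buf(), value })
    }

    /// Writes the current value to the ledger's path.
    fn commit(&self) -> io::Result<()> {
        fs::write(&self.path, self.value.to_text())
    }

    fn data(&self) -> &T {
        &self.value
    }
}

/// Example payload (hypothetical): a bare generation counter.
struct Generation(u64);

impl Ledgerable for Generation {
    fn to_text(&self) -> String {
        self.0.to_string()
    }
    fn from_text(text: &str) -> Option<Self> {
        text.trim().parse().ok().map(Generation)
    }
}
```

A caller that needs to distinguish "never initialized" from "initialized with some value", as the RSS markers do, can then branch on `load` returning `None`.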
This is ready for review!
This PR finds every spot where we use `/var/oxide`, a configuration directory which exists on the ramdisk, and replaces it with storage in the M.2 datasets.

Fixes #2970
Fixes #2998