[sled-agent] Stop using /var/oxide (on ramdisk), start using config dataset (on M.2) #2999

Merged
smklein merged 67 commits into main from var-oxide-how-about-nar-oxide on May 9, 2023
smklein commented May 3, 2023

Collaborator

This PR finds any spot where we use /var/oxide -- a configuration directory which exists on the ramdisk -- and replaces it with storage in the M.2 datasets.

Fixes #2970
Fixes #2998

smklein added 30 commits April 28, 2023 09:54
…zones, zone_name -> zone_type, config -> ledger
## Before this PR

On rack2, calling `omicron-package uninstall` would fatally terminate
the connection, because it deleted the `cxgbe0/ll` and `cxgbe1/ll` IP
addresses needed to contact the sled.

## After this PR

Those addresses are left alone. This is pretty useful for development,
as it allows us to run `uninstall` to cleanly wipe a Gimlet, preparing
it for future "clean installs".
Comment thread common/src/lib.rs
///
/// NOTE: Be careful when modifying this path - the installation tools will
/// **remove the entire directory** to re-install/uninstall the system.
pub const OMICRON_CONFIG_PATH: &'static str = "/var/oxide";

Collaborator Author

I also grepped for it, and didn't find any references to /var/oxide that exist within the global zone.

Comment thread sled-agent/src/bootstrap/agent.rs
Comment on lines +395 to +411
// Wait for at least the M.2 we booted from to show up.
//
// This gives the bootstrap agent a chance to read locally-stored
// configs if any exist.
loop {
    match agent.storage_resources.boot_disk().await {
        Some(disk) => {
            info!(agent.log, "Found boot disk M.2: {disk:?}");
            break;
        }
        None => {
            info!(agent.log, "Waiting for boot disk M.2...");
            tokio::time::sleep(core::time::Duration::from_millis(250))
                .await;
        }
    }
}

Collaborator Author

This is new. Without this, it's possible that we simply don't catch our own boot disk, because the storage manager lags behind.

I could possibly update this to a less "retry-based" interface, but it seems to work.

Contributor

This makes sense to me. Seems fine to retry, since we need M.2s to boot.
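
For reference, a less retry-based interface could look roughly like the sketch below. It is a minimal, synchronous illustration using only the standard library, not the code in this PR: `BootDiskSlot`, `publish`, and `wait_for_boot_disk` are hypothetical names, and the disk is represented as a plain `String`. In async code, a watch-style channel would presumably play the role the `Condvar` plays here.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the storage side's view of the boot disk.
#[derive(Default)]
struct BootDiskSlot {
    disk: Mutex<Option<String>>,
    ready: Condvar,
}

impl BootDiskSlot {
    /// Called (in this sketch) when the boot M.2 is detected.
    fn publish(&self, name: &str) {
        *self.disk.lock().unwrap() = Some(name.to_string());
        self.ready.notify_all();
    }

    /// Blocks until a boot disk has been published, without polling.
    fn wait_for_boot_disk(&self) -> String {
        let guard = self.disk.lock().unwrap();
        // `wait_while` re-checks the condition on every wakeup, so
        // spurious wakeups are handled for us.
        let guard = self.ready.wait_while(guard, |d| d.is_none()).unwrap();
        guard.clone().expect("wait_while only returns once a disk is set")
    }
}

fn main() {
    let slot = Arc::new(BootDiskSlot::default());
    let publisher = Arc::clone(&slot);

    // Simulate the storage manager lagging behind the bootstrap agent.
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        publisher.publish("m2-boot");
    });

    let disk = slot.wait_for_boot_disk();
    println!("Found boot disk M.2: {disk}");
    handle.join().unwrap();
}
```

The trade-off is extra plumbing between the storage side and the waiter; the polling loop above is simpler and, as noted, fine for a disk that must appear for the sled to boot at all.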

Comment on lines +848 to +850
let config_dirs = self
    .storage_resources
    .all_m2_mountpoints(sled_hardware::disk::CONFIG_DATASET)

Collaborator Author

This is basically the same code as before, but we're removing everything within the dataset, rather than deleting the dataset itself.
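
To illustrate the distinction, here is a minimal standard-library sketch of "empty the directory, keep the directory". It is not the code from this PR: `clear_dir_contents` is a hypothetical helper, and the scratch directory under the system temp dir merely stands in for a config dataset mountpoint.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes every entry inside `dir` but leaves `dir` itself in place.
fn clear_dir_contents(dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        // `DirEntry::file_type` does not follow symlinks, so a symlink
        // is unlinked rather than traversed.
        if entry.file_type()?.is_dir() {
            fs::remove_dir_all(&path)?;
        } else {
            fs::remove_file(&path)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical scratch directory standing in for a dataset mountpoint.
    let root = std::env::temp_dir()
        .join(format!("config-dataset-sketch-{}", std::process::id()));
    fs::create_dir_all(root.join("nested"))?;
    fs::write(root.join("ledger.toml"), "generation = 1\n")?;
    fs::write(root.join("nested").join("marker"), "")?;

    clear_dir_contents(&root)?;

    // The directory survives, but it is now empty.
    assert!(root.is_dir());
    assert_eq!(fs::read_dir(&root)?.count(), 0);

    fs::remove_dir(&root)?;
    Ok(())
}
```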

}
}

self.storage

Collaborator Author

This was a bug! Without this, the bootstrap agent might miss "seeing" new disks that appear after hardware tracking starts, but before we call .monitor.

By adding this call, the "full hardware scan" re-emits those updates.
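
The race can be reproduced in miniature. The sketch below is a toy model under stated assumptions, not the sled-agent types: `HardwareTracker`, `disk_added`, `subscribe`, and `full_scan` are hypothetical names, and events are plain disk names sent over a standard-library channel.

```rust
use std::collections::BTreeSet;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical tracker: remembers known disks and notifies subscribers.
struct HardwareTracker {
    disks: BTreeSet<String>,
    subscribers: Vec<Sender<String>>,
}

impl HardwareTracker {
    fn new() -> Self {
        Self { disks: BTreeSet::new(), subscribers: Vec::new() }
    }

    /// Records a newly seen disk and notifies current subscribers only.
    fn disk_added(&mut self, name: &str) {
        if self.disks.insert(name.to_string()) {
            for s in &self.subscribers {
                let _ = s.send(name.to_string());
            }
        }
    }

    /// Registers a listener; it sees only events sent after this call.
    fn subscribe(&mut self) -> Receiver<String> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Re-emits every known disk so late subscribers can catch up.
    fn full_scan(&self) {
        for disk in &self.disks {
            for s in &self.subscribers {
                let _ = s.send(disk.clone());
            }
        }
    }
}

fn main() {
    let mut hw = HardwareTracker::new();

    // The disk shows up after tracking starts, before anyone listens.
    hw.disk_added("m2-a");
    let rx = hw.subscribe();
    assert!(rx.try_recv().is_err()); // the event was missed

    // A full scan re-emits the current state to the late subscriber.
    hw.full_scan();
    assert_eq!(rx.try_recv().unwrap(), "m2-a");
}
```

One consequence of this pattern is that subscribers may see the same disk more than once, so in this sketch the handling of an "added" event would need to be idempotent.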


let storage_manager = StorageManager::new(&log).await;

// If our configuration asks for synthetic zpools, insert them now.

Collaborator Author

Now that the bootstrap agent has a dependency on M.2-based storage, we rely on these synthetic zpools being inserted earlier.

Comment thread sled-agent/src/ledger.rs
}

impl<T: Ledgerable> Ledger<T> {
    /// Creates a ledger with a new initial value, ready to be written to

Collaborator Author

I sorta needed to change these APIs for the RSS markers to work okay. Basically, if there was no file, I wanted to know about it, rather than getting a "default" value back.

This PR removes the constraint that Ledgerable also implement Default, and updates these constructors to cope.
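
A rough sketch of that API shape, assuming a simplified synchronous ledger built on the standard library only. Every name here is hypothetical and chosen for illustration (`read`, `new_with`, `commit`, the `generation`/`parse`/`render` trait methods, and the toy `RssMarker` payload); the real trait and constructors in `sled-agent/src/ledger.rs` may differ.

```rust
use std::fs;
use std::path::PathBuf;

/// Hypothetical trait for this sketch; note there is no `Default` bound.
trait Ledgerable: Sized {
    fn generation(&self) -> u64;
    fn parse(text: &str) -> Option<Self>;
    fn render(&self) -> String;
}

struct Ledger<T: Ledgerable> {
    value: T,
    paths: Vec<PathBuf>,
}

impl<T: Ledgerable> Ledger<T> {
    /// Reads every path and keeps the newest copy.
    /// Returns `None` when no readable copy exists anywhere.
    fn read(paths: Vec<PathBuf>) -> Option<Self> {
        let value = paths
            .iter()
            .filter_map(|p| fs::read_to_string(p).ok())
            .filter_map(|text| T::parse(&text))
            .max_by_key(|v| v.generation())?;
        Some(Self { value, paths })
    }

    /// Starts a ledger from a caller-supplied initial value.
    fn new_with(paths: Vec<PathBuf>, value: T) -> Self {
        Self { value, paths }
    }

    /// Writes the value to every path; this sketch stops at the first error.
    fn commit(&self) -> std::io::Result<()> {
        for path in &self.paths {
            fs::write(path, self.value.render())?;
        }
        Ok(())
    }
}

/// Toy payload: just a generation number stored as text.
struct RssMarker {
    generation: u64,
}

impl Ledgerable for RssMarker {
    fn generation(&self) -> u64 {
        self.generation
    }
    fn parse(text: &str) -> Option<Self> {
        text.trim().parse::<u64>().ok().map(|generation| Self { generation })
    }
    fn render(&self) -> String {
        self.generation.to_string()
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir()
        .join(format!("ledger-sketch-{}", std::process::id()));
    let _ = fs::remove_file(&path);

    // No file yet: the caller learns that, instead of getting a default.
    assert!(Ledger::<RssMarker>::read(vec![path.clone()]).is_none());

    // First-time setup supplies the initial value explicitly.
    Ledger::new_with(vec![path.clone()], RssMarker { generation: 1 }).commit()?;

    let ledger = Ledger::<RssMarker>::read(vec![path.clone()]).expect("ledger exists");
    assert_eq!(ledger.value.generation(), 1);
    fs::remove_file(&path)
}
```

Returning `Option` from the read path is what lets a marker-style caller distinguish "never written" from "written with some value", which a `Default`-based constructor would hide.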

smklein commented May 5, 2023

Collaborator Author

This is ready for review!

smklein enabled auto-merge (squash) May 9, 2023 17:59
smklein merged commit 4b7fee5 into main May 9, 2023
smklein deleted the var-oxide-how-about-nar-oxide branch May 9, 2023 19:13