During #1880 we ran into a case where Sled Agent discovered ZFS datasets with the "oxide:uuid" property set and attempted to create new Clickhouse and CockroachDB zones using them. This is expected behavior when the system has restarted (and, for now, when Sled Agent restarts). The problem in this case is that Sled Agent also needs metadata that's stored in /var/oxide, and that metadata was missing.
Relevant log entries:
04:54:04.634Z INFO SledAgent/StorageManager: StorageWorker loading fs cockroachdb on zpool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
04:54:04.681Z INFO SledAgent/StorageManager: Loading Dataset from /var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml
04:54:04.696Z WARN SledAgent/StorageManager: StorageWorker Failed to load dataset: Failed to perform I/O: read config for pool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset DatasetName { pool_name: "oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b", dataset_name: "cockroachdb" } from "/var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml": No such file or directory (os error 2)
As I understand it:
- when RSS asks Sled Agent to create a dataset for Clickhouse or CockroachDB, we create a ZFS dataset, assign it a uuid, and set the oxide:uuid property on the dataset to the value we generated
- then we do more dataset/zone initialization stuff
- then we record additional metadata about the zone (like its IP address?) into /var/oxide/$pool_uuid/$dataset_uuid
We identified at least two ways we can wind up with a dataset with nothing in /var/oxide for it:
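- If we bail out during the second step above (the dataset/zone initialization), after the ZFS dataset exists but before the metadata is written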
- If someone reinstalls Omicron (which wipes /var/oxide)
This should not come up in production, but if it ever does, presumably we want to flag this as an issue requiring Oxide support and otherwise ignore the dataset.
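As a rough illustration of "flag it and otherwise ignore the dataset", here is a minimal sketch that treats a missing metadata file as its own outcome instead of a generic I/O failure. It is not the Sled Agent's actual code: `DatasetLoad`, `metadata_path`, and `load_dataset_metadata` are hypothetical names, and the `<root>/<pool_uuid>/<dataset_uuid>.toml` layout is only inferred from the log lines above.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::{Path, PathBuf};

/// Outcome of trying to load the on-disk metadata for a discovered dataset.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum DatasetLoad {
    /// The metadata file was found; carries its raw contents.
    Loaded(String),
    /// The ZFS dataset exists but its metadata file does not.
    /// The caller would skip the dataset and flag it for support.
    Orphaned(PathBuf),
}

/// Builds the metadata path as it appears in the log:
/// <root>/<pool_uuid>/<dataset_uuid>.toml (layout assumed from the log).
fn metadata_path(root: &Path, pool_uuid: &str, dataset_uuid: &str) -> PathBuf {
    root.join(pool_uuid).join(format!("{}.toml", dataset_uuid))
}

/// Reads the metadata file, mapping "No such file or directory" to
/// `Orphaned` and passing every other I/O error through unchanged.
fn load_dataset_metadata(
    root: &Path,
    pool_uuid: &str,
    dataset_uuid: &str,
) -> io::Result<DatasetLoad> {
    let path = metadata_path(root, pool_uuid, dataset_uuid);
    match fs::read_to_string(&path) {
        Ok(contents) => Ok(DatasetLoad::Loaded(contents)),
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(DatasetLoad::Orphaned(path)),
        Err(e) => Err(e),
    }
}
```

A caller iterating over discovered datasets could log `Orphaned` at a higher severity and continue, while still treating other read errors as real failures.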
Tangentially: if we know all the metadata that we need at the time we create the ZFS dataset, we could store it in ZFS user properties and set those properties when the dataset is created. This way, there would never be a time that the dataset exists and the metadata doesn't. (This is a primary use case for ZFS user properties.) We might still have the problem that this dataset is essentially invalid -- see #1884. If we went down the path of replacing the /var/oxide metadata with ZFS user properties, we'd need some other way to identify when the dataset is from a previous install. Maybe tag each one with a uuid that's created when RSS determines the initial plan, then only consider datasets with the expected uuid? (Feel free to ignore all this if it doesn't seem like it'll simplify things.)
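To make the user-property idea a bit more concrete, here is a small sketch under stated assumptions. It only builds the argument list for `zfs create`, which accepts repeated `-o property=value` options, and checks a per-install tag. Only `oxide:uuid` comes from this issue; `oxide:install_uuid`, the dataset name, and both function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Builds the arguments for `zfs create` so that every user property is
/// set when the dataset is created, via repeated `-o name=value` options.
/// (Hypothetical helper; it does not run the command.)
fn zfs_create_args(dataset: &str, props: &BTreeMap<String, String>) -> Vec<String> {
    let mut args = vec!["create".to_string()];
    for (name, value) in props {
        args.push("-o".to_string());
        args.push(format!("{}={}", name, value));
    }
    args.push(dataset.to_string());
    args
}

/// Returns true only if the dataset carries the install uuid we expect.
/// `oxide:install_uuid` is an assumed property name, not an existing one.
fn belongs_to_install(props: &BTreeMap<String, String>, expected: &str) -> bool {
    props.get("oxide:install_uuid").map(String::as_str) == Some(expected)
}
```

With this shape, a dataset left over from a previous install would have a different (or missing) `oxide:install_uuid` and could be skipped during discovery.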