Sled Agent confused by mismatched state between ZFS and /var/oxide #1885

@davepacheco

During #1880 we ran into a case where the Sled Agent discovered ZFS datasets with the "oxide:uuid" property set and attempted to create new ClickHouse and CockroachDB zones using them. This is expected behavior when the system has restarted (and, for now, when Sled Agent restarts). The problem in this case is that Sled Agent also needs metadata that's stored in /var/oxide, and that metadata was missing.

Relevant log entries:

04:54:04.634Z  INFO SledAgent/StorageManager: StorageWorker loading fs cockroachdb on zpool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
04:54:04.681Z  INFO SledAgent/StorageManager: Loading Dataset from /var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml
04:54:04.696Z  WARN SledAgent/StorageManager: StorageWorker Failed to load dataset: Failed to perform I/O: read config for pool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset DatasetName { pool_name: "oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b", dataset_name: "cockroachdb" } from "/var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml": No such file or directory (os error 2)

As I understand it:

  1. when RSS asks Sled Agent to create a dataset for ClickHouse or CockroachDB, we create a ZFS dataset, generate a uuid for it, and set the oxide:uuid property on the dataset to that uuid
  2. then we do further dataset and zone initialization
  3. then we record additional metadata about the zone (like its IP address?) into /var/oxide/$pool_uuid/$dataset_uuid.toml

We identified at least two ways we can wind up with a dataset with nothing in /var/oxide for it:

  • If we bail out during step 2 above, after the dataset exists but before step 3 writes its metadata
  • If someone reinstalls Omicron (which wipes /var/oxide)

This should not come up in production, but if it ever does, presumably we want to flag this as an issue requiring Oxide support and otherwise ignore the dataset.
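To make the "flag it and otherwise ignore the dataset" idea concrete, here is a minimal sketch in Rust. It is not the actual Sled Agent code: the `DatasetLoad` type and both function names are hypothetical, and the only detail taken from the report is the path layout seen in the log above (`/var/oxide/$pool_uuid/$dataset_uuid.toml`). The sketch treats "file not found" as a distinct, non-fatal outcome that the caller can report and skip, while other I/O errors still propagate.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Outcome of trying to load the on-disk metadata for one discovered dataset.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub enum DatasetLoad {
    /// The metadata file exists; its raw contents are returned unparsed.
    Loaded(String),
    /// The ZFS dataset exists but its metadata file does not. The caller
    /// should flag this for support and skip the dataset.
    MissingMetadata(PathBuf),
}

/// Build `<root>/<pool_uuid>/<dataset_uuid>.toml`, mirroring the path
/// that appears in the log entries above.
pub fn metadata_path(root: &Path, pool_uuid: &str, dataset_uuid: &str) -> PathBuf {
    root.join(pool_uuid).join(format!("{dataset_uuid}.toml"))
}

/// Read the metadata file, mapping "not found" to `MissingMetadata`
/// instead of an error. Any other I/O error is returned to the caller.
pub fn load_dataset_metadata(
    root: &Path,
    pool_uuid: &str,
    dataset_uuid: &str,
) -> io::Result<DatasetLoad> {
    let path = metadata_path(root, pool_uuid, dataset_uuid);
    match fs::read_to_string(&path) {
        Ok(contents) => Ok(DatasetLoad::Loaded(contents)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            Ok(DatasetLoad::MissingMetadata(path))
        }
        Err(e) => Err(e),
    }
}
```

A caller could log `MissingMetadata` at a high severity and move on to the next dataset, rather than attempting to launch a zone for it.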


Tangentially: if we know all the metadata that we need at the time we create the ZFS dataset, we could store that into ZFS properties and we can set those properties when the dataset is created. This way, there would never be a time that the dataset exists and the metadata doesn't. (This is a primary use case for ZFS user properties.) We might still have the problem that this dataset is essentially invalid -- see #1884. If we went down the path of replacing the /var/oxide metadata with ZFS user properties, we'd need some other way to identify when the dataset is from a previous install. Maybe tag each one with a uuid that's created when RSS determines the initial plan, then only consider datasets with the expected uuid? (Feel free to ignore all this if it doesn't seem like it'll simplify things.)
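A rough sketch of what that could look like, in Rust, under stated assumptions: the property name `oxide:install_uuid`, the `DiscoveredDataset` type, and both functions are hypothetical and not part of Omicron. The first function only builds an argument list for `zfs create`, using its `-o property=value` option so that every property is set at creation time; it does not run the command. The second keeps only datasets tagged with the expected install uuid and returns the rest separately so they can be ignored or reported.

```rust
use std::collections::HashMap;

/// Hypothetical user property holding the uuid that RSS would generate
/// when it determines the initial plan. Not an existing Omicron property.
pub const INSTALL_UUID_PROP: &str = "oxide:install_uuid";

/// Minimal stand-in for a discovered ZFS dataset and its user properties.
#[derive(Debug, Clone, PartialEq)]
pub struct DiscoveredDataset {
    pub name: String,
    pub user_props: HashMap<String, String>,
}

/// Build the arguments for `zfs create`, passing each property as
/// `-o name=value` so it is set when the dataset is created.
pub fn zfs_create_args(dataset: &str, props: &[(&str, &str)]) -> Vec<String> {
    let mut args = vec!["create".to_string()];
    for (name, value) in props {
        args.push("-o".to_string());
        args.push(format!("{name}={value}"));
    }
    args.push(dataset.to_string());
    args
}

/// Split datasets into (tagged with the expected install uuid, all others).
/// Datasets with no install tag at all land in the second group.
pub fn partition_by_install<'a>(
    datasets: &'a [DiscoveredDataset],
    expected_install: &str,
) -> (Vec<&'a DiscoveredDataset>, Vec<&'a DiscoveredDataset>) {
    datasets.iter().partition(|d| {
        d.user_props.get(INSTALL_UUID_PROP).map(String::as_str) == Some(expected_install)
    })
}
```

With this shape, a dataset left over from a previous install would carry a different (or no) install uuid and fall into the second group, without relying on anything in /var/oxide.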

Labels: Sled Agent (Related to the Per-Sled Configuration and Management)
