Skip to content

decide whether each API should be server- or client-side-versioned#7138

Merged
davepacheco merged 18 commits into
mainfrom
dap/api-dag
Dec 14, 2024
Merged

decide whether each API should be server- or client-side-versioned#7138
davepacheco merged 18 commits into
mainfrom
dap/api-dag

Conversation

@davepacheco

@davepacheco davepacheco commented Nov 22, 2024

Copy link
Copy Markdown
Collaborator

This PR:

  • Extends the existing Dropshot/Progenitor API metadata to identify for each API whether it will be versioned on the server-side exclusively or via client-side versioning as well. (I need to write an RFD on this but for a summary of what I mean see this deck).
  • Adds a new cargo xtask ls-apis dag-check cargo xtask ls-apis check tool that:
    • checks the validity of the versioning choices we've made in the metadata (more on this below)
    • proposes changes (this may never be useful again but it was very helpful for me to make these choices and I thought it'd be worth keeping the code around)
  • updates the cargo xtask ls-apis adoc command to include this information in the summary table. The result is that the table in our wiki will include this information. See oxidecomputer/meta#524 for what this looks like. (I will land that PR after this one lands.)

(note: this was called dag-check when I wrote the rest of this but it's now called check)

Copying a summary from the README, after this PR:

  • All Dropshot/Progenitor APIs in the system are identified with this tool (as far as we know).
  • Every API is annotated with metadata in api-manifest.toml.
  • This metadata specifies whether versioning for the API is managed on the server side exclusively or using client-side versioning as well. We've made this choice for every existing API.
  • There are no cycles in the server-side-managed-API dependency graph.

This tool verifies these properties.

If I haven't messed this up too badly, this is a pretty big step because:

If you add a new API or a new API dependency without adding corresponding metadata, or if that metadata breaks these constraints, you will have to fix this up before you can land the change to Omicron.

I hope this will help keep us from accidentally introducing new circular dependencies.

Goals

My main goals here, in order:

  • De-risk this whole API versioning plan by figuring out if there is a plausible update DAG. This involves making the decisions about what components (or dependencies) are in the DAG.
  • Record these decisions durably in the metadata.
  • Include the tooling that was used to help make these decisions.
  • Have the tooling verify that the choices are all valid (i.e., that we didn't create an update graph with cycles).

How I made the choice for each API

Where possible, we want APIs to be server-side-versioned only so that clients don't have to think about this. See the deck I linked above for more on this.

Initially, I had all APIs marked as versioned_how = "unknown", built the check in dag-check that the "server-side" graph be acyclic, and then built some facilities in dag-check for it to propose that an API be client- or server-managed based on heuristics like "if this API has no deployed consumers, then it can be server-side only" and "if two components depend directly on each other, one of them must be client-side-managed".

I also started with the assumption that we'll want APIs that Nexus talks to to be server-side-versioned. That's because control generally flows from Nexus to the other services. It'd be pretty painful if Nexus had to have multiple client versions for every service it talked to. Relatedly, the assumption is that the APIs from Nexus -> other services are generally more complex and volatile than in the other direction.

With this, I iteratively looked at the proposals, checked the visual DAG (from cargo xtask ls-apis servers --output-format=dot) and made some judgment calls.

The results

On success, the tool prints out a summary of how our APIs are versioned:

$ cargo xtask ls-apis dag-check
...
Server-managed APIs:

    Clickhouse Cluster Admin for Keepers (clickhouse-admin-keeper-client, exposed by omicron-clickhouse-admin)
    Clickhouse Cluster Admin for Servers (clickhouse-admin-server-client, exposed by omicron-clickhouse-admin)
    Clickhouse Single-Node Cluster Admin (clickhouse-admin-single-client, exposed by omicron-clickhouse-admin)
    CockroachDB Cluster Admin (cockroach-admin-client, exposed by omicron-cockroach-admin)
    Crucible Agent (crucible-agent-client, exposed by crucible-agent)
    Crucible Control (for testing only) (crucible-control-client, exposed by propolis-server)
    Crucible Pantry (crucible-pantry-client, exposed by crucible-pantry)
    Maghemite DDM Admin (ddm-admin-client, exposed by ddmd)
    DNS Server (dns-service-client, exposed by dns-server)
    Management Gateway Service (gateway-client, exposed by omicron-gateway)
    Maghemite MG Admin (mg-admin-client, exposed by mgd)
    External API (oxide-client, exposed by omicron-nexus)
    Oximeter (oximeter-client, exposed by oximeter-collector)
    Propolis (propolis-client, exposed by propolis-server)
    Sled Agent (sled-agent-client, exposed by omicron-sled-agent)
    Wicketd (wicketd-client, exposed by wicketd)


Client-managed API:

    Bootstrap Agent (bootstrap-agent-client, exposed by omicron-sled-agent)
        reason: depends on itself (i.e., instances call each other)
    Dendrite DPD (dpd-client, exposed by dpd)
        reason: Sled Agent calling DPD creates a circular dependency
    Wicketd Installinator (installinator-client, exposed by wicketd)
        reason: client is provided implicitly by the operator
    Nexus Internal API (nexus-client, exposed by omicron-nexus)
        reason: Circular dependencies between Nexus and other services
    Crucible Repair (repair-client, exposed by crucible-downstairs)
        reason: depends on itself (i.e., instances call each other)
    Repo Depot API (repo-depot-client, exposed by omicron-sled-agent)
        reason: depends on itself (i.e., instances call each other)


APIs with unknown version management: none

In terms of the client-side-managed APIs:

  • Bootstrap Agent, Crucible Repair, and Repo Depot: the component exposing these APIs necessarily talk to each other as peers, so they're necessarily circular and must be client-managed
  • I marked Installinator API as client-managed because the only client is Installinator itself, and that's actually provided at runtime by the operator. We don't control that. If we were ever to rev this API, we'd need a way to validate that the Installinator in the Recovery OS image in the TUF repo that the operator uploaded to Wicket is compatible with the Installinator API in Wicket. In practice we're very unlikely to rev this any time soon.
  • DPD currently has a circular dependency with Sled Agent so one of them has to be client-side-managed. I chose DPD rather than sled agent because the Sled Agent API is pretty complex and changes quite a lot.
  • The Nexus internal API is circular with several other APIs and we've long said we wanted this direction to be client-side managed (explained above).

So all of that seems pretty reasonable and everything else was able to be server-side-managed.

Out of curiosity, I did this quick check to see how volatile each API was: in each of the Omicron, Dendrite, Maghemite, Propolis, and Crucible repos, I ran this command to count how many times each of the APIs has been changed in any way in 2024 (almost 11 months):

$ for x in openapi/*.json; do count="$(git log --follow --pretty=%ai $x | grep -c ^2024)"; printf "%3d %s\n" count $x; done

Putting the results from all repos together and sorting them:

  3 openapi/cockroach-admin.json
  3 openapi/crucible-control.json
  3 openapi/crucible-pantry.json
  3 openapi/ddm-admin.json
  3 openapi/dns-server.json
  3 openapi/dsc-control.json
  3 openapi/installinator.json
  3 openapi/repo-depot.json
  4 openapi/clickhouse-admin-server.json
  4 openapi/clickhouse-admin-single.json
  4 openapi/crucible-agent.json
  6 openapi/downstairs-repair.json
  7 openapi/gateway.json
  7 openapi/oximeter.json
 10 openapi/mg-admin.json
 11 openapi/dpd.json
 11 openapi/propolis-server.json
 12 openapi/clickhouse-admin-keeper.json
 22 openapi/bootstrap-agent.json
 24 openapi/wicketd.json
 62 openapi/sled-agent.json
 83 openapi/nexus-internal.json
 87 openapi/nexus.json

The volatility of the Nexus-internal API, which is unfortunately client-side-managed, makes one wonder if we can rework things so that fewer things need to invoke it (e.g., use long polls in the other direction, or use much simpler APIs that just cause Nexus to reach back out to the other service for details? or in the case of Oximeter, eliminate the use of the API altogether as we've discussed?).

Caveats

This is still a little messy. For example, the metadata has ways to mark both an edge (dependency) and a particular API as client-side or server-side managed.

That's because the tool already supported the edge annotations, but while doing this, I decided it would be way too tedious to figure this out if I had to annotate every edge, when for the vast majority of APIs it's going to be one or the other. We should clean that up in the long term but I think it's a fair bit of work and doesn't help de-risk this plan.

I think there are a few APIs where we might want this choice to be per-edge, not per-API. For example: DPD's API is marked client-managed because of its circular dependency with Sled Agent. That may be unavoidable between those two because there's no way to update DPD before all sled agents, nor to update all sled agents before updating DPD (since both are contained in the host OS deployment unit). However, it should be easy to make sure Nexus (and maybe other DPD consumers) are updated after DPD. And that'd be very valuable. So we may want to say that some of these edges are client-side-managed while some of the others are server-side-managed.

@andrewjstone andrewjstone self-requested a review December 12, 2024 22:30

@sunshowers sunshowers left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a few comments, nothing major.

Comment thread .github/buildomat/build-and-test.sh Outdated
ptime -m cargo xtask ls-apis deployment-units &&
ptime -m cargo xtask ls-apis servers
ptime -m cargo xtask ls-apis servers &&
ptime -m cargo xtask ls-apis dag-check

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit:

cargo xtask ls-apis dag-check

Could this just be check, or check-dag? I personally like anything that does a check have its name start with check.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I clearly overthought this. I'm trying to resolve this (useful!) guideline with a couple of other rules/guidelines I use:

  • I usually use the subject-verb[-object] naming convention (hence "dag-check"), both in code and shell command names. This is pretty arbitrary but I do find it results in more discoverable code/commands because related things are grouped together. This would be a bigger deal if we ever add other dag subcommands (i.e., I think dag-check and dag-display is much better than check-dag and display-dag, and also lends itself to dag check / dag display).
  • I follow clap's default of having the Rust struct/function names match the command names so it's easy to find the source for a command.

So if we change this to check-dag: (1) this gets awkward if we add other dag commands later, and (2) either we change all the Rust names also or we don't, but either way, they diverge from some of these conventions.

Like I said, I'm overthinking this.

But what if we just call this cargo xtask ls-apis check on the outside? I'd have the command (the names in src/bin/ls-apis.rs) call it "check" but the library interface would still call it dag_check()/DagCheck. If we add other checks in the future, we'd probably want them done by ls-apis check anyway.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and did this in 26bbf1c.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

Comment thread dev-tools/ls-apis/src/api_metadata.rs
Comment thread dev-tools/ls-apis/src/system_apis.rs
Comment thread dev-tools/ls-apis/src/system_apis.rs
Comment thread dev-tools/ls-apis/src/system_apis.rs Outdated
Comment thread dev-tools/ls-apis/README.adoc Outdated
Comment thread dev-tools/ls-apis/src/api_metadata.rs Outdated
@davepacheco davepacheco merged commit 3449a64 into main Dec 14, 2024
@davepacheco davepacheco deleted the dap/api-dag branch December 14, 2024 20:14
@davepacheco

Copy link
Copy Markdown
Collaborator Author

I neglected to actually write down the most important results here: the sequence. Running this tool now (using cargo xtask ls-apis deployment-units --output-format dot > input.dot and then dot -T png -o output.png input.dot), I get this:

apis-by-unit

This shows that:

  • the following zones only talk to Nexus or other instances of themselves and so can be updated in any order before anything else: internal DNS, external DNS, CockroachDB, Crucible, Oximeter, Clickhouse (all), Pantry
  • the host OS and switch zone are more complicated
  • Nexus ought to be last

@hawkw

hawkw commented Mar 20, 2025

Copy link
Copy Markdown
Member

wow, that is really some of the graphviz of all time...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants