decide whether each API should be server- or client-side-versioned by davepacheco · Pull Request #7138 · oxidecomputer/omicron

davepacheco · 2024-11-22T00:54:49Z

This PR:

Extends the existing Dropshot/Progenitor API metadata to identify for each API whether it will be versioned on the server-side exclusively or via client-side versioning as well. (I need to write an RFD on this but for a summary of what I mean see this deck).
Adds a new ~~cargo xtask ls-apis dag-check~~ cargo xtask ls-apis check tool that:
- checks the validity of the versioning choices we've made in the metadata (more on this below)
- proposes changes (this may never be useful again but it was very helpful for me to make these choices and I thought it'd be worth keeping the code around)
updates the cargo xtask ls-apis adoc command to include this information in the summary table. The result is that the table in our wiki will include this information. See oxidecomputer/meta#524 for what this looks like. (I will land that PR after this one lands.)

(note: this was called dag-check when I wrote the rest of this but it's now called check)

Copying a summary from the README, after this PR:

All Dropshot/Progenitor APIs in the system are identified with this tool (as far as we know).

Every API is annotated with metadata in api-manifest.toml.

This metadata specifies whether versioning for the API is managed on the server side exclusively or using client-side versioning as well. We've made this choice for every existing API.

There are no cycles in the server-side-managed-API dependency graph.

This tool verifies these properties.

If I haven't messed this up too badly, this is a pretty big step because:

If you add a new API or a new API dependency without adding corresponding metadata, or if that metadata breaks these constraints, you will have to fix this up before you can land the change to Omicron.

I hope this will help keep us from accidentally introducing new circular dependencies.

Goals

My main goals here, in order:

De-risk this whole API versioning plan by figuring out if there is a plausible update DAG. This involves making the decisions about what components (or dependencies) are in the DAG.
Record these decisions durably in the metadata.
Include the tooling that was used to help make these decisions.
Have the tooling verify that the choices are all valid (i.e., that we didn't create an update graph with cycles).

How I made the choice for each API

Where possible, we want APIs to be server-side-versioned only so that clients don't have to think about this. See the deck I linked above for more on this.

Initially, I had all APIs marked as versioned_how = "unknown", built the check in dag-check that the "server-side" graph be acyclic, and then built some facilities in dag-check for it to propose that an API be client- or server-managed based on heuristics like "if this API has no deployed consumers, then it can be server-side only" and "if two components depend directly on each other, one of them must be client-side-managed".

I also started with the assumption that we'll want APIs that Nexus talks to to be server-side-versioned. That's because control generally flows from Nexus to the other services. It'd be pretty painful if Nexus had to have multiple client versions for every service it talked to. Relatedly, the assumption is that the APIs from Nexus -> other services are generally more complex and volatile than in the other direction.

With this, I iteratively looked at the proposals, checked the visual DAG (from cargo xtask ls-apis servers --output-format=dot) and made some judgment calls.

The results

On success, the tool prints out a summary of how our APIs are versioned:

$ cargo xtask ls-apis dag-check
...
Server-managed APIs:

    Clickhouse Cluster Admin for Keepers (clickhouse-admin-keeper-client, exposed by omicron-clickhouse-admin)
    Clickhouse Cluster Admin for Servers (clickhouse-admin-server-client, exposed by omicron-clickhouse-admin)
    Clickhouse Single-Node Cluster Admin (clickhouse-admin-single-client, exposed by omicron-clickhouse-admin)
    CockroachDB Cluster Admin (cockroach-admin-client, exposed by omicron-cockroach-admin)
    Crucible Agent (crucible-agent-client, exposed by crucible-agent)
    Crucible Control (for testing only) (crucible-control-client, exposed by propolis-server)
    Crucible Pantry (crucible-pantry-client, exposed by crucible-pantry)
    Maghemite DDM Admin (ddm-admin-client, exposed by ddmd)
    DNS Server (dns-service-client, exposed by dns-server)
    Management Gateway Service (gateway-client, exposed by omicron-gateway)
    Maghemite MG Admin (mg-admin-client, exposed by mgd)
    External API (oxide-client, exposed by omicron-nexus)
    Oximeter (oximeter-client, exposed by oximeter-collector)
    Propolis (propolis-client, exposed by propolis-server)
    Sled Agent (sled-agent-client, exposed by omicron-sled-agent)
    Wicketd (wicketd-client, exposed by wicketd)


Client-managed API:

    Bootstrap Agent (bootstrap-agent-client, exposed by omicron-sled-agent)
        reason: depends on itself (i.e., instances call each other)
    Dendrite DPD (dpd-client, exposed by dpd)
        reason: Sled Agent calling DPD creates a circular dependency
    Wicketd Installinator (installinator-client, exposed by wicketd)
        reason: client is provided implicitly by the operator
    Nexus Internal API (nexus-client, exposed by omicron-nexus)
        reason: Circular dependencies between Nexus and other services
    Crucible Repair (repair-client, exposed by crucible-downstairs)
        reason: depends on itself (i.e., instances call each other)
    Repo Depot API (repo-depot-client, exposed by omicron-sled-agent)
        reason: depends on itself (i.e., instances call each other)


APIs with unknown version management: none

In terms of the client-side-managed APIs:

Bootstrap Agent, Crucible Repair, and Repo Depot: the component exposing these APIs necessarily talk to each other as peers, so they're necessarily circular and must be client-managed
I marked Installinator API as client-managed because the only client is Installinator itself, and that's actually provided at runtime by the operator. We don't control that. If we were ever to rev this API, we'd need a way to validate that the Installinator in the Recovery OS image in the TUF repo that the operator uploaded to Wicket is compatible with the Installinator API in Wicket. In practice we're very unlikely to rev this any time soon.
DPD currently has a circular dependency with Sled Agent so one of them has to be client-side-managed. I chose DPD rather than sled agent because the Sled Agent API is pretty complex and changes quite a lot.
The Nexus internal API is circular with several other APIs and we've long said we wanted this direction to be client-side managed (explained above).

So all of that seems pretty reasonable and everything else was able to be server-side-managed.

Out of curiosity, I did this quick check to see how volatile each API was: in each of the Omicron, Dendrite, Maghemite, Propolis, and Crucible repos, I ran this command to count how many times each of the APIs has been changed in any way in 2024 (almost 11 months):

$ for x in openapi/*.json; do count="$(git log --follow --pretty=%ai $x | grep -c ^2024)"; printf "%3d %s\n" count $x; done

Putting the results from all repos together and sorting them:

  3 openapi/cockroach-admin.json
  3 openapi/crucible-control.json
  3 openapi/crucible-pantry.json
  3 openapi/ddm-admin.json
  3 openapi/dns-server.json
  3 openapi/dsc-control.json
  3 openapi/installinator.json
  3 openapi/repo-depot.json
  4 openapi/clickhouse-admin-server.json
  4 openapi/clickhouse-admin-single.json
  4 openapi/crucible-agent.json
  6 openapi/downstairs-repair.json
  7 openapi/gateway.json
  7 openapi/oximeter.json
 10 openapi/mg-admin.json
 11 openapi/dpd.json
 11 openapi/propolis-server.json
 12 openapi/clickhouse-admin-keeper.json
 22 openapi/bootstrap-agent.json
 24 openapi/wicketd.json
 62 openapi/sled-agent.json
 83 openapi/nexus-internal.json
 87 openapi/nexus.json

The volatility of the Nexus-internal API, which is unfortunately client-side-managed, makes one wonder if we can rework things so that fewer things need to invoke it (e.g., use long polls in the other direction, or use much simpler APIs that just cause Nexus to reach back out to the other service for details? or in the case of Oximeter, eliminate the use of the API altogether as we've discussed?).

Caveats

This is still a little messy. For example, the metadata has ways to mark both an edge (dependency) and a particular API as client-side or server-side managed.

That's because the tool already supported the edge annotations, but while doing this, I decided it would be way too tedious to figure this out if I had to annotate every edge, when for the vast majority of APIs it's going to be one or the other. We should clean that up in the long term but I think it's a fair bit of work and doesn't help de-risk this plan.

I think there are a few APIs where we might want this choice to be per-edge, not per-API. For example: DPD's API is marked client-managed because of its circular dependency with Sled Agent. That may be unavoidable between those two because there's no way to update DPD before all sled agents, nor to update all sled agents before updating DPD (since both are contained in the host OS deployment unit). However, it should be easy to make sure Nexus (and maybe other DPD consumers) are updated after DPD. And that'd be very valuable. So we may want to say that some of these edges are client-side-managed while some of the others are server-side-managed.

sunshowers

Looks great! Just a few comments, nothing major.

sunshowers · 2024-12-12T22:33:19Z

        ptime -m cargo xtask ls-apis deployment-units &&
-        ptime -m cargo xtask ls-apis servers
+        ptime -m cargo xtask ls-apis servers &&
+        ptime -m cargo xtask ls-apis dag-check


Super nit:

cargo xtask ls-apis dag-check

Could this just be check, or check-dag? I personally like anything that does a check have its name start with check.

Okay, I clearly overthought this. I'm trying to resolve this (useful!) guideline with a couple of other rules/guidelines I use:

I usually use the subject-verb[-object] naming convention (hence "dag-check"), both in code and shell command names. This is pretty arbitrary but I do find it results in more discoverable code/commands because related things are grouped together. This would be a bigger deal if we ever add other dag subcommands (i.e., I think dag-check and dag-display is much better than check-dag and display-dag, and also lends itself to dag check / dag display).

I follow clap's default of having the Rust struct/function names match the command names so it's easy to find the source for a command.

So if we change this to check-dag: (1) this gets awkward if we add other dag commands later, and (2) either we change all the Rust names also or we don't, but either way, they diverge from some of these conventions.

Like I said, I'm overthinking this.

But what if we just call this cargo xtask ls-apis check on the outside? I'd have the command (the names in src/bin/ls-apis.rs) call it "check" but the library interface would still call it dag_check()/DagCheck. If we add other checks in the future, we'd probably want them done by ls-apis check anyway.

I went ahead and did this in 26bbf1c.

sounds good!

davepacheco · 2025-03-20T21:27:17Z

I neglected to actually write down the most important results here: the sequence. Running this tool now (using cargo xtask ls-apis deployment-units --output-format dot > input.dot and then dot -T png -o output.png input.dot), I get this:

This shows that:

the following zones only talk to Nexus or other instances of themselves and so can be updated in any order before anything else: internal DNS, external DNS, CockroachDB, Crucible, Oximeter, Clickhouse (all), Pantry
the host OS and switch zone are more complicated
Nexus ought to be last

hawkw · 2025-03-20T21:40:41Z

wow, that is really some of the graphviz of all time...

davepacheco added 12 commits November 21, 2024 15:06

I'm reasonably confident in the versioned_how determinations made so far

45967c5

make some more choices

6b4c172

client-versioning should require a documented reason

7fe2514

print a summary of results

cefcd97

print versioning choice in adoc

8bad6cc

commonize some code

10a3161

use accessors

835717f

clean up, document

ca20b33

verify the non-DAG rules are consistent with the API properties

428cabf

readme edits

ec8f528

clippy

35ff129

fix test

af10710

andrewjstone self-requested a review December 12, 2024 22:30

sunshowers reviewed Dec 12, 2024

View reviewed changes

davepacheco added 5 commits December 13, 2024 11:23

Merge branch 'main' into dap/api-dag

f635ec7

remove unnecessary default

c5bd748

comment propose_upick better

0326794

better document DagCheck

475075a

dag-check -> check

26bbf1c

davepacheco mentioned this pull request Dec 13, 2024

test failed in CI: test_multiple_disks_multiple_snapshots_order_2 #7249

Closed

sunshowers approved these changes Dec 13, 2024

View reviewed changes

Merge branch 'main' into dap/api-dag

b3cff11

davepacheco merged commit 3449a64 into main Dec 14, 2024

davepacheco deleted the dap/api-dag branch December 14, 2024 20:14

davepacheco mentioned this pull request Mar 20, 2025

dpd can be server-side-versioned #7848

Merged

davepacheco mentioned this pull request Jun 12, 2025

ls-apis: mention related RFDs and sections #8324

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

decide whether each API should be server- or client-side-versioned#7138

decide whether each API should be server- or client-side-versioned#7138
davepacheco merged 18 commits into
mainfrom
dap/api-dag

davepacheco commented Nov 22, 2024 •

edited

Loading

Uh oh!

sunshowers left a comment

Uh oh!

sunshowers Dec 12, 2024

Uh oh!

davepacheco Dec 13, 2024

Uh oh!

davepacheco Dec 13, 2024

Uh oh!

sunshowers Dec 13, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

davepacheco commented Mar 20, 2025

Uh oh!

hawkw commented Mar 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

davepacheco commented Nov 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Goals

How I made the choice for each API

The results

Caveats

Uh oh!

sunshowers left a comment

Choose a reason for hiding this comment

Uh oh!

sunshowers Dec 12, 2024

Choose a reason for hiding this comment

Uh oh!

davepacheco Dec 13, 2024

Choose a reason for hiding this comment

Uh oh!

davepacheco Dec 13, 2024

Choose a reason for hiding this comment

Uh oh!

sunshowers Dec 13, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

davepacheco commented Mar 20, 2025

Uh oh!

hawkw commented Mar 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davepacheco commented Nov 22, 2024 •

edited

Loading