Build a system for gradually rolling out OS updates to Fedora CoreOS machines in a way that can be centrally managed by FCOS developers.
Background: Container Linux
Container Linux updates are rolled out gradually, typically over a 24- to 72-hour period. If a major bug is caught before the rollout is complete, we can suspend the rollout while we investigate.
On CL, rollouts are implemented in CoreUpdate by hashing the machine ID with the new OS version and comparing a few bits against a threshold which increases over time. If a machine automatically checks in but doesn't meet the current threshold, CoreUpdate responds that no update is available. If the user manually initiates an update request, the threshold is ignored and CoreUpdate provides the update.
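The hashing scheme above can be sketched as follows. This is an illustrative reconstruction, not CoreUpdate's actual code: the hash function, bit width, and function names are assumptions, but the key property (each machine gets a stable bucket per version, so raising the threshold only ever adds machines) is the one described.

```python
import hashlib

def update_eligible(machine_id: str, target_version: str, rollout_fraction: float) -> bool:
    """Hash the machine ID together with the target OS version and compare
    a slice of the digest against a threshold that grows over the rollout
    window. Hash choice and bit width here are illustrative."""
    digest = hashlib.sha256(f"{machine_id}:{target_version}".encode()).digest()
    # Interpret the first two bytes as a bucket value in [0, 1).
    bucket = int.from_bytes(digest[:2], "big") / 0x10000
    return bucket < rollout_fraction
```

Because the bucket is a pure function of machine ID and version, a machine that becomes eligible at threshold 0.25 stays eligible at every higher threshold; suspending a rollout simply freezes the threshold.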
Major bugs can be caught in two ways. CoreUpdate has a status reporting mechanism, so we can notice if many machines are failing to update. The status reports are not very granular, however, and thus not very debuggable. More commonly, a user reports a problem to coreos/bugs and we manually triage the issue and pause rollout if the problem appears serious.
For each update group (~= channel), CoreUpdate only knows about one target OS version at any given time. This is awkward for several reasons. If a machine running version A is not yet eligible for a new version C currently being rolled out, or if the rollout of C is paused, the machine should still be able to update to the previous version B, but CoreUpdate doesn't know how to do that. In addition, there's no way to require that machines on versions < P first update to P before updating to any version > P. (We call that functionality an "update barrier".) As a result, compatibility hacks for updating from particular old versions have to be carried in the OS forever, since there can be no guarantee that all older machines have updated.
CoreUpdate collects metrics about each machine that checks in: its update channel, its state in the client state machine, what OS version is running, what version was originally installed, the OEM ID (platform) of the machine, and its checkin history. This works okay but gives us an incomplete picture of the installed base: we do not receive any information about machines behind private CoreUpdate servers, behind a third-party update server such as CoreRoller, or which have updates disabled.
Fedora CoreOS
CoreUpdate, update_engine, and the Omaha protocol will not be used in Fedora CoreOS. A successor update protocol, Cincinnati, is being developed for OpenShift, and it appears that we'll be able to adapt it for Fedora CoreOS. I believe this involves:
- Using the existing Cincinnati wire protocol and server code
- Writing a graph builder to generate an update graph from FCOS release metadata
- Writing an update client that queries a Cincinnati policy engine and invokes rpm-ostree
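A minimal sketch of the client-side graph handling: Cincinnati responses are (as I understand the wire format; treat the field names as assumptions) a JSON document with a `nodes` array of releases and an `edges` array of `[from, to]` index pairs. The client's job is to find its current node, pick a reachable successor, and hand that node's payload reference to rpm-ostree.

```python
def next_updates(graph: dict, current_version: str) -> list[dict]:
    """Given a Cincinnati-style update graph
    ({"nodes": [{"version": ..., "payload": ...}, ...], "edges": [[from, to], ...]}),
    return the nodes reachable in one hop from the node whose version
    matches current_version. Returns [] if the version is unknown or has
    no outgoing edges."""
    nodes = graph["nodes"]
    try:
        cur = next(i for i, n in enumerate(nodes) if n["version"] == current_version)
    except StopIteration:
        return []
    return [nodes[to] for frm, to in graph["edges"] if frm == cur]
```

In a real client the graph would be fetched over HTTP from the policy engine, and the chosen node's payload reference passed to rpm-ostree (the exact rpm-ostree invocation is out of scope here).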
Server requirements
...and nice-to-haves, and reifications of the second-system effect:
- Knows about update streams, versions, etc.
- Updates are rolled out gradually over an administratively-selected period
- Rollouts are automatically paused based on client error reports
- Can be run on-premises
- Can automatically fetch from an upstream server, or receive updates manually in air-gapped environments
- Supports update barriers
- Supports multiple layered rollouts, or at least a last-known-good base version
- When rollouts are paused, can fall back to offering an appropriate older release
- Provides insight into the client state machine
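To make the update-barrier requirement concrete, here is one way a server could filter the targets it offers. This is a hypothetical sketch, not a proposed implementation: version comparison is simplified to dotted integers, and real FCOS version ordering would come from the stream metadata.

```python
def allowed_targets(current: str, candidates: list[str], barriers: list[str]) -> list[str]:
    """Filter candidate target versions so that a machine below a barrier
    version may only update as far as the nearest barrier above it; once
    it has passed every barrier, all candidates are allowed.

    Versions are compared as tuples of integers (e.g. "30.1" -> (30, 1)),
    a simplification for illustration.
    """
    def key(v: str) -> tuple:
        return tuple(int(p) for p in v.split("."))

    pending = [b for b in barriers if key(b) > key(current)]
    if not pending:
        return candidates
    # Machines must pass through the first barrier before going further.
    first_barrier = min(pending, key=key)
    return [v for v in candidates if key(v) <= key(first_barrier)]
```

With a barrier at P, compatibility hacks for pre-P machines can be dropped from releases after P, since every machine is guaranteed to have passed through P on its way up.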
Client requirements
- Requests an update from the server and passes the resulting image reference to rpm-ostree
- Adequate status reporting, to the server and to the user
- Mechanism for forcing an update that overrides rate limits
Metrics
Metrics should be handled by a separate system. Coupling metrics to the update protocol would provide the same sort of incomplete picture as in CoreUpdate, and in any event Cincinnati is not designed to collect client metrics. This probably means that certain of the features above, such as automatic rollout suspension or insight into the client state machine, will need to be handled outside the update protocol.
cc @crawford for corrections and additional details.