# Design & RFC: orchestrator on raft #175
## Objective
Cross-DC `orchestrator` deployment with consensus for failovers, mitigating fencing scenarios.

Secondary (optional) objective: remove the MySQL backend dependency.
## Current status
At this time (release ), `orchestrator` nodes use a shared backend MySQL server for communication & leader election.
The high availability of the orchestrator setup is composed of:
- Multiple `orchestrator` service nodes
- MySQL backend high availability
The former is easily achieved by running more nodes, all of which connect to the backend database.
The latter is achieved via:
1. Circular Master-Master replication
   - For example, one MySQL master and one `orchestrator` service in `dc1`, one MySQL master and one `orchestrator` service in `dc2`
   - This setup is susceptible to fencing and split brains. A network partition between `dc1` and `dc2` may cause both `orchestrator` services to consider themselves leaders. Such a network partition is likely to also break the cross-DC replication that `orchestrator` is meant to tackle in the first place, leading to independent failovers by both `orchestrator` services and to potential chaos.
2. Galera/XtraDB/InnoDB Cluster setup
- This is a consensus setup, where quorum is required to apply changes.
- Typically recommended as an intra-DC setup.
- A setup would have, for example:
  - `3` Galera nodes, in multi-writer mode (each node is writable)
  - `3` `orchestrator` nodes
  - Each `orchestrator` node connects to a specific Galera node
  - Leadership achieved by way of Galera consensus
- A potential setup is to run Galera cross-DC: one Galera node in each DC
  - In such a setup, the network partitioning of a single DC (say `dc1`) still leaves `2` connected Galera nodes, making a quorum. By nature of `orchestrator` leader election, the `orchestrator` service in `dc1` cannot become the leader. One of the other two will.
- Depending on the size of your clusters and on probing frequency, there may be dozens, hundreds or more statements per second passing between the Galera nodes.
## A different use case; issues with current design
With the existing design, one `orchestrator` node is the leader, and only the leader discovers and probes MySQL servers. There is no sense in having multiple probing `orchestrator` nodes because they all use and write to the same backend DB.

By virtue of this design, only one `orchestrator` is running failure detection. There is no sense in having multiple `orchestrator` nodes run failure detection because they all rely on the exact same dataset.

`orchestrator` uses a holistic approach to detecting failure (e.g. in order to declare master failure it consults replicas to confirm they think their master is broken, too). However, this detection only runs from a single node, and is hence susceptible to network partitioning / fencing.
If the orchestrator leader runs on dc1, and dc1 happens to be partitioned away, the leader cannot handle failover to servers in dc2.
The cross-DC Galera layout, suggested above, can solve this case, since the isolated orchestrator node will never be the active node.
We have a use case where we not only don't want to rely on Galera, we also don't even want to rely on MySQL. We want to have a more lightweight, simpler deployment without the hassle of extra databases.
Our specific use case motivated the new design offered in this issue, bottom-to-top; but let's now present the offered design top-to-bottom.
## orchestrator/raft design
The orchestrator/raft design suggests:
- `orchestrator` nodes to communicate via `raft`
- Each `orchestrator` node to have its own private database backend server
  - This does not necessarily have to be MySQL; we actually have a POC for SQLite as backend DB.
- Backend databases are completely independent and are unaware of each other. Each `orchestrator` node is to handle its own private DB. There is no replication. There is no DB high availability setup.
- All `orchestrator` nodes run independent detections (each is probing the MySQL servers independently, and typically they should all end up seeing the same picture).
- `orchestrator` nodes to communicate between themselves (via `raft`) changes that are not detected by probing the MySQL servers.
- All `orchestrator` nodes run failure detection. They each have their own dataset to analyze.
- Only one `orchestrator` is the leader (decided by `raft` consensus).
- Only the leader runs recoveries.
- The leader may consult with other `orchestrator` nodes, getting a quorum approval for "do you all agree there's a failure case on this server?"
Noteworthy is that cross-`orchestrator` communication is sparse; health messages will run once per second, and other than that the messages will be mostly user-initiated input, such as begin-downtime or recovery steps etc. See breakdown further below.
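The "quorum approval" step mentioned above can be sketched as a simple majority count over each node's own failure analysis. The function and vote-map names below are hypothetical illustration, not orchestrator's actual API:

```go
package main

import "fmt"

// quorumAgrees sketches the "do you all agree there's a failure case on this
// server?" step: the leader collects each group member's own verdict for a
// given server and checks for strict-majority agreement. The node names and
// the vote map are illustrative assumptions, not orchestrator's API.
func quorumAgrees(votes map[string]bool) bool {
	agree := 0
	for _, sawFailure := range votes {
		if sawFailure {
			agree++
		}
	}
	// A strict majority of all group members must agree.
	return agree > len(votes)/2
}

func main() {
	// dc1 is partitioned away and does not see the failure;
	// dc2 and dc3 both do, making a 2-of-3 quorum.
	votes := map[string]bool{"orc-dc1": false, "orc-dc2": true, "orc-dc3": true}
	fmt.Println(quorumAgrees(votes)) // prints "true"
}
```

Since each node analyzes its own private dataset, this vote is a genuinely independent opinion, unlike the shared-backend design where every node would reach the same conclusion from the same data.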
## Implications
- Since each `orchestrator` node has its own private backend DB, there's no need to sync the databases. There is no replication. Each `orchestrator` node is responsible for maintaining its own DB.
- There is no specific requirement for MySQL. In fact, there's a POC running on SQLite.
  - SQLite is embedded in the `orchestrator` binary
  - A single DB file is required, as opposed to a full blown MySQL deployment
- We get failure detection quorum. As illustrated above, multiple independent `orchestrator` nodes will each run failure analysis.
  - If we run each `orchestrator` node in its own DC, then we have fought fencing:
    - If one DC is partitioned away, the `orchestrator` node in that DC is isolated, hence cannot be the leader
    - The other `orchestrator` nodes will agree on the type of failure; they will make for a quorum. One of them will be the leader, and it will kick off the failover.
## Is this a simpler or a more complex setup?
An orchestrator/raft/sqlite setup would be a simpler setup, one which does not involve provisioning MySQL servers. One would need to configure `orchestrator` with the `raft` node identities, and `orchestrator` will take it from there.

An orchestrator/raft/mysql setup is naturally more complex than orchestrator/raft/sqlite; however:

- You don't need to maintain a MySQL replication setup for the `orchestrator` backend
- You don't need to maintain a MySQL failover setup for the `orchestrator` backend
## Implementation notes
- All group members will run independent discovery, and so general discovery information doesn't need to be passed between `orchestrator` nodes.
- The following information will need to pass between `orchestrator` nodes as group messages:
  - `begin-downtime` (but can be discarded if end of downtime is already in the past)
  - `end-downtime`
  - `begin-maintenance` (but can be discarded if end of maintenance is already in the past)
  - `end-maintenance`
  - `forget`
  - `discover`, so that completely new instances can be shared with all `orchestrator` nodes
    - We may potentially run a periodic full sync of instance discovery from leader to followers. This would either follow a checksum comparison or be blindly imposed. It can be quite the overhead for very large setups, which is why I'd prefer incremental, comparison-based messaging.
  - `submit-pool-instances`, user generated info mapping instances to a pool
  - `register-candidate`, a user-based instruction to flag instances with promotion rules
  - Node health (but can be discarded if the health declaration timestamp is too old)
  - Failure detection (so that we can get, if we choose to, a quorum opinion on the state of the failure)
  - Recovery (so that all nodes have both recovery history as well as the info needed for anti-flapping)
    - Including recovery step audit
  - `ack-cluster-recoveries`
  - `register-hostname-unresolve`
  - `deregister-hostname-unresolve`
- Easiest setup would be to load balance `orchestrator` nodes behind a proxy (e.g. `haproxy`), such that the proxy would only direct traffic to the active (leader) node.
  - Implied from this setup: group messages will be sent from the leader (who will naturally accept all incoming requests) to the followers.
  - `isAuthorizedForAction` can be extended to reject requests on a follower.
  - This setup makes for an easy introduction of orchestrator/raft -- I'm going to take this approach at first.
  - Also implied is that we must always talk to the leader
  - The followers will reject write requests
- We may not use the `orchestrator` CLI, because that would directly talk to the DB instead of going through `raft`.
  - For read-only operations, this may still be OK. Running locally on a box where the `orchestrator` service is running, it is OK to open the SQLite DB file from another process (the `orchestrator` CLI) and read from it.
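The proxy setup described above could look roughly like the following `haproxy` fragment, using the `/api/leader-check` endpoint as the health check. Hostnames, ports and section names are illustrative assumptions, not a recommended production config:

```
# Hypothetical haproxy sketch: route all traffic to whichever orchestrator
# node currently answers the leader check with HTTP 200.
frontend orchestrator-frontend
  bind *:80
  default_backend orchestrator-backend

backend orchestrator-backend
  option httpchk GET /api/leader-check
  # Only the leader returns 200 OK; followers return 404 and are marked down,
  # so the proxy only ever directs traffic to the active (leader) node.
  server orc0 orchestrator-node-0:3000 check
  server orc1 orchestrator-node-1:3000 check
  server orc2 orchestrator-node-2:3000 check
```

On leadership change, the old leader's health check starts failing and the new leader's starts passing, so traffic follows the raft leader automatically.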
- Add `/api/leader-check`, to be used by load balancers. The leader will return `200 OK` and followers will return `404 Not Found`.
- We'd need to be able to bootstrap an empty DB node joining a quorum. For example, if one node out of `3` completely burns, there's still a quorum, but adding a newly provisioned node requires loading of all the data.
  - We can require the user to do so:
    - With `SQLite`, copy+paste the DB file
    - With `MySQL`, dump + import the `orchestrator` schema
  - I think this makes sense for starters.
cc @github/database-infrastructure @dbussink