# Design & RFC: orchestrator on raft #175
## Objective
Cross-DC `orchestrator` deployment with consensus for failovers, mitigating fencing scenarios.

Secondary (optional) objective: remove the MySQL backend dependency.
## Current status
At this time (release ), `orchestrator` nodes use a shared backend MySQL server for communication & leader election.
The high availability of the orchestrator setup is composed of:
- Multiple `orchestrator` service nodes
- MySQL backend high availability
The former is easily achieved by running more nodes, all of which connect to the backend database.
The latter is achieved via:
1. Circular Master-Master replication
   - For example, one MySQL master and one `orchestrator` service in `dc1`, one MySQL master and one `orchestrator` service in `dc2`
   - This setup is susceptible to fencing and split brains. A network partition between `dc1` and `dc2` may cause both `orchestrator` services to consider themselves leaders. Such a network partition is likely to also break the cross-DC replication that `orchestrator` is meant to tackle in the first place, leading to independent failovers by both `orchestrator` services and to potential chaos.
2. Galera/XtraDB/InnoDB Cluster setup
- This is a consensus setup, where quorum is required to apply changes.
- Typically recommended as an intra-DC setup.
- A setup would have, for example:
  - `3` Galera nodes, in multi-writer mode (each node is writable)
  - `3` `orchestrator` nodes
  - Each `orchestrator` node connects to a specific Galera node
  - Leadership achieved by way of Galera consensus
- A potential setup is to run Galera cross-DC: one Galera node in each DC
  - In such a setup, the network partitioning of a single DC (say `dc1`) still leaves `2` connected Galera nodes, making a quorum. By nature of `orchestrator` leader election, the `orchestrator` service in `dc1` cannot become the leader. One of the other two will.
- Depending on the size of your clusters and on probing frequency, there may be dozens, hundreds or more statements per second passing between the Galera nodes.
## A different use case; issues with current design
With the existing design, one `orchestrator` node is the leader, and only the leader discovers and probes MySQL servers. There is no sense in having multiple probing `orchestrator` nodes because they all use and write to the same backend DB.

By virtue of this design, only one `orchestrator` is running failure detection. There is no sense in having multiple `orchestrator` nodes run failure detection because they all rely on the exact same dataset.

`orchestrator` uses a holistic approach to detecting failure (e.g. in order to declare master failure it consults replicas to confirm they think their master is broken, too). However, this detection only runs from a single node, and is hence susceptible to network partitioning / fencing.
If the orchestrator leader runs on dc1, and dc1 happens to be partitioned away, the leader cannot handle failover to servers in dc2.
The cross-DC Galera layout, suggested above, can solve this case, since the isolated orchestrator node will never be the active node.
We have a use case where we not only don't want to rely on Galera, we also don't even want to rely on MySQL. We want to have a more lightweight, simpler deployment without the hassle of extra databases.
Our specific use case motivated the new design offered in this issue, bottom-to-top; but let's now present the offered design top-to-bottom.
## orchestrator/raft design
The orchestrator/raft design suggests:
- `orchestrator` nodes to communicate via `raft`
- Each `orchestrator` node to have its own private database backend server
  - This does not necessarily have to be MySQL; we actually have a POC for SQLite as backend DB.
- Backend databases are completely independent and are unaware of each other. Each `orchestrator` node is to handle its own private DB. There is no replication. There is no DB high availability setup.
- All `orchestrator` nodes run independent detections (each is probing the MySQL servers independently, and typically they should all end up seeing the same picture).
- `orchestrator` nodes to communicate between themselves (via `raft`) changes that are not detected by probing the MySQL servers.
- All `orchestrator` nodes run failure detection. They each have their own dataset to analyze.
- Only one `orchestrator` is the leader (decided by `raft` consensus).
- Only the leader runs recoveries.
- The leader may consult with other `orchestrator` nodes, getting a quorum approval for "do you all agree there's a failure case on this server?"
Noteworthy is that cross-`orchestrator` communication is sparse; health messages will run once per second, and other than that the messages will be mostly user-initiated input, such as begin-downtime or recovery steps etc. See breakdown further below.
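The "quorum approval" step mentioned above can be sketched as a simple majority count over each node's own failure analysis. The function and vote-map names below are hypothetical illustration, not orchestrator's actual API:

```go
package main

import "fmt"

// quorumAgrees sketches the "do you all agree there's a failure case on this
// server?" step: the leader collects each group member's own verdict for a
// given server and checks for strict-majority agreement. The node names and
// the vote map are illustrative assumptions, not orchestrator's API.
func quorumAgrees(votes map[string]bool) bool {
	agree := 0
	for _, sawFailure := range votes {
		if sawFailure {
			agree++
		}
	}
	// A strict majority of all group members must agree.
	return agree > len(votes)/2
}

func main() {
	// dc1 is partitioned away and does not see the failure;
	// dc2 and dc3 both do, making a 2-of-3 quorum.
	votes := map[string]bool{"orc-dc1": false, "orc-dc2": true, "orc-dc3": true}
	fmt.Println(quorumAgrees(votes)) // prints "true"
}
```

Since each node analyzes its own private dataset, this vote is a genuinely independent opinion, unlike the shared-backend design where every node would reach the same conclusion from the same data.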
## Implications
- Since each `orchestrator` node has its own private backend DB, there's no need to sync the databases. There is no replication. Each `orchestrator` node is responsible for maintaining its own DB.
- There is no specific requirement for MySQL. In fact, there's a POC running on SQLite.
  - SQLite is embedded in the `orchestrator` binary
  - A single DB file is required, as opposed to a full blown MySQL deployment
- We get failure detection quorum. As illustrated above, multiple independent `orchestrator` nodes will each run failure analysis.
  - If we run each `orchestrator` node in its own DC, then we have fought fencing:
    - If one DC is partitioned away, the `orchestrator` node in that DC is isolated, hence cannot be the leader
    - The other `orchestrator` nodes will agree on the type of failure; they will make for a quorum. One of them will be the leader, and it will kick off the failover.
## Is this a simpler or a more complex setup?
An orchestrator/raft/sqlite setup would be a simpler setup, one which does not involve provisioning MySQL servers. One would need to configure `orchestrator` with the `raft` node identities, and `orchestrator` will take it from there.

An orchestrator/raft/mysql setup is naturally more complex than orchestrator/raft/sqlite; however:

- You don't need to maintain a MySQL replication setup for the `orchestrator` backend
- You don't need to maintain a MySQL failover setup for the `orchestrator` backend
## Implementation notes
- All group members will run independent discovery, and so general discovery information doesn't need to be passed between `orchestrator` nodes.
- The following information will need to pass between `orchestrator` nodes as group messages:
  - `begin-downtime` (but can be discarded if end of downtime is already in the past)
  - `end-downtime`
  - `begin-maintenance` (but can be discarded if end of maintenance is already in the past)
  - `end-maintenance`
  - `forget`
  - `discover`, so that completely new instances can be shared with all `orchestrator` nodes
    - We may potentially run a periodic full sync of instance discovery from leader to followers. This would either follow a checksum comparison or be blindly imposed. It can be quite the overhead for very large setups, which is why I'd prefer incremental, comparison-based messaging.
  - `submit-pool-instances`, user generated info mapping instances to a pool
  - `register-candidate`, a user-based instruction to flag instances with promotion rules
  - Node health (but can be discarded if the health declaration timestamp is too old)
  - Failure detection (so that we can get, if we choose to, a quorum opinion on the state of the failure)
  - Recovery (so that all nodes have both recovery history as well as the info needed for anti-flapping)
    - Including recovery step audit
  - `ack-cluster-recoveries`
  - `register-hostname-unresolve`
  - `deregister-hostname-unresolve`
- Easiest setup would be to load balance `orchestrator` nodes behind a proxy (e.g. `haproxy`), such that the proxy would only direct traffic to the active (leader) node.
  - Implied from this setup: group messages will be sent from the leader (who will naturally accept all incoming requests) to the followers.
  - `isAuthorizedForAction` can be extended to reject requests on a follower.
  - This setup makes for an easy introduction of orchestrator/raft -- I'm going to take this approach at first.
  - Also implied is that we must always talk to the leader
  - The followers will reject write requests
- We may not use the `orchestrator` CLI, because that would directly talk to the DB instead of going through `raft`.
  - For read-only operations, this may still be OK. Running locally on a box where the `orchestrator` service is running, it is OK to open the SQLite DB file from another process (the `orchestrator` CLI) and read from it.
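The proxy setup described above could look roughly like the following `haproxy` fragment, using the `/api/leader-check` endpoint as the health check. Hostnames, ports and section names are illustrative assumptions, not a recommended production config:

```
# Hypothetical haproxy sketch: route all traffic to whichever orchestrator
# node currently answers the leader check with HTTP 200.
frontend orchestrator-frontend
  bind *:80
  default_backend orchestrator-backend

backend orchestrator-backend
  option httpchk GET /api/leader-check
  # Only the leader returns 200 OK; followers return 404 and are marked down,
  # so the proxy only ever directs traffic to the active (leader) node.
  server orc0 orchestrator-node-0:3000 check
  server orc1 orchestrator-node-1:3000 check
  server orc2 orchestrator-node-2:3000 check
```

On leadership change, the old leader's health check starts failing and the new leader's starts passing, so traffic follows the raft leader automatically.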
- Add `/api/leader-check`, to be used by load balancers. The leader will return `200 OK` and followers will return `404 Not Found`.
- We'd need to be able to bootstrap an empty DB node joining a quorum. For example, if one node out of `3` completely burns, there's still a quorum, but adding a newly provisioned node requires loading of all the data.
  - We can require the user to do so:
    - With `SQLite`, copy+paste the DB file
    - With `MySQL`, dump + import the `orchestrator` schema
  - I think this makes sense for starters.
cc @github/database-infrastructure @dbussink