This repository was archived by the owner on Feb 18, 2025. It is now read-only.

Design & RFC: orchestrator on raft #175

@shlomi-noach


Objective

Cross-DC orchestrator deployment with consensus for failovers, mitigating fencing scenarios.

Secondary (optional) objective: remove the MySQL backend dependency.

Current status

At this time (release ), orchestrator nodes use a shared MySQL backend server for communication & leader election.

The high availability of the orchestrator setup is composed of:

  • Multiple orchestrator service nodes
  • MySQL backend high availability.

The former is easily achieved by running more nodes, all of which connect to the backend database.

The latter is achieved via:

1. Circular Master-Master replication

  • For example, one MySQL master and one orchestrator service in dc1, one MySQL master and one orchestrator service in dc2
  • This setup is susceptible to fencing and split brains. A network partition between dc1 and dc2 may cause both orchestrator services to consider themselves leaders. Such a network partition is likely to also break the cross-DC replication that orchestrator is meant to tackle in the first place, leading both orchestrator services to run independent failovers, with potential chaos.

2. Galera/XtraDB/InnoDB Cluster setup

  • This is a consensus setup, where quorum is required to apply changes.
  • Typically recommended as internal-DC setup.
  • A setup would have, for example:
    • 3 Galera nodes, in multi-writer mode (each node is writable)
    • 3 orchestrator nodes
    • Each orchestrator node connects to a specific Galera node
    • Leadership achieved by way of Galera consensus
  • A potential setup is to run Galera cross-DC: one Galera node in each DC
    • In such a setup, a network partition of a single DC (say dc1) still leaves 2 connected Galera nodes, making a quorum. By nature of orchestrator leader election, the orchestrator service in dc1 cannot become the leader; one of the other two will.
    • Depending on the size of your clusters and on probing frequency, there may be dozens, hundreds or more statements per sec passing between the Galera nodes.

A different use case; issues with current design

With the existing design, one orchestrator node is the leader, and only the leader discovers and probes MySQL servers. There is no sense in having multiple probing orchestrator nodes because they all use and write to the same backend DB.

By virtue of this design, only one orchestrator node runs failure detection. There is no sense in having multiple orchestrator nodes run failure detection because they all rely on the exact same dataset.

orchestrator uses a holistic approach to detecting failure (e.g. in order to declare a master failure it consults the replicas to confirm that they, too, think their master is broken). However, this detection only runs from a single node, and is hence susceptible to network partitioning / fencing.

If the orchestrator leader runs on dc1, and dc1 happens to be partitioned away, the leader cannot handle failover to servers in dc2.

The cross-DC Galera layout, suggested above, can solve this case, since the isolated orchestrator node will never be the active node.

We have a use case where we not only don't want to rely on Galera, we also don't even want to rely on MySQL. We want to have a more lightweight, simpler deployment without the hassle of extra databases.

Our specific use case gave rise to the design offered in this issue, bottom-to-top; but let's now observe the offered design top-to-bottom.

orchestrator/raft design

The orchestrator/raft design suggests:

  • orchestrator nodes to communicate via raft
  • Each orchestrator node to have its own private database backend server
    • This does not necessarily have to be MySQL; we actually have a POC with SQLite as the backend DB.
  • Backend databases are completely independent and are unaware of each other. Each orchestrator node is to handle its own private DB. There is no replication. There is no DB High Availability setup.
  • All orchestrator nodes run independent detections (each is probing the MySQL servers independently, and typically they should all end up seeing the same picture).
  • orchestrator nodes to communicate between themselves (via raft) changes that are not detected by probing the MySQL servers.
  • All orchestrator nodes run failure detection. They each have their own dataset to analyze.
  • Only one orchestrator is the leader (decided by raft consensus)
  • Only the leader runs recoveries
  • The leader may consult with other orchestrator nodes, getting a quorum approval for "do you all agree there's a failure case on this server?"

Noteworthy is that cross-orchestrator communication is sparse; health messages will run once per second, and other than that the messages will be mostly user-initiated input, such as begin-downtime or recovery steps etc. See breakdown further below.
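
To make the leader/follower split above concrete, here is a minimal Go sketch. The `Node` type and `RunRecoveryIfLeader` are hypothetical names for illustration, not orchestrator's actual API: every node runs its own detection, but only the raft leader proceeds to recovery.

```go
package main

import "fmt"

// Node models an orchestrator/raft member. Hypothetical type, for
// illustration only; not the real orchestrator code.
type Node struct {
	name     string
	isLeader bool
}

// RunRecoveryIfLeader: all nodes run detection on their own dataset,
// but only the raft leader acts on a detected failure.
func (n *Node) RunRecoveryIfLeader(failureDetected bool) string {
	if !failureDetected {
		return "no-op"
	}
	if !n.isLeader {
		// Followers detect the failure too, but defer to the leader.
		return "detected; deferring to leader"
	}
	return "recovering"
}

func main() {
	nodes := []*Node{
		{name: "dc1", isLeader: false},
		{name: "dc2", isLeader: true},
		{name: "dc3", isLeader: false},
	}
	for _, n := range nodes {
		fmt.Printf("%s: %s\n", n.name, n.RunRecoveryIfLeader(true))
	}
}
```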

Implications

  • Since each orchestrator node has its own private backend DB, there's no need to sync the databases. There is no replication. Each orchestrator node is responsible for maintaining its own DB.

  • There is no specific requirement for MySQL. In fact, there's a POC running on SQLite.

    • SQLite is embedded in the orchestrator binary
    • A single DB file is required, as opposed to a full blown MySQL deployment
  • We get failure detection quorum. As illustrated above, multiple independent orchestrator nodes will each run failure analysis.

    • If we run each orchestrator node in its own DC, then we have defeated fencing:
      • If one DC is partitioned away, the orchestrator node in that DC is isolated, hence cannot be the leader
      • The other orchestrator nodes will agree on the type of failure; they will make for a quorum. One of them will be the leader, and will kick off the failover.
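
The quorum arithmetic behind this claim is plain majority voting; a minimal Go sketch (`hasQuorum` is an illustrative helper, not orchestrator code):

```go
package main

import "fmt"

// hasQuorum reports whether a partition side that can reach `reachable`
// of `total` raft members holds a strict majority, and hence can elect
// a leader.
func hasQuorum(reachable, total int) bool {
	return 2*reachable > total
}

func main() {
	// 3 orchestrator/raft nodes, one per DC; dc1 gets partitioned away.
	fmt.Println(hasQuorum(1, 3)) // the isolated dc1 node: false
	fmt.Println(hasQuorum(2, 3)) // the dc2+dc3 side: true
}
```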

Is this a simpler or a more complex setup?

An orchestrator/raft/sqlite setup would be a simpler setup, which does not involve provisioning MySQL servers. One would only need to configure orchestrator with the raft node identities, and orchestrator will take it from there.
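
For illustration, a node's configuration in such a setup could look roughly like the following. The key names are sketched after orchestrator's configuration style and should be treated as illustrative until the feature lands:

```json
{
  "RaftEnabled": true,
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftBind": "node1.example.com",
  "RaftNodes": ["node1.example.com", "node2.example.com", "node3.example.com"],
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db"
}
```

Each node would carry the same `RaftNodes` list and its own `RaftBind` identity; no MySQL credentials or replication topology would be involved.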

An orchestrator/raft/mysql setup is naturally more complex than orchestrator/raft/sqlite; however, compared with today's shared-backend setup:

  • You don't need to maintain a MySQL replication setup for the orchestrator backend
  • You don't need to maintain a MySQL failover setup for the orchestrator backend

Implementation notes

  • All group members will run independent discovery, and so general discovery information doesn't need to be passed between orchestrator nodes.

  • The following information will need to pass between orchestrator nodes as group messages:

    • begin-downtime (but can be discarded if end of downtime is already in the past)
    • end-downtime
    • begin-maintenance (but can be discarded if end of maintenance is already in the past)
    • end-maintenance
    • forget
    • discover, so that completely new instances can be shared with all orchestrator nodes
      • We may potentially run a periodic full sync of instance discovery from leader to followers. This would either follow a checksum comparison or be blindly imposed. It can be quite the overhead for very large setups, which is why I'd prefer incremental, comparison-based messaging.
    • submit-pool-instances, user generated info mapping instances to a pool
    • register-candidate - a user-based instruction to flag instances with promotion rules
    • node health (but can be discarded if health declaration timestamp is too old)
    • Failure detection (so that we can get, if we choose to, a quorum opinion on the state of the failure)
    • Recovery (so that all nodes have both recovery history as well as the info needed for anti-flapping)
      • including recovery step audit
      • ack-cluster-recoveries
    • register-hostname-unresolve
    • deregister-hostname-unresolve
  • The easiest setup would be to load balance orchestrator nodes behind a proxy (e.g. haproxy), such that the proxy only directs traffic to the active (leader) node.

    • Implied by this setup: group messages will be sent from the leader (who will naturally accept all incoming requests) to the followers.
    • isAuthorizedForAction can be extended to reject requests on a follower.
    • This setup makes for an easy introduction of orchestrator/raft -- I'm going to take this approach at first.
    • Also implied is that we must always talk to the leader
      • The followers will reject write requests
      • We may not use the orchestrator CLI, because it would talk directly to the DB instead of going through raft.
        • For read-only operations, this may still be OK. Running locally on a box where orchestrator service is running, it is OK to open the SQLite DB file from another process (the orchestrator CLI) and read from it.
  • Add /api/leader-check, to be used by load balancers. The leader will return 200 OK and followers will return 404 Not Found.

  • We'd need to be able to bootstrap an empty DB node joining a quorum. For example, if one node out of 3 completely burns, there's still a quorum, but adding a newly provisioned node requires loading of all the data.

    • We can require the user to do so:
      • with SQLite, copy+paste DB file
      • with MySQL, dump + import orchestrator schema
      • I think this makes sense for starters.
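
The /api/leader-check endpoint mentioned above could look roughly like this Go sketch. The `isLeader` hook is a stand-in; a real implementation would query the raft layer, and the port is arbitrary:

```go
package main

import (
	"fmt"
	"net/http"
)

// isLeader is a stand-in for querying this node's raft state.
var isLeader = func() bool { return false }

// leaderCheckStatus returns the HTTP status for /api/leader-check:
// 200 on the leader, 404 on any follower.
func leaderCheckStatus(leader bool) int {
	if leader {
		return http.StatusOK
	}
	return http.StatusNotFound
}

// leaderCheck serves the proposed /api/leader-check endpoint, so that
// a load balancer health check (e.g. haproxy httpchk) only routes
// traffic to the current leader.
func leaderCheck(w http.ResponseWriter, r *http.Request) {
	status := leaderCheckStatus(isLeader())
	w.WriteHeader(status)
	fmt.Fprintln(w, http.StatusText(status))
}

func main() {
	http.HandleFunc("/api/leader-check", leaderCheck)
	// A real service would now block on something like
	// http.ListenAndServe(":3000", nil); here we just show the status
	// a follower would report.
	fmt.Println(leaderCheckStatus(false))
}
```

haproxy would then health-check this endpoint on every node and route client traffic only to the node answering 200.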

cc @github/database-infrastructure @dbussink
