RFC: Orchestrator as a native Vitess component #6612

@sougou

Description

Introduction

With #6582, we are starting on a Vitess-specific fork of the Orchestrator that was originally created by @shlomi-noach.

Previously, Vitess provided an integration to loosely cooperate with an installation of Orchestrator that managed automated failover for the MySQL instances underlying Vitess. However, users often encountered corner cases where the two sides could make conflicting changes, potentially leaving the system in an inconsistent state.

We intend to resolve these issues and provide a smooth and unified experience to the end-user.

Approach

Many approaches were considered:

  1. Build a Vitess-native failover mechanism.
  2. Copy portions of the Orchestrator code into Vitess.
  3. Extend Orchestrator to allow tighter integration and cooperation with Vitess.
  4. Modify Orchestrator to become a first-class Vitess component.

Of the above approaches, we have decided on option 4. We reached this conclusion after studying the Orchestrator code, and realizing that it is written with the intent to allow customization.

Architecturally, the Orchestrator is closest in function to the work performed by vtctld and has some overlapping functionality. For example, stop-replica vs StopReplica, graceful-master-takeover vs PlannedReparentShard.

On the other hand: for massive clusters with >10K nodes, it may be better to divide the Orchestrator's responsibility by keyspace or shard. In that situation, the design starts to deviate from the vtctld model. This requires further brainstorming.

To fork or not to fork

We debated extending the Orchestrator to accommodate Vitess into the existing code base, vs. forking the code. The conclusion was to fork due to the following reasons:

  • Model: Orchestrator’s failover code depends on the discovered replication graph that connects replicas to their master. This model is substantially different from Vitess, where cluster membership is strictly dictated by the keyspace-shard an instance belongs to.
  • Responsibility: Orchestrator was responsible only for performing failovers, while Vitess managed everything else. This separation gave rise to many corner cases, especially when Vitess would wire up replication using a view of the world that differed from what Orchestrator saw. For this integration to work smoothly, the Orchestrator needs to take responsibility for all functions of the cluster.
  • Opinions: Orchestrator takes a hands-off approach, with the intent of accommodating however administrators would like to configure their MySQL instances. In Vitess, the expectation is that a cluster should be mostly self-managed, with only certain critical state transitions performed manually. So, many of the flexibilities that Orchestrator accommodates are not allowed in Vitess.
  • Velocity: Forking lets us move faster, because there is no fear of breaking use cases that Vitess does not support.

End Product

The end product could be a unified component that merges Orchestrator and vtctld. We could either preserve the old vtctld name or give the new binary a brand-new name.

This will extend vtctld’s responsibility to not only execute user commands, but also initiate actions against the cluster to remedy any problems that are automatically solvable.

Also, vtctld is designed to cooperate with other running vtctlds, which is in line with the Orchestrator's philosophy. Although the two use different methods today, they will eventually be unified.

As for the web UI, the Orchestrator has a pleasant and intuitive graphical interface. This will be used as the starting point, with the vtctld functionality adapted around it.

There should be links to vttablets from Orchestrator, similar to how there are links from the vtctld web UI to vttablets today.

Orchestrator also has a more structured CLI interface, which will likely be preferred over the one that vtctld grew organically.

If we decide not to merge with vtctld, we'll have to consider either keeping Orchestrator as a separate component or seeing what it would look like running inside vttablet.

Pluggable Durability

Users have varying tolerances for durability. Additionally, cloud environments define complex topologies; for example, AWS has zones and regions. To meet these requirements, and to future-proof ourselves, we will implement a pluggable durability policy. This will be based on an extension of FlexPaxos.

Essentially, this pluggability will allow us to address all existing known use cases as well as future ones we have not heard of. More details will follow.
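To make the idea concrete, here is a minimal sketch of what a pluggable durability policy could look like. All names (`DurabilityPolicy`, `CrossCellPolicy`, `ackedDurably`, the tablet fields) are illustrative assumptions, not the actual Vitess or Orchestrator API; the example shows a FlexPaxos-style policy that requires an ack from outside the master's cell.

```go
package main

import "fmt"

// Hypothetical sketch; names are illustrative, not the real Vitess API.

// TabletInfo identifies a replica and the cell (e.g. an AWS zone) it runs in.
type TabletInfo struct {
	Alias string
	Cell  string
}

// DurabilityPolicy decides how many (and which) replica acks make a write durable.
type DurabilityPolicy interface {
	// QuorumSize returns how many eligible acks are required.
	QuorumSize(replicas []TabletInfo) int
	// IsEligible reports whether a replica may count toward the quorum.
	IsEligible(t TabletInfo) bool
}

// CrossCellPolicy requires at least one ack from a cell other than the
// master's: a FlexPaxos-style relaxation of a plain majority quorum.
type CrossCellPolicy struct{ MasterCell string }

func (p CrossCellPolicy) QuorumSize(replicas []TabletInfo) int { return 1 }
func (p CrossCellPolicy) IsEligible(t TabletInfo) bool         { return t.Cell != p.MasterCell }

// ackedDurably checks a set of acks against a policy.
func ackedDurably(p DurabilityPolicy, all, acked []TabletInfo) bool {
	n := 0
	for _, t := range acked {
		if p.IsEligible(t) {
			n++
		}
	}
	return n >= p.QuorumSize(all)
}

func main() {
	replicas := []TabletInfo{{"r1", "us-east-1a"}, {"r2", "us-east-1b"}}
	p := CrossCellPolicy{MasterCell: "us-east-1a"}
	fmt.Println(ackedDurably(p, replicas, []TabletInfo{{"r1", "us-east-1a"}})) // false: same cell as master
	fmt.Println(ackedDurably(p, replicas, []TabletInfo{{"r2", "us-east-1b"}})) // true: cross-cell ack
}
```

A different user could supply, say, a majority-quorum policy behind the same interface without touching the failover code, which is the point of making durability pluggable.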

Details

This is a preliminary drill-down of what we think is needed. So, it’s presented as somewhat of a laundry list. We will refine this as we move forward.

MVP

A Minimum Viable Product is being developed. The latest code is currently in https://github.com/planetscale/vitess/tree/ss-oc2-vt-mode. It’s likely to evolve. Those who wish to preview upcoming work can look for branches in the PlanetScale repo that are prefixed with ss-oc. At this point, the MVP has the following functionality:

  • Automatically discover all tablets and MySQL instances by walking the Vitess topology.
  • Assign default rules for discovered items:
      • Promotion rules
      • DataCenters
  • Automatically initialize a brand new cluster (replacing InitShardMaster).
  • When a new vttablet comes up with its MySQL, wire it up to replicate from the current master.
  • GracefulTakeOver performs the same function as PlannedReparentShard.
  • Dead-master recovery, with Orchestrator directly performing the TabletExternallyReparented work. This still does not include everything that EmergencyReparentShard does today, but should be extended.
  • Heal the cluster if tablets and MySQL instances do not agree on who the master is.
  • Ensure masters and replicas have the correct read/write permissions and ensure they are connected correctly.
  • Discovery scans per cell and handles the case of a whole cell being down.
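The per-cell discovery scan above can be sketched as follows. This is an illustrative toy with an in-memory fake topology (`topoByCell`, `discoverAll` are invented names); the real Vitess topo server has a different API. The point it shows is that cells are walked independently, so one unreachable cell does not block discovery of the others.

```go
package main

import "fmt"

// Illustrative only: a fake in-memory topology standing in for the Vitess topo.
type Tablet struct {
	Alias    string
	Keyspace string
	Shard    string
}

// topoByCell maps cell name -> tablets registered in that cell.
// A nil slice simulates a whole cell being unreachable.
type topoByCell map[string][]Tablet

// discoverAll walks every cell independently, so a down cell is skipped
// rather than aborting discovery of the remaining cells.
func discoverAll(topo topoByCell) []Tablet {
	var found []Tablet
	for cell, tablets := range topo {
		if tablets == nil {
			fmt.Printf("cell %s unreachable, skipping\n", cell)
			continue
		}
		found = append(found, tablets...)
	}
	return found
}

func main() {
	topo := topoByCell{
		"zone1": {{"zone1-100", "commerce", "0"}},
		"zone2": nil, // whole cell down
	}
	fmt.Println(len(discoverAll(topo)))
}
```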

The MVP is only a starting point. There is more work that needs to be done:

Code culture

  • Port existing integration tests, make them pass and make them part of CI/CD
  • golint, go vet, etc.: these are currently failing and need to be fixed.
  • Write new tests to meet Vitess coding standards (coverage, etc).
  • Context: Orchestrator does not use Go contexts, but Vitess calls require them. So, we need to make sure we wire up all the TODO contexts.
  • Change to use Vitess code infrastructure: servenv, logging, etc.

Metadata

  • Most of the metadata state internally maintained by Orchestrator can be automatically reconstructed through Vitess’ discovery functionality. However, there are a few settings that need to be persisted, for example states like Maintenance and Downtime Windows. These should be moved into the Vitess topo.

Improvements

  • Use Vitess GTID code locally to compare GTIDs instead of calling into MySQL.
  • Some retries may need an exponential back-off.
  • Use persistent connections to tabletmanager to avoid frequent reconnects.

Changes in behavior

  • DeadMaster: Currently uses file:pos to find the most progressed replica. This will not work if replicas can be detached. This should be changed to use GTID comparison.
  • DeadMaster: Should use all instances of a cluster instead of just the ones in the replication graph.
  • Topology Recovery feeds into Vitess’s event log.
  • Enhanced security: Orchestrator should receive replication credentials from vttablets.

New functionality

  • Ability to mark a cell as down.
  • Subscribe to Tablet streaming health to get real-time updates on changes in tablet states.
  • Let vttablet’s streaming health provide all the MySQL info instead of Orchestrator fetching from two sources?
  • Watch vttablets and reparent on their failure (even if MySQL is up).
  • Support Multi-schema: multiple vttablets pointed at the same MySQL instance.
  • Rewind/backtrack: Detect and fix extraneous GTIDs from recovered masters.
  • Tags: Publish discoverable tags to the vttablets that will allow vtgates to change their routings accordingly; instead of fixed tablet types.
  • Pluggable durability.
  • Go through vttablet for all actions so everything is uniform.
  • Look at moving the recovery logic inside vttablet to better conform to the consensus doc.
