
Atomic slot migration HLD #10933

@madolson

High level design

We want to introduce a new Slot Migration process that atomically migrates a set of slots from a source node to a target node.

We will implement a net-new slot serialization strategy that emulates the behavior of BGSAVE but sends a series of RESTORE commands instead of the RDB snapshot serialization format. The strategy is: when migrating a batch of slots, we establish a new connection to the target, fork, and then iterate over the slot space and send the RESTORE commands. Once all the data for a slot has been migrated, the buffer accumulated for that slot is drained. Once the buffer is drained, we block all write commands for the slot while we do a coordinated handoff (this can be upgraded to 2PC for cluster v2) to complete the transaction and make sure the target node takes ownership of the slot. Once the target has confirmed it owns the slot, the source node will asynchronously purge its slot data. This process is repeated for all slots being migrated, one at a time. The forked process is used to migrate a set of slots; once that set of slots has been migrated, the forked process exits.
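
As a rough illustration (the exact framing and command options are not pinned down by this design), the traffic on the migration link for a single slot could look something like the following, with the acknowledgment of ownership arriving separately over the cluster bus rather than on this link:

```
CLUSTER IMPORTSLOT START 1234 <source node ID>        <- link will carry slot 1234
RESTORE <key> <ttl> <serialized value> ...            <- point-in-time data from the fork
RESTORE <key> <ttl> <serialized value> ...
<writes buffered for slot 1234 during the fork>       <- drained after the full sync
CLUSTER IMPORTSLOT END 1234 <source node ID> <keys>   <- handoff; writes to the slot are paused
```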

On the source node there will be a cron job that drives a state machine executing the different phases of the slot migration process. If an individual slot migration fails, it will be rolled back and the overall process will be aborted. Slot migrations can therefore partially succeed, with some slots owned by the target and some owned by the source.

Benefits from this new implementation compared to the current one:

  • Atomic slot migration - a slot's keys are never split between the source and target mid-migration.
  • Multi-key operations work seamlessly while a slot migration is ongoing.
  • No extra hop for client requests.
  • Easy to recover from failures.

New commands

CLUSTER MIGRATESLOTS [NODE <Target node ID>] [SHARD <Target shard ID>] SLOTRANGES start end [start end]

Initiates a slot migration between a source and target node. Optionally accepts a shard ID that can be used instead of a specific node. If a migration fails, it will not be automatically retried; it is up to the caller to determine that the migration has failed and to restart the process.
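
For example, CLUSTER MIGRATESLOTS SHARD <target shard ID> SLOTRANGES 0 99 200 299 would request migration of slots 0 through 99 and 200 through 299 (assuming inclusive ranges) to the primary of the given shard.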

CLUSTER IMPORTSLOT START <slot> <Source node ID>

Control command sent from the source node to the target node to indicate that this link will be used to transfer the ownership of a slot.

CLUSTER IMPORTSLOT END <slot> <source-node-id> [<expected-keys-in-slot>]

Control command indicating that the transfer has completed and that the target should now take ownership of the data. Upon receiving this signal, the target will take ownership of the slot and broadcast the new epoch to all nodes.
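
For example, after streaming slot 1234 the source might send CLUSTER IMPORTSLOT END 1234 <source node ID> 5000, the optional key count presumably allowing the target to sanity-check the transfer.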

New Info fields

slot_migration_role: [source|target]
if role == source
slot_migration_migrating_slots:
slot_migration_state:[see below]
slot_migration_start_time:
slot_migration_keys_processed:N
slot_migration_keys_total:N
slot_migration_perc:X%
slot_migration_last_duration_ms:
slot_migration_last_status:
slot_migration_last_failure_reason:

if role == target
slot_migration_importing_slots:
slot_migration_keys_processed:N
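
While a migration is running, the source-side section might look something like this (values and exact formats are purely illustrative):

```
slot_migration_role:source
slot_migration_migrating_slots:0-99,200-299
slot_migration_state:SM_SRC_FULL_SYNC_IN_PROGRESS
slot_migration_start_time:1718000000
slot_migration_keys_processed:120000
slot_migration_keys_total:500000
slot_migration_perc:24%
slot_migration_last_duration_ms:95000
slot_migration_last_status:ok
slot_migration_last_failure_reason:
```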

Major components

Slot migration state machine/SlotMigrationCron (slotmigration.c)

A new cron job will be introduced that runs on every iteration of clusterCron as long as a slot migration is running. It will be responsible for driving the state machine of the ongoing slot migration (on the source node) forward.

These are the different states:

  • WAIT_TARGET_CONNECTION_SETUP: Establishes the connection with the target node and sends CLUSTER IMPORTSLOT START ... to it. The child process is forked.
  • SM_SRC_START: The request to migrate a slot has started. Starts the actual slot migration task.
  • SM_SRC_FULL_SYNC_IN_PROGRESS: The migration moves into this state once the fork has been triggered and is streaming data to the target. Once this is complete, the fork signals its parent process to begin transferring the accumulated output buffer.
  • SM_SRC_OUTPUT_TRANSFER_IN_PROGRESS: The full sync has completed. At this point, the primary is draining the client output buffer accumulated for that slot during the fork.
  • SM_SRC_WAIT_ACK: Once the buffer is drained, the primary initiates the commit of the slot by sending CLUSTER IMPORTSLOT END. At the beginning of this state, we start blocking all writes to the slot being migrated. We then wait for the acknowledgement from the target that it took ownership of the slot, which the target signals by publishing a cluster bus gossip message stating that it owns the slot with a higher epoch number. Receiving such a message tells the source that the target has successfully claimed the slot. Once the acknowledgement is received, we commit the slot transfer and begin purging the slot's data; the source node no longer owns the slot.
  • SM_SRC_CLEANUP: Final state, reached whether the migration was successful, cancelled or failed. Cleans up the slot migration state and closes the connection to the target.
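
A minimal sketch of how these states might be declared in slotmigration.c. The names follow the list above (with an SM_SRC_ prefix added to the connection-setup state for uniformity); this is illustrative, not the final implementation:

```c
/* Source-side slot migration states, mirroring the list above. Illustrative only. */
typedef enum slotMigrationState {
    SM_SRC_WAIT_TARGET_CONNECTION_SETUP, /* connect to target, send CLUSTER IMPORTSLOT START, fork */
    SM_SRC_START,                        /* begin migrating the next slot in the batch */
    SM_SRC_FULL_SYNC_IN_PROGRESS,        /* forked child streams the slot's point-in-time data */
    SM_SRC_OUTPUT_TRANSFER_IN_PROGRESS,  /* drain writes buffered for the slot during the fork */
    SM_SRC_WAIT_ACK,                     /* writes paused; wait for target to claim the slot */
    SM_SRC_CLEANUP                       /* success, cancel or failure: release state, close link */
} slotMigrationState;
```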

The states SM_SRC_START through SM_SRC_WAIT_ACK will be repeated for every slot that is being migrated. There will be one connection for all of the slots being migrated, and one fork per batch of slots being migrated.

From the point of view of the target node the following happens:

  • Receive CLUSTER IMPORTSLOT START from the source node: the start of the migration of a slot is replicated to the target's replicas and then acknowledged. We are now expecting to receive slot data on this link. The node processes the incoming commands from the source node to load the data into the slot, and relays the incoming slot data to its replicas.
  • Receive CLUSTER IMPORTSLOT END from the source node: the end of the slot migration is propagated to the target's replicas and then acknowledged. At this point the target node owns the slot and can start serving traffic for it. The target primary will synchronously replicate the CLUSTER IMPORTSLOT END to its replicas, waiting for their acknowledgment; this prevents a replica from ending up with the data without knowing the state of the migration.

Fork based slot migration full transfer (slotmigration_sync.c)

At the start of the slot migration, a client will be created on the source node to connect to the target node; this will reuse code from the MIGRATE command. At this point, the process will fork, and the fork will be used to produce the point-in-time transfer of the data in the slot. The data transfer to the target will occur on the main thread, with a named pipe used to move the data from the child to the parent. This link is what allows TLS to be used, since the connection's file descriptor cannot be shared between two processes without the internal TLS state diverging. The child process will periodically send information about the progress of the migration.
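
To make the plumbing concrete, here is a minimal, self-contained sketch of the child-to-parent data path (shown with an anonymous pipe() for brevity, whereas the design calls for a named pipe; the serialization and the target connection are stubbed out):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int pipefd[2];
    if (pipe(pipefd) == -1) { perror("pipe"); exit(1); }

    pid_t pid = fork();
    if (pid == -1) { perror("fork"); exit(1); }

    if (pid == 0) {
        /* Child: serialize the slot's keys (placeholder payload here) and write
         * them to the pipe. It never touches the (possibly TLS) connection. */
        close(pipefd[0]);
        const char *payload = "RESTORE <key> <ttl> <serialized value>\r\n";
        if (write(pipefd[1], payload, strlen(payload)) == -1) perror("write");
        close(pipefd[1]);
        _exit(0);
    }

    /* Parent (main thread): read from the pipe and forward the bytes to the
     * target over the existing connection, keeping the TLS state in one process. */
    close(pipefd[1]);
    char buf[4096];
    ssize_t n;
    while ((n = read(pipefd[0], buf, sizeof(buf))) > 0) {
        fwrite(buf, 1, (size_t)n, stdout); /* stand-in for "send to target" */
    }
    close(pipefd[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```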

Once the full sync is completed, the fork will send a status message to the parent process over a pipe to signal that it has finished. If the status indicates success, the primary will then begin transferring the replication traffic queued up for the relevant slots. All incoming write traffic will be checked for which slot it accesses; if that slot is one of the slots currently being migrated, the write will be queued on a dedicated buffer for the migration during propagation. Although the buffers could be kept independent for each slot, they will be grouped together in this implementation for simplicity.
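
A sketch of the propagation-time check described above, assuming a per-migration bitmap of slots and a shared output buffer (all names are hypothetical):

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384

typedef struct slotMigrationLink {
    unsigned char migrating_slots[CLUSTER_SLOTS / 8]; /* bitmap of slots in this batch */
    /* ... shared output buffer accumulated for the target ... */
} slotMigrationLink;

static int slotIsMigrating(slotMigrationLink *link, int slot) {
    return (link->migrating_slots[slot >> 3] & (1 << (slot & 7))) != 0;
}

/* Called during command propagation; 'slot' is the hash slot the write touched.
 * If that slot is part of the ongoing migration, the command is also appended
 * to the migration buffer so it can be drained after the full sync. */
void maybeBufferForMigration(slotMigrationLink *link, int slot,
                             const char *cmd, size_t cmdlen) {
    if (link == NULL || !slotIsMigrating(link, slot)) return;
    /* append cmd[0..cmdlen) to the link's shared migration buffer here */
    (void)cmd; (void)cmdlen;
}
```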

Once this queue has been completely drained, a special “client pause” will be started on the slots being migrated, which will block all writes against those slots. This pause will remain in effect while the coordinated handoff committing ownership to the target is performed.

If at any point in time, the connection between the source and target is killed, the migration will be aborted on both ends.

We will be forking once per batch of slots to be migrated, and serially sending the contents of each slot in the batch. There is an implicit tradeoff here between client output buffer (CoB) usage, copy-on-write (CoW) memory usage, and the time it takes to pause writes. We will optimize the number of slots migrated per fork based on performance testing; we expect there to be a sweet spot.

Async slot flush (slotflush.c)

A new slot state will be introduced: flushing. A slot that is flushing still has data but is unreachable, because the node no longer owns the slot. A node will be unable to import data into this slot until the data has been fully flushed. The flush will happen in a new cron job, triggered at the end of the SM_SRC_WAIT_ACK state.
Since the flush of a slot happens asynchronously, we can start migrating the next slot while a previous slot is still being flushed. If flushing a slot takes longer than migrating the next slot, the next slot to be flushed is queued up and flushed once the previous slots have been completely flushed.
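
A minimal sketch of the queueing behavior described above: slots waiting to be flushed sit in a FIFO, and the cron job frees a bounded amount of data per run so the event loop is never blocked for long (names and the deletion helper are placeholders):

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384
#define FLUSH_KEYS_PER_CYCLE 1024

static int flush_queue[CLUSTER_SLOTS];
static int flush_count = 0;

/* Queue a slot for asynchronous flushing (e.g. at the end of SM_SRC_WAIT_ACK). */
void slotFlushEnqueue(int slot) {
    if (flush_count < CLUSTER_SLOTS) flush_queue[flush_count++] = slot;
}

/* Placeholder: delete up to 'budget' keys from 'slot' and return how many keys
 * remain. A real implementation would unlink entries from the slot's keyspace. */
static size_t slotDeleteSomeKeys(int slot, size_t budget) {
    (void)slot; (void)budget;
    return 0;
}

/* Cron job: work on the slot at the head of the queue; move on to the next
 * queued slot only once the current one is completely flushed. */
void slotFlushCron(void) {
    if (flush_count == 0) return;
    if (slotDeleteSomeKeys(flush_queue[0], FLUSH_KEYS_PER_CYCLE) == 0) {
        for (int i = 1; i < flush_count; i++) flush_queue[i - 1] = flush_queue[i];
        flush_count--;
    }
}
```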

This might also benefit from the investigation done here: #10589. Having a dictionary per slot would make it very easy to “unlink” the entire dictionary, but will require more investigation.

Failure modes

Source or target node dies during transfer: In this failure mode, the migration will be aborted once some pre-defined timeouts are reached. Once the target realizes that the source has been disconnected, it will initiate a slot flush of the data it has received. Once the source realizes the target is dead, it will roll back the migration process.

Target primary dies during handoff: If the target primary dies after receiving the CLUSTER IMPORTSLOT END and committing it to its replicas, the newly promoted replica will be able to acknowledge slot ownership, completing the handoff. If there is no replica to acknowledge it, the slot will be lost along with the remaining slots on that node.

Target never acknowledges handoff: If we have sent CLUSTER IMPORTSLOT END but never receive acknowledgment that it was applied, we will wait for CLUSTER_NODE_TIMEOUT, at which point we will unblock the writes against the source. If the target is still alive, it will still be able to take over ownership of the slot via a cluster bus message claiming the slot with a higher epoch. This is consistent with OSS Redis behavior, which allows inconsistent writes between nodes that aren't talking to each other.
Note for cluster v2: the 2PC will be directed towards the topology director instead of the target node.

Target rejects a command: Any error received during the migration will result in a failure of the migration. Typical failures would be incorrect ACLs or OOM.

Notes about cluster v2

The same general strategy will apply. The only difference is that the handoff will be a 2PC facilitated through the FC/TD instead of just between the source and target.

Open questions

  • Should we include the Slot Migration client output buffer size as a new metric? Should we treat the target client as a replica client?

Comment updates

  • Update to do a single fork per batch of slots.
