Rdb channel replication #13732
LGTM (Reviewed internally) Ozan, nice work! |
I think it's now one line per replica, right? I think the two connections together can represent a single replica. |
|
Yes, there is one replica, one line. Single logical replica. |
|
@ShooterIT how about |
@ShooterIT let's pick one. Little bit long but I guess |
Co-authored-by: debing.sun <debing.sun@redis.com>
During fullsync, before loading the RDB on the replica, we stop the AOF child to prevent a copy-on-write disaster. Once the RDB is loaded, AOF is started again and it triggers an AOF rewrite. With #13732, this behavior was changed for rdbchannel replication: currently, we start AOF after the replication buffer is streamed to the db. This PR changes it back to start AOF just after the RDB is loaded (before the repl buffer is streamed).

Both approaches have pros and cons. If we start AOF before streaming repl buffers, we may still face copy-on-write issues, as repl buffers potentially include a large amount of changes. If we wait until the replication buffer is drained, we delay starting AOF persistence.

Additional changes introduced as part of this PR:

- Interface change:
  - Added `mem_replica_full_sync_buffer` field to the `INFO MEMORY` command reply. During full sync, it shows the total memory consumed by the accumulated replication stream buffer on the replica. Added the same metric to the `MEMORY STATS` command reply as the `replica.fullsync.buffer` field.
- Fixes:
  - Count the repl stream buffer size of the replica as part of the 'memory overhead' calculation for fields in the "INFO MEMORY" and "MEMORY STATS" outputs. Before this PR, the repl buffer was not counted as part of the memory overhead calculation, causing misreports for fields like `used_memory_overhead` and `used_memory_dataset` in "INFO MEMORY" and for the `overhead.total` field in the "MEMORY STATS" command reply.
  - Dismiss the replication stream buffer memory of the replica in the fork to reduce COW impact during a fork.
  - Fixed a few time-sensitive flaky tests, deleted a noop statement, and fixed some comments and fail messages in rdbchannel tests.
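As a rough illustration of how a monitoring script might read the new field, the sketch below parses an `INFO MEMORY`-style reply. The sample reply text and all of its values are invented for the example, and `parse_info` is a hypothetical helper, not part of Redis or of any client library.

```python
def parse_info(text):
    """Parse an INFO-style reply ("key:value" lines, "#" section headers) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


# Hypothetical reply captured on a replica during a full sync (values are invented).
SAMPLE_REPLY = """# Memory
used_memory:5242880
used_memory_overhead:2097152
used_memory_dataset:3145728
mem_replica_full_sync_buffer:1048576
"""

info = parse_info(SAMPLE_REPLY)
# Older servers do not report the field, so default to 0 when it is missing.
buffer_bytes = int(info.get("mem_replica_full_sync_buffer", 0))
overhead_bytes = int(info["used_memory_overhead"])
print(f"full sync buffer: {buffer_bytes} bytes, "
      f"{buffer_bytes / overhead_bytes:.0%} of reported overhead")
```

Because the buffer is now counted in the overhead calculation, comparing the two fields like this gives a quick sense of how much of the reported overhead comes from the accumulated replication stream.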
|
@tezc @YaacovHazan i noticed that |
|
@oranagra I feel like we discussed around this before, maybe as part of adding accumulation buffer size to Now, I think we can just do: This is what you are suggesting right? |
|
Yes. I also remember discussing that area, i suppose when we added it to the memory overhead metric. Indeed, it should be in server.repl_buffer_mem, but it should probably be in that info field. So I think the line you proposed is the right one. |
should or should not? I see that we use |
|
sorry. |
Before #13732, replicas were brought online immediately after the master wrote the last bytes of the RDB file to the socket. This behavior remains unchanged if rdbchannel replication is not used. However, with rdbchannel replication, the replica is brought online after receiving the first ack, which the replica sends after the RDB is loaded. To align the behavior, this PR reverts that change and puts the replica online once bgsave is done.

Additional changes:

- INFO field `mem_total_replication_buffers` will also contain `server.repl_full_sync_buffer.mem_used`, which shows the accumulated replication stream during rdbchannel replication on the replica side.
- Deleted debug level logging from some replication tests. These tests generate thousands of keys, and debug logging may cause per-key logging in some cases.
Now that we have the RDB channel from #13732, the child process can transfer the RDB in the background instead of it being handled by the main thread. So when redis-cli gets an RDB from the server, we can adopt this approach to reduce the main thread load.

---------

Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
This PR is based on:

redis#12109
valkey-io/valkey#60

Closes: redis#11678

**Motivation**

During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill the replica connection. This may cause a replication failure.

The main benefit of rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving the replication stream while the rdb channel is receiving the RDB.

This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, so the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step.

In summary, replication will be faster and master's performance during full syncs will improve.

**Implementation steps**

1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master know replica supports rdb channel.
2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends the given client id back to master over rdbchannel, so that master can associate these channels. (The initial replica connection will be referred to as main-channel.) Then, replica requests fullsync using the RDB channel.
3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver the replication stream starting at the snapshot end offset.
4. The master main process sends the replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates the replication stream in a local buffer, while the RDB is being loaded into memory.
5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed.

**Some details**

- Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as before.
- On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit the number of bytes to accumulate. If it is not set, replica inherits the `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection.

**API changes in INFO output:**

1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel.
   ```
   slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0
   ```
   Replica state changes in steps:
   - First, replica sends psync and receives +RDBCHANNELSYNC: `state=wait_bgsave`
   - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream`
   - After full sync: `state=online`
2. On replica side, replication stream buffering metrics:
   - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes.
   - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process.
   ```
   replica_full_sync_buffer_size:20485
   replica_full_sync_buffer_peak:1048560
   ```

**API changes in CLIENT LIST**

In `client list` output, rdbchannel clients will have the 'C' flag in addition to the 'S' replica flag:
```
id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0
```

**Config changes:**

- `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set (a value of 0), replica will inherit the `client-output-buffer-limit <replica>` hard limit config to limit accumulated data.
- `repl-rdb-channel` config is added as a hidden config. This is mostly for testing, as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions, and because rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). It affects both the master (not to respond to rdb channel requests) and the replica (not to declare the capability).

**Internal API changes:**

Changes that were introduced to Redis replication:
- New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities.
- If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request.
- When replica opens the rdbchannel connection, as part of the replconf command, it sends `rdb-channel 1` to let master know this is the rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of the replconf command so master can associate the channels.

**Testing:**

As rdbchannel replication is enabled by default, we run the whole test suite with it. However, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with the `repl-rdb-channel yes/no` config.

**Replica state diagram**
```
 *
 * Replica state machine
 *
 *
 * Main channel state
 * ┌───────────────────┐
 * │RECEIVE_PING_REPLY │
 * └────────┬──────────┘
 *          │ +PONG
 * ┌────────▼──────────┐
 * │SEND_HANDSHAKE     │                     RDB channel state
 * └────────┬──────────┘            ┌───────────────────────────────┐
 *          │+OK                ┌───► RDB_CH_SEND_HANDSHAKE         │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_AUTH_REPLY │        │    REPLCONF main-ch-client-id <clientid>
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │ RDB_CH_RECEIVE_AUTH_REPLY     │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_PORT_REPLY │        │                  │ +OK
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │  RDB_CH_RECEIVE_REPLCONF_REPLY│
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_IP_REPLY   │        │                  │ +OK
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │ RDB_CH_RECEIVE_FULLRESYNC     │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_CAPA_REPLY │        │                  │+FULLRESYNC
 * └────────┬──────────┘        │                  │Rdb delivery
 *          │                   │   ┌──────────────▼────────────────┐
 * ┌────────▼──────────┐        │   │ RDB_CH_RDB_LOADING            │
 * │SEND_PSYNC         │        │   └──────────────┬────────────────┘
 * └─┬─────────────────┘        │                  │ Done loading
 *   │PSYNC (use cached-master) │                  │
 * ┌─▼─────────────────┐        │                  │
 * │RECEIVE_PSYNC_REPLY│        │    ┌────────────►│ Replica streams replication
 * └─┬─────────────────┘        │    │             │ buffer into memory
 *   │                          │    │             │
 *   │+RDBCHANNELSYNC client-id │    │             │
 *   ├──────┬───────────────────┘    │             │
 *   │      │ Main channel           │             │
 *   │      │ accumulates repl data  │             │
 *   │   ┌──▼────────────────┐       │     ┌───────▼───────────┐
 *   │   │ REPL_TRANSFER     ├───────┘     │    CONNECTED      │
 *   │   └───────────────────┘             └────▲───▲──────────┘
 *   │                                          │   │
 *   │                                          │   │
 *   │  +FULLRESYNC    ┌───────────────────┐     │   │
 *   ├────────────────► REPL_TRANSFER     ├─────┘   │
 *   │                └───────────────────┘         │
 *   │  +CONTINUE                                   │
 *   └──────────────────────────────────────────────┘
 */
```

-----

This PR also contains changes and ideas from:

valkey-io/valkey#837
valkey-io/valkey#1173
valkey-io/valkey#804
valkey-io/valkey#945
valkey-io/valkey#989

---------

Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Moti Cohen <moticless@gmail.com>
Co-authored-by: naglera <anagler123@gmail.com>
Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
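The buffering rule described under "Some details" (a limit of 0 inherits the replica output buffer hard limit, and reaching the limit stops accumulation without failing the sync) can be sketched roughly as follows. This is a simplified model, not the actual Redis code: `effective_buffer_limit`, `FullSyncBuffer`, and their fields are hypothetical names, and treating a resulting limit of 0 as "unlimited" is an assumption made only for this sketch.

```python
from dataclasses import dataclass, field


def effective_buffer_limit(replica_full_sync_buffer_limit, replica_cob_hard_limit):
    """Resolve the accumulation limit: 0 means "not set", so inherit the
    client-output-buffer-limit hard limit configured for replicas."""
    if replica_full_sync_buffer_limit > 0:
        return replica_full_sync_buffer_limit
    return replica_cob_hard_limit


@dataclass
class FullSyncBuffer:
    limit: int                      # assumption: 0 is treated as unlimited here
    chunks: list = field(default_factory=list)
    size: int = 0                   # analogous to replica_full_sync_buffer_size
    peak: int = 0                   # analogous to replica_full_sync_buffer_peak

    def try_accumulate(self, data: bytes) -> bool:
        """Buffer main-channel data while the RDB is loading.

        Returns False once the limit would be exceeded. That is not an error:
        the replica just stops reading, so buffering continues on the master.
        """
        if self.limit and self.size + len(data) > self.limit:
            return False
        self.chunks.append(data)
        self.size += len(data)
        self.peak = max(self.peak, self.size)
        return True

    def drain(self):
        """After the RDB is loaded, hand back the buffered stream in order."""
        chunks, self.chunks, self.size = self.chunks, [], 0
        return chunks


buf = FullSyncBuffer(limit=effective_buffer_limit(0, 16))
print(buf.try_accumulate(b"SET k1 v1"))  # 9 bytes fit under the 16-byte limit
print(buf.try_accumulate(b"SET k2 v2"))  # 18 bytes would exceed it
print(buf.drain(), buf.size, buf.peak)
```

Keeping the peak separate from the current size mirrors the two INFO metrics above: the size drops back to zero once the stream is applied, while the peak records the high-water mark.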
## <a name="overview"></a> Overview

This PR is a joint effort with @ShooterIT. I'm just opening it on behalf of both of us.

This PR introduces Atomic Slot Migration (ASM) for Redis Cluster, a new mechanism for safely and efficiently migrating hash slots between nodes.

Redis Cluster distributes data across nodes using 16384 hash slots, each owned by a specific node. Sometimes slots need to be moved, for example to rebalance after adding or removing nodes, or to mitigate a hot shard that's overloaded. Before ASM, slot migration was non-atomic and client-dependent, relying on CLUSTER SETSLOT, GETKEYSINSLOT, MIGRATE commands, and client-side handling of ASK/ASKING replies. This process was complex, error-prone, slow and could leave clusters in inconsistent states after failures. Clients had to implement redirect logic, multi-key commands could fail mid-migration, and errors often resulted in orphaned keys or required manual cleanup. Several related discussions can be found in the issue list, some examples: #14300, #4937, #10370, #4333, #13122, #11312

Atomic Slot Migration (ASM) makes slot rebalancing safe, transparent, and reliable, addressing many of the limitations of the legacy migration method. Instead of moving keys one by one, ASM replicates the entire slot's data plus live updates to the target node, then performs a single atomic handoff. Clients keep working without handling ASK/ASKING replies, multi-key operations remain consistent, failures don't leave partial states, and replicas stay in sync. The migration process also completes significantly faster. Operators gain new commands (CLUSTER MIGRATION IMPORT, STATUS, CANCEL) for monitoring and control, while modules can hook into migration events for deeper integration.

### The problems of legacy method in detail

Operators and developers ran into multiple issues with the legacy method. Some of these issues in detail:

1. **Redirects and Client Complexity:** While a slot was being migrated, some keys were already moved while others were not. Clients had to handle `-ASK` redirections and send `ASKING` before reissuing requests to the target node. Not all client libraries implemented this correctly, leading to failed commands or subtle bugs. Even when implemented, it increased latency and broke naive pipelines.
2. **Multi-Key Operations Became Unreliable:** Commands like `MGET key1 key2` could fail with `TRYAGAIN` if part of the slot was already migrated. This made application logic unpredictable during resharding.
3. **Risk of failure:** Keys were moved one-by-one (with MIGRATE command). If the source crashed, or the destination ran out of memory, the system could be left in an inconsistent state: some keys moved, others lost, slots partially migrated. Manual intervention was often needed, sometimes resulting in data loss.
4. **Replica and Failover Issues:** Replicas weren't aware of migrations in progress. If a failover occurred mid-migration, manual intervention was required to clean up or resume the process safely.
5. **Operational Overhead:** Operators had to coordinate multiple commands (CLUSTER SETSLOT, MIGRATE, GETKEYSINSLOT, etc.) with little visibility into progress or errors, making rebalancing slow and error-prone.
6. **Poor performance:** Key-by-key migration was inherently slow and inefficient for large slot ranges.
7. **Large keys:** Large keys could fail to migrate or cause latency spikes on the destination node.

### How Atomic Slot Migration Fixes This

Atomic Slot Migration (ASM) eliminates all of these issues by:

1. **Clients:** Clients no longer need to handle ASK/ASKING; the migration is fully transparent.
2. **Atomic ownership transfer:** The entire slot's data (snapshot + live updates) is replicated and handed off in a single atomic step.
3. **Performance**: ASM completes migrations significantly faster by streaming slot data in parallel (snapshot + incremental updates) and eliminating key-by-key operations.
4. **Consistency guarantees:** Multi-key operations and pipelines continue to work reliably throughout migration.
5. **Resilience:** Failures no longer leave orphaned keys or partial states; migration tasks can be retried or safely cancelled.
6. **Replica awareness:** Replicas remain consistent during migration, and failovers will no longer leave partially imported keys.
7. **Operator visibility:** New CLUSTER MIGRATION subcommands (IMPORT, STATUS, CANCEL) provide clear observability and management for operators.

### ASM Diagram and Migration Steps

```
      ┌─────────────┐               ┌────────────┐     ┌───────────┐      ┌───────────┐ ┌───────┐
      │             │               │Destination │     │Destination│      │  Source   │ │Source │
      │  Operator   │               │   master   │     │  replica  │      │  master   │ │ Fork  │
      │             │               │            │     │           │      │           │ │       │
      └──────┬──────┘               └─────┬──────┘     └─────┬─────┘      └─────┬─────┘ └───┬───┘
             │                            │                  │                  │           │
             │                            │                  │                  │           │
             │CLUSTER MIGRATION IMPORT    │                  │                  │           │
             │   <start-slot> <end-slot>..│                  │                  │           │
             ├───────────────────────────►│                  │                  │           │
             │                            │                  │                  │           │
             │   Reply with <task-id>     │                  │                  │           │
             │◄───────────────────────────┤                  │                  │           │
             │                            │                  │                  │           │
             │                            │                  │                  │           │
             │                            │ CLUSTER SYNCSLOTS│SYNC              │           │
             │ CLUSTER MIGRATION STATUS   │ <task-id> <start-slot> <end-slot>.  │           │
Monitor      │   ID <task-id>             ├────────────────────────────────────►│           │
task      ┌─►├───────────────────────────►│                  │                  │           │
state     │  │                            │                  │                  │           │
till      │  │      Reply status          │ Negotiation with multiple channels  │           │
completed └─ │◄───────────────────────────┤      (i.e rdbchannel repl)          │           │
             │                            │◄───────────────────────────────────►│           │
             │                            │                  │                  │   Fork    │
             │                            │                  │                  ├──────────►│ ─┐
             │                            │                  │                  │           │  │
             │                            │   Slot snapshot as RESTORE commands │           │  │
             │                            │◄────────────────────────────────────────────────┤  │
             │                            │   Propagate      │                  │           │  │
      ┌─────────────┐                     ├─────────────────►│                  │           │  │
      │             │                     │                  │                  │           │  │ Snapshot
      │   Client    │                     │                  │                  │           │  │ delivery
      │             │                     │   Replication stream for slot range │           │  │ duration
      └──────┬──────┘                     │◄────────────────────────────────────┤           │  │
             │                            │   Propagate      │                  │           │  │
             │                            ├─────────────────►│                  │           │  │
             │                            │                  │                  │           │  │
             │    SET key value1          │                  │                  │           │  │
             ├─────────────────────────────────────────────────────────────────►│           │  │
             │         +OK                │                  │                  │           │ ─┘
             │◄─────────────────────────────────────────────────────────────────┤           │
             │                            │                  │                  │           │
             │                            │    Drain repl stream                │ ──┐
             │                            │◄────────────────────────────────────┤   │
             │    SET key value2          │                  │                  │   │
             ├─────────────────────────────────────────────────────────────────►│   │Write
             │                            │                  │                  │   │pause
             │                            │                  │                  │   │
             │                            │  Publish new config via cluster bus │   │
             │       +MOVED               ├────────────────────────────────────►│ ──┘
             │◄─────────────────────────────────────────────────────────────────┤ ──┐
             │                            │                  │                  │   │
             │                            │                  │                  │   │Trim
             │                            │                  │                  │ ──┘
             │     SET key value2         │                  │                  │
             ├───────────────────────────►│                  │                  │
             │         +OK                │                  │                  │
             │◄───────────────────────────┤                  │                  │
             │                            │                  │                  │
             │                            │                  │                  │
```

### New commands introduced

There are two new commands:

1. A command to start, monitor and cancel the migration operation: `CLUSTER MIGRATION <arg>`
2. An internal command to manage slot transfer between source and destination: `CLUSTER SYNCSLOTS <arg>`

For more details, please refer to the [New Commands](#new-commands) section. Internal command messaging is mostly omitted in the diagram for simplicity.

### Steps

1. Slot migration begins when the operator sends `CLUSTER MIGRATION IMPORT <start-slot> <end-slot> ...` to the destination master. The process is initiated from the destination node, similar to REPLICAOF. This approach allows us to reuse the same logic and share code with the new replication mechanism (see #13732). The command can include multiple slot ranges. The destination node creates one migration task per source node, regardless of how many slot ranges are specified. Upon successfully creating the task, the destination node replies to the IMPORT command with the assigned task ID. The operator can then monitor progress using `CLUSTER MIGRATION STATUS ID <task-id>`.
   When the task's state field changes to `completed`, the migration has finished successfully. Please see the [New Commands](#new-commands) section for the output sample.
2. After creating the migration task, the destination node will request replication of slots by using the internal command `CLUSTER SYNCSLOTS`.
3. Once the source node accepts the request, the destination node establishes another separate connection (similar to rdbchannel replication) so snapshot data and incremental changes can be transmitted in parallel.
4. Source node forks, then starts delivering snapshot content (as per-key RESTORE commands) from one connection and incremental changes from the other connection. The destination master starts applying commands from the snapshot connection and accumulates incremental changes. Applied commands are also propagated to the destination replicas via the replication backlog.

   Note: Only commands of related slots are delivered to the destination node. This is done by writing them to the migration client's output buffer, which serves as the replication stream for the migration operation.
5. Once the source node finishes delivering the snapshot and determines that the destination node has caught up (the remaining repl stream to consume went under a configured limit), it pauses write traffic for the entire server. After pausing the writes, the source node forwards any remaining write commands to the destination node.
6. Once the destination consumes all the writes, it bumps up the cluster config epoch and changes the configuration. The new config is published via cluster bus.
7. When the source node receives the new configuration, it can redirect clients and it begins trimming the migrated slots, while also resuming write traffic on the server.

### Internal slots synchronization state machine

1. The destination node performs authentication using the cluster secret introduced in #13763, and transmits its node ID information.
2. The destination node sends `CLUSTER SYNCSLOTS SYNC <task-id> <start-slot> <end-slot>` to initiate a slot synchronization request and establish the main channel. The source node responds with `+RDBCHANNELSYNCSLOTS`, indicating that the destination node should establish an RDB channel.
3. The destination node then sends `CLUSTER SYNCSLOTS RDBCHANNEL <task-id>` to establish the RDB channel, using the same task-id as in the previous step to associate the two connections as part of the same ASM task. The source node replies with `+SLOTSSNAPSHOT`, and forks a child process to transfer the slot snapshot.
4. The destination node applies the slot snapshot data received over the RDB channel, while proxying the command stream to replicas. At the same time, the main channel continues to read and buffer incremental commands in memory.
5. Once the source node finishes sending the slot snapshot, it notifies the destination node using the `CLUSTER SYNCSLOTS SNAPSHOT-EOF` command. The destination node then starts streaming the buffered commands while continuing to read and buffer incremental commands sent from the source.
6. The destination node periodically sends `CLUSTER SYNCSLOTS ACK <offset>` to inform the source of the applied data offset. When the offset gap meets the threshold, the source node pauses write operations. After all buffered data has been drained, it sends `CLUSTER SYNCSLOTS STREAM-EOF` to the destination node to hand off slots.
7. Finally, the destination node takes over slot ownership, updates the slot configuration and bumps the epoch, then broadcasts the updates via cluster bus. Once the source node detects the updated slot configuration, the slot migration process is complete.

### Error handling

- If the connection between the source and destination is lost (due to disconnection, output buffer overflow, OOM, or timeout), the destination node automatically restarts the migration from the beginning. The destination node will retry the operation until it is explicitly cancelled using the CLUSTER MIGRATION CANCEL <task-id> command.
- If a replica connection drops during migration, it can later resume with PSYNC, since the imported slot data is also written to the replication backlog.
- During the write pause phase, the source node sets a timeout. If the destination node fails to drain remaining replication data and update the config during that time, the source node assumes the destination has failed and automatically resumes normal writes for the migrating slots.
- On any error, the destination node triggers a trim operation to discard any partially imported slot data.
- If a node crashes while importing, unowned keys are deleted on startup.

### <a name="slot-snapshot-format-considerations"></a> Slot Snapshot Format Considerations

When the source node forks to deliver slot content, in theory, there are several possible formats for transmitting the snapshot data:

**Mini RDB**: A compact RDB file containing only the keys from the migrating slots. This format is efficient for transmission, but it cannot be easily forwarded to destination-side replicas.

**AOF format**: The source node can generate commands in AOF form (e.g., SET x y, HSET h f v) and stream them. Individual commands are easily appended to the replication stream and propagated to replicas. Large keys can also be split into multiple commands (incrementally reconstructing the value), similar to the AOF rewrite process.

**RESTORE commands**: Each key is serialized and sent as a `RESTORE` command. These can be appended directly to the destination's replication stream, though very large keys may make serialization and transmission less efficient.

We chose the `RESTORE` command as the default approach for the following reasons:

- It can be easily propagated to replicas.
- It is more efficient than AOF for most cases, and some module keys do not support the AOF format.
- For large **non-module** keys that are not strings, ASM automatically switches to the AOF-based key encoding as an optimization when the key's cardinality exceeds 512. This approach allows the key to be transferred in chunks rather than as a single large payload, reducing memory pressure and improving migration efficiency. In future versions, the RESTORE command may be enhanced to handle large keys more efficiently.

Some details:

- For RESTORE commands, Redis compresses keys by default. We disable compression while delivering RESTORE commands, as compression comes with a performance hit. Without compression, replication is several times faster.
- For string keys, we still prefer the AOF format (e.g. SET commands), as it is currently more efficient than RESTORE, especially for big keys.

### <a name="trimming-the-keys"></a> Trimming the keys

When a migration completes successfully, the source node deletes the migrated keys from its local database. Since the migrated slots may contain a large number of keys, this trimming process must be efficient and non-blocking.

In cluster mode, Redis maintains per-slot data structures for keys, expires, and subexpires. This organization makes it possible to efficiently detach all data associated with a given slot in a single step. During trimming, these slot-specific data structures are handed off to a background I/O (BIO) thread for asynchronous cleanup, similar to how FLUSHALL or FLUSHDB operate. This mechanism is referred to as background trimming, and it is the preferred and default method for ASM, ensuring that the main thread remains unblocked.

However, unlike Redis itself, some modules may not maintain per-slot data structures and therefore cannot drop related slots data in a single operation. To support these cases, Redis introduces active trimming, where key deletion occurs in the main thread instead. This is not a blocking operation: trimming runs incrementally in the main thread, periodically removing keys during the cron loop. Each deletion triggers a keyspace notification so that modules can react to individual key removals. While active trim is less efficient, it ensures backward compatibility for modules during the transition period.

Before starting the trim, Redis checks whether any module is subscribed to the newly added `REDISMODULE_NOTIFY_KEY_TRIMMED` keyspace event. If such subscribers exist, active trimming is used; otherwise, background trimming is triggered. Going forward, modules are expected to adopt background trimming to take advantage of its performance and scalability benefits, and active trimming will be phased out once modules migrate to the new model.

Redis also prefers active trimming if there is any client that is using the client tracking feature (see [client-side caching](https://redis.io/docs/latest/develop/reference/client-side-caching/)). In the current client tracking protocol, when a database is flushed (e.g., via the FLUSHDB command), a null value is sent to tracking clients to indicate that they should invalidate all locally cached keys. However, there is currently no mechanism to signal that only specific slots have been flushed. Iterating over all keys in the slots to be trimmed would be a blocking operation. To avoid this, if there is any client that is using the client tracking feature, Redis automatically switches to active trimming mode. In the future, the client tracking protocol can be extended to support slot-based invalidation, allowing background trimming to be used in these cases as well.

Finally, trimming may also be triggered after a migration failure. In such cases, the operation ensures that any partially imported or inconsistent slot data is cleaned up, maintaining cluster consistency and preventing stale keys from remaining in the source or destination nodes.
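The choice between the two trimming modes, and the incremental nature of active trimming, can be sketched as below. This is only a model of the rules described above: `choose_trim_mode`, `active_trim_step`, the per-tick `budget`, and the dict-based keyspace are all hypothetical and do not correspond to actual Redis functions or data structures.

```python
from enum import Enum


class TrimMode(Enum):
    BACKGROUND = "background"  # per-slot structures handed to a background thread
    ACTIVE = "active"          # keys removed incrementally in the main thread


def choose_trim_mode(module_subscribed_to_key_trimmed: bool, tracking_clients: int) -> TrimMode:
    """Active trimming is used if a module subscribes to the key-trimmed event
    or if any client uses client tracking; otherwise background trimming."""
    if module_subscribed_to_key_trimmed or tracking_clients > 0:
        return TrimMode.ACTIVE
    return TrimMode.BACKGROUND


def active_trim_step(pending_keys, db, budget, notify):
    """Delete at most `budget` keys in one cron tick (the budget is an assumed
    knob for this sketch) and notify per deleted key. Returns True when done."""
    for _ in range(min(budget, len(pending_keys))):
        key = pending_keys.pop()
        if key in db:
            del db[key]
            notify(key)  # stands in for the per-key keyspace notification
    return not pending_keys


db = {"a": 1, "b": 2, "c": 3}
pending = list(db)
trimmed = []
if choose_trim_mode(module_subscribed_to_key_trimmed=True, tracking_clients=0) is TrimMode.ACTIVE:
    while not active_trim_step(pending, db, budget=2, notify=trimmed.append):
        pass  # in a server, other work would run between ticks
print(sorted(trimmed), db)
```

Bounding the work per tick is what keeps active trimming from blocking the main thread, at the cost of taking several ticks to finish.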
Note about active trim: Subsequent migrations can complete while a prior trim is still running. In that case, the new migration’s trim job is queued and will start automatically after the current trim finishes. This does not affect slot ownership or client traffic; it only serializes the cleanup.

### <a name="replica-handling"></a> Replica handling

- During importing, new keys are propagated to the destination-side replica. The replica checks slot ownership before replying to commands like SCAN, KEYS, and DBSIZE, so that these unowned keys are not included in the reply.

  Also, when an import operation begins, the master now propagates an internal command through the replication stream, allowing replicas to recognize that an ASM operation is in progress. This is done by the internal `CLUSTER SYNCSLOTS CONF ASM-TASK` command in the replication stream. This enables replicas to trigger the relevant module events so that modules can adapt their behavior, for example by filtering out unowned keys from read-only requests during ASM operations. To support full sync with RDB delivery scenarios, a new AUX field is also added to the RDB: `cluster-asm-task`. Its value is a string in the format of `task_id:source_node:dest_node:operation:state:slot_ranges`.

- After a successful migration or on a failed import, the master will trim the keys. In that case, the master propagates a new command to the replica: `TRIMSLOTS RANGES <numranges> <start-slot> <end-slot> ...`. The replica starts trimming once this command is received.

### <a name="propagating-data-outside-the-keyspace"></a> Propagating data outside the keyspace

When the destination node is newly added to the cluster, certain data outside the keyspace may need to be propagated first. A common example is functions. Previously, redis-cli handled this by transferring functions when a new node was added.
With ASM, Redis now automatically dumps and sends functions to the destination node using `FUNCTION RESTORE ..REPLACE` command — done purely for convenience to simplify setup. Additionally, modules may also need to propagate their own data outside the keyspace. To support this, a new API has been introduced: `RM_ClusterPropagateForSlotMigration()`. See the [Module Support](#module-support) section for implementation details. ### Limitations 1. Single migration at a time: Only one ASM migration operation is allowed at a time. This limitation simplifies the current design but can be extended in the future. 2. Large key handling: For large keys, ASM switches to AOF encoding to deliver key data in chunks. This mechanism currently applies only to non-module keys. In the future, the RESTORE command may be extended to support chunked delivery, providing a unified solution for all key types. See [Slot Snapshot Format Considerations](#slot-snapshot-format-considerations) for details. 3. There are several cases that may cause an Atomic Slot Migration (ASM) to be aborted (can be retried afterwards): - FLUSHALL / FLUSHDB: These commands introduce complexity during ASM. For example, if executed on the migrating node, they must be propagated only for the migrating slots. However, when combined with active trimming, their execution may need to be deferred until it is safe to proceed, adding further complexity to the process. - FAILOVER: The replica cannot resume the migration process. Migration should start from the beginning. - Module propagates cross-slot command during ASM via RM_Replicate(): If this occurs on the migrating node, Redis cannot split the command to propagate only the relevant slots to the ASM destination. To keep the logic simple and consistent, ASM is cancelled in this case. Modules should avoid propagating cross-slot commands during migration. 
- CLIENT PAUSE: The import task cannot progress during a write pause, as doing so would violate the guarantee that no writes occur during migration. To keep things simple, the ASM task is aborted when CLIENT PAUSE is active. - Manual Slot Configuration Changes: If slot configuration is modified manually during ASM (for example, when legacy migration methods are mixed with ASM), the process is aborted. Note: This situation is highly unexpected — users should not combine ASM with legacy migration methods. 4. When active trimming is enabled, a node must not re-import the same slots while trimming for those slots is still in progress. Otherwise, it can’t distinguish newly imported keys from pre-existing ones, and the trim cron might delete the incoming keys by mistake. In this state, the node rejects IMPORT operation for those slots until trimming completes. If the master has finished trimming but a replica is still trimming, master may still start the import operation for those slots. So, the replica checks whether the master is sending commands for those slots; if so, it blocks the master’s client connection until trimming finishes. This is a corner case, but we believe the behavior is reasonable for now. In the worst case, the master may drop the replica (e.g., buffer overrun), triggering a new full sync. # API Changes ## <a name="new-commands"></a> New Commands ### Public commands 1. **Syntax:** `CLUSTER MIGRATION IMPORT <start-slot> <end-slot> [<start-slot> <end-slot>]...` **Args:** Slot ranges **Reply:** - String task ID - -ERR <message> on failure (e.g. invalid slot range) **Description:** Executes on the destination master. Accepts multiple slot ranges and triggers atomic migration for the specified ranges. Returns a task ID that can be used to monitor the status of the task. In CLUSTER MIGRATION STATUS output, “state” field will be `completed` on a successful operation. 2. 
**Syntax:** `CLUSTER MIGRATION CANCEL [ID <id> | ALL]` **Args:** Task ID or ALL **Reply:** Number of cancelled tasks **Description:** Cancels an ongoing migration task by its ID or cancels all tasks if ALL is specified. Note: Cancelling a task on the source node does not stop the migration on the destination node, which will continue retrying until it is also cancelled there. 3. **Syntax:** `CLUSTER MIGRATION STATUS [ID <id> | ALL]` **Args:** Task ID or ALL - **ID:** If provided, returns the status of the specified migration task. - **ALL:** Lists the status of all migration tasks. **Reply:** - A list of migration task details (both ongoing and completed ones). - Empty list if the given task ID does not exist. **Description:** Displays the status of all current and completed atomic slot migration tasks. If a specific task ID is provided, it returns detailed information for that task only. **Sample output:** ``` 127.0.0.1:5001> cluster migration status all 1) 1) "id" 2) "24cf41718b20f7f05901743dffc40bc9b15db339" 3) "slots" 4) "0-1000" 5) "source" 6) "1098d90d9ba2d1f12965442daf501ef0b6667bec" 7) "dest" 8) "b3b5b426e7ea6166d1548b2a26e1d5adeb1213ac" 9) "operation" 10) "migrate" 11) "state" 12) "completed" 13) "last_error" 14) "" 15) "retries" 16) "0" 17) "create_time" 18) "1759694528449" 19) "start_time" 20) "1759694528449" 21) "end_time" 22) "1759694528464" 23) "write_pause_ms" 24) "10" ``` ### Internal commands 1. **Syntax:** `CLUSTER SYNCSLOTS <arg> ...` **Args:** Internal messaging operations **Reply:** +OK or -ERR <message> on failure (e.g. invalid slot range) **Description:** Used for internal communication between source and destination nodes. e.g. handshaking, establishing multiple channels, triggering handoff. 2. 
**Syntax:** `TRIMSLOTS RANGES <numranges> <start-slot> <end-slot> ...`
**Args:** Slot ranges to trim
**Reply:** +OK
**Description:** Master propagates it to replica so that replica can trim unowned keys after a successful migration or on a failed import.

## New configs

- `cluster-slot-migration-max-archived-tasks`: To list them in `CLUSTER MIGRATION STATUS ALL` output, Redis keeps the last n migration tasks in memory. This config controls the maximum number of archived ASM tasks. Default value: 32, used as a hidden config
- `cluster-slot-migration-handoff-max-lag-bytes`: After the slot snapshot is completed, if the remaining replication stream size falls below this threshold, the source node pauses writes to hand off slot ownership. A higher value may trigger the handoff earlier but can lead to a longer write pause, since more data remains to be replicated. A lower value can result in a shorter write pause, but it may be harder to reach the threshold if there is a steady flow of incoming writes. Default value: 1MB
- `cluster-slot-migration-write-pause-timeout`: The maximum duration (in milliseconds) that the source node pauses writes during ASM handoff. After pausing writes, if the destination node fails to take over the slots within this timeout (for example, due to a cluster configuration update failure), the source node assumes the migration has failed and resumes writes to prevent indefinite blocking. Default value: 10 seconds
- `cluster-slot-migration-sync-buffer-drain-timeout`: Timeout in milliseconds for the sync buffer to be drained during ASM. After the destination applies the accumulated buffer, the source continues sending commands for migrating slots. The destination keeps applying them, but the gap may remain above the acceptable limit (see `cluster-slot-migration-handoff-max-lag-bytes`), which can cause endless synchronization. A timeout check is required to handle this case.
  The timeout is calculated as **the maximum of two values**:
  - A configurable timeout (`cluster-slot-migration-sync-buffer-drain-timeout`) to avoid false positives.
  - A dynamic timeout based on the time that the destination took to apply the slot snapshot and the accumulated buffer during slot snapshot delivery. The destination should be able to drain the remaining sync buffer in less time than this. We multiply it by 2 to be more conservative.

  Default value: 60000 milliseconds, used as a hidden config

## New flag in CLIENT LIST

- The client responsible for importing slots is marked with the `o` flag.
- The client responsible for migrating slots is marked with the `g` flag.

## New INFO fields

- `mem_cluster_slot_migration_output_buffer`: Memory usage of the migration client’s output buffer. Redis writes incoming changes to this buffer during the migration process.
- `mem_cluster_slot_migration_input_buffer`: Memory usage of the accumulated replication stream buffer on the importing node.
- `mem_cluster_slot_migration_input_buffer_peak`: Peak accumulated replication buffer size on the importing side.

## New CLUSTER INFO fields

- `cluster_slot_migration_active_tasks`: Number of in-progress ASM tasks. Currently, it will be 1 or 0.
- `cluster_slot_migration_active_trim_running`: Number of active trim jobs in progress and scheduled.
- `cluster_slot_migration_active_trim_current_job_keys`: Number of keys scheduled for deletion in the current trim job.
- `cluster_slot_migration_active_trim_current_job_trimmed`: Number of keys already deleted in the current trim job.
- `cluster_slot_migration_stats_active_trim_started`: Total number of trim jobs that have started since the process began.
- `cluster_slot_migration_stats_active_trim_completed`: Total number of trim jobs completed since the process began.
- `cluster_slot_migration_stats_active_trim_cancelled`: Total number of trim jobs cancelled since the process began.
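As a rough illustration of how a monitoring tool might consume the CLUSTER INFO fields listed above, the sketch below reads two of them from a `field:value` text reply and derives how many keys the current active trim job still has to delete. The helper names `sketch_info_field` and `sketch_trim_remaining` are hypothetical, and treating "keys scheduled minus keys trimmed" as the remaining work is an assumption drawn from the field descriptions, not something the source states explicitly.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Looks for a line of the form "<field>:<integer>" in INFO-style text
 * (lines separated by "\r\n" or "\n"). Returns 1 and stores the value in
 * *out on success, 0 if the field is missing or has no numeric value.
 * Hypothetical helper, not part of any Redis client library. */
int sketch_info_field(const char *info, const char *field, long long *out) {
    size_t flen = strlen(field);
    const char *line = info;
    assert(info != NULL && field != NULL && out != NULL);
    while (line && *line) {
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            char *end;
            long long v = strtoll(line + flen + 1, &end, 10);
            if (end == line + flen + 1) return 0; /* no digits after ':' */
            *out = v;
            return 1;
        }
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
    return 0;
}

/* Returns the number of keys the current trim job still has to delete,
 * or -1 if either field is absent from the reply. */
long long sketch_trim_remaining(const char *info) {
    long long total, trimmed;
    if (!sketch_info_field(info,
            "cluster_slot_migration_active_trim_current_job_keys", &total))
        return -1;
    if (!sketch_info_field(info,
            "cluster_slot_migration_active_trim_current_job_trimmed", &trimmed))
        return -1;
    return total > trimmed ? total - trimmed : 0;
}
```

The lookup requires an exact field name followed by `:`, so a field whose name merely starts with the same prefix is not matched by mistake.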
## Changes in RDB format

A new aux field is added to RDB: `cluster-asm-task`. When an import operation begins, the master now propagates an internal command through the replication stream, allowing replicas to recognize that an ASM operation is in progress. This enables replicas to trigger the relevant module events so that modules can adapt their behavior, for example by filtering out unowned keys from read-only requests during ASM operations. To support RDB delivery scenarios, a new field is added to the RDB. See [replica handling](#replica-handling)

## Bug fix

- Fix memory leak when processing forgetting node type message
- Fix data race of writing reply to replica client directly when enabling multi-threading

We don't plan to backport them to old versions, since they are very rare cases.

## Keys visibility

When performing atomic slot migration, during key importing on the destination node or key trimming on the source/destination, these keys will be filtered out in the following commands:
- KEYS
- SCAN
- RANDOMKEY
- CLUSTER GETKEYSINSLOT
- DBSIZE
- CLUSTER COUNTKEYSINSLOT

The only command that will reflect the increasing number of keys is:
- INFO KEYSPACE

## <a name="module-support"></a> Module Support

**NOTE:** Please read the [trimming](#trimming-the-keys) section to see how ASM decides on the trimming method when there are modules in use.

### New notification:

```c
#define REDISMODULE_NOTIFY_KEY_TRIMMED (1<<17)
```

When a key is deleted by the active trim operation, this notification will be sent to subscribed modules. Also, ASM will automatically choose the trimming method depending on whether there are any subscribers to this new event.
Please see the further details here: [trimming](#trimming-the-keys)

### New struct in the API:

```c
typedef struct RedisModuleSlotRange {
    uint16_t start;
    uint16_t end;
} RedisModuleSlotRange;

typedef struct RedisModuleSlotRangeArray {
    int32_t num_ranges;
    RedisModuleSlotRange ranges[];
} RedisModuleSlotRangeArray;
```

### New Events

#### 1. REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION (RedisModuleEvent_ClusterSlotMigration)

These events notify modules about different stages of Atomic Slot Migration (ASM) operations such as when import or migration starts, fails, or completes. Modules can use these notifications to track cluster slot movements or perform custom logic during ASM transitions.

```c
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_STARTED 0
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_FAILED 1
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_COMPLETED 2
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_STARTED 3
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_FAILED 4
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_COMPLETED 5
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE 6
```

Parameter to these events:

```c
typedef struct RedisModuleClusterSlotMigrationInfo {
    uint64_t version; /* Not used since this structure is never passed
                         from the module to the core right now. Here
                         for future compatibility. */
    char source_node_id[REDISMODULE_NODE_ID_LEN + 1];
    char destination_node_id[REDISMODULE_NODE_ID_LEN + 1];
    const char *task_id;
    RedisModuleSlotRangeArray* slots;
} RedisModuleClusterSlotMigrationInfoV1;

#define RedisModuleClusterSlotMigrationInfo RedisModuleClusterSlotMigrationInfoV1
```

#### 2. REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM (RedisModuleEvent_ClusterSlotMigrationTrim)

These events inform modules about the lifecycle of ASM key trimming operations.
Modules can use them to detect when trimming starts, completes, or is performed asynchronously in the background. ```c #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_STARTED 0 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_COMPLETED 1 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_BACKGROUND 2 ``` Parameter to these events: ```c typedef struct RedisModuleClusterSlotMigrationTrimInfo { uint64_t version; /* Not used since this structure is never passed from the module to the core right now. Here for future compatibility. */ RedisModuleSlotRangeArray* slots; } RedisModuleClusterSlotMigrationTrimInfoV1; #define RedisModuleClusterSlotMigrationTrimInfo RedisModuleClusterSlotMigrationTrimInfoV1 ``` ### New functions ```c /* Returns 1 if keys in the specified slot can be accessed by this node, 0 otherwise. * * This function returns 1 in the following cases: * - The slot is owned by this node or by its master if this node is a replica * - The slot is being imported under the old slot migration approach (CLUSTER SETSLOT <slot> IMPORTING ..) * - Not in cluster mode (all slots are accessible) * * Returns 0 for: * - Invalid slot numbers (< 0 or >= 16384) * - Slots owned by other nodes */ int RM_ClusterCanAccessKeysInSlot(int slot); /* Propagate commands along with slot migration. * * This function allows modules to add commands that will be sent to the * destination node before the actual slot migration begins. It should only be * called during the REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE event. * * This function can be called multiple times within the same event to * replicate multiple commands. All commands will be sent before the * actual slot data migration begins. * * Note: This function is only available in the fork child process just before * slot snapshot delivery begins. 
* * On success REDISMODULE_OK is returned, otherwise * REDISMODULE_ERR is returned and errno is set to the following values: * * * EINVAL: function arguments or format specifiers are invalid. * * EBADF: not called in the correct context, e.g. not called in the REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE event. * * ENOENT: command does not exist. * * ENOTSUP: command is cross-slot. * * ERANGE: command contains keys that are not within the migrating slot range. */ int RM_ClusterPropagateForSlotMigration(RedisModuleCtx *ctx, const char *cmdname, const char *fmt, ...); /* Returns the locally owned slot ranges for the node. * * An optional `ctx` can be provided to enable auto-memory management. * If cluster mode is disabled, the array will include all slots (0–16383). * If the node is a replica, the slot ranges of its master are returned. * * The returned array must be freed with RM_ClusterFreeSlotRanges(). */ RedisModuleSlotRangeArray *RM_ClusterGetLocalSlotRanges(RedisModuleCtx *ctx); /* Frees a slot range array returned by RM_ClusterGetLocalSlotRanges(). * Pass the `ctx` pointer only if the array was created with a context. */ void RM_ClusterFreeSlotRanges(RedisModuleCtx *ctx, RedisModuleSlotRangeArray *slots); ``` ## ASM API for alternative cluster implementations Following #12742, Redis cluster code was restructured to support alternative cluster implementations. Redis uses cluster_legacy.c implementation by default. This PR adds a generic ASM API so alternative implementations can initiate and coordinate Atomic Slot Migration (ASM) while Redis executes the data movement and emits state changes. Documentation rests in `cluster.h`: ```c There are two new functions: /* Called by cluster implementation to request an ASM operation. (cluster impl --> redis) */ int clusterAsmProcess(const char *task_id, int event, void *arg, char **err); /* Called when an ASM event occurs to notify the cluster implementation. 
(redis --> cluster impl) */ int clusterAsmOnEvent(const char *task_id, int event, void *arg); ``` ```c /* API for alternative cluster implementations to start and coordinate * Atomic Slot Migration (ASM). * * These two functions drive ASM for alternative cluster implementations. * - clusterAsmProcess(...) impl -> redis: initiates/advances/cancels ASM operations * - clusterAsmOnEvent(...) redis -> impl: notifies state changes * * Generic steps for an alternative implementation: * - On destination side, implementation calls clusterAsmProcess(ASM_EVENT_IMPORT_START) * to start an import operation. * - Redis calls clusterAsmOnEvent() when an ASM event occurs. * - On the source side, Redis will call clusterAsmOnEvent(ASM_EVENT_HANDOFF_PREP) * when slots are ready to be handed off and the write pause is needed. * - Implementation stops the traffic to the slots and calls clusterAsmProcess(ASM_EVENT_HANDOFF) * - On the destination side, Redis calls clusterAsmOnEvent(ASM_EVENT_TAKEOVER) * when destination node is ready to take over the slot, waiting for ownership change. * - Cluster implementation updates the config and calls clusterAsmProcess(ASM_EVENT_DONE) * to notify Redis that the slots ownership has changed. * * Sequence diagram for import: * - Note: shows only the events that cluster implementation needs to react. 
* * ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ * │ Destination │ │ Destination │ │ Source │ │ Source │ * │ Cluster impl │ │ Master │ │ Master │ │ Cluster impl │ * └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ * │ │ │ │ * │ ASM_EVENT_IMPORT_START │ │ │ * ├─────────────────────────────►│ │ │ * │ │ CLUSTER SYNCSLOTS <arg> │ │ * │ ├────────────────────────►│ │ * │ │ │ │ * │ │ SNAPSHOT(restore cmds) │ │ * │ │◄────────────────────────┤ │ * │ │ Repl stream │ │ * │ │◄────────────────────────┤ │ * │ │ │ ASM_EVENT_HANDOFF_PREP │ * │ │ ├────────────────────────────►│ * │ │ │ ASM_EVENT_HANDOFF │ * │ │ │◄────────────────────────────┤ * │ │ Drain repl stream │ │ * │ │◄────────────────────────┤ │ * │ ASM_EVENT_TAKEOVER │ │ │ * │◄─────────────────────────────┤ │ │ * │ │ │ │ * │ ASM_EVENT_DONE │ │ │ * ├─────────────────────────────►│ │ ASM_EVENT_DONE │ * │ │ │◄────────────────────────────┤ * │ │ │ │ */ #define ASM_EVENT_IMPORT_START 1 /* Start a new import operation (destination side) */ #define ASM_EVENT_CANCEL 2 /* Cancel an ongoing import/migrate operation (source and destination side) */ #define ASM_EVENT_HANDOFF_PREP 3 /* Slot is ready to be handed off to the destination shard (source side) */ #define ASM_EVENT_HANDOFF 4 /* Notify that the slot can be handed off (source side) */ #define ASM_EVENT_TAKEOVER 5 /* Ready to take over the slot, waiting for config change (destination side) */ #define ASM_EVENT_DONE 6 /* Notify that import/migrate is completed, config is updated (source and destination side) */ #define ASM_EVENT_IMPORT_PREP 7 /* Import is about to start, the implementation may reject by returning C_ERR */ #define ASM_EVENT_IMPORT_STARTED 8 /* Import started */ #define ASM_EVENT_IMPORT_FAILED 9 /* Import failed */ #define ASM_EVENT_IMPORT_COMPLETED 10 /* Import completed (config updated) */ #define ASM_EVENT_MIGRATE_PREP 11 /* Migrate is about to start, the implementation may reject by returning C_ERR */ #define 
ASM_EVENT_MIGRATE_STARTED 12 /* Migrate started */
#define ASM_EVENT_MIGRATE_FAILED 13 /* Migrate failed */
#define ASM_EVENT_MIGRATE_COMPLETED 14 /* Migrate completed (config updated) */
```

Co-authored-by: Yuan Wang <yuan.wang@redis.com>
This PR is based on:
#12109
valkey-io/valkey#60
Closes: #11678
Motivation
During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure.
The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB.
This feature also helps to reduce the CPU load of the master's main process. With a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and streams the RDB directly to the replicas. Before this change, due to a TLS connection restriction, the bgsave process wrote RDB bytes to a pipe and the main process forwarded them to the replica. This is no longer necessary, so the main process avoids these expensive socket read/write syscalls. It also means RDB delivery to the replica is faster, as this forwarding step is skipped.
In summary, replication will be faster and master's performance during full syncs will improve.
Implementation steps
Some details
- Rdbchannel replication is used only if `repl-diskless-sync` is enabled on master. Otherwise, replication happens over a single connection as before.
- `replica-full-sync-buffer-limit` limits the number of bytes the replica accumulates. If it is not set, the replica inherits the `client-output-buffer-limit <replica>` hard limit config. If this limit is reached, the replica stops accumulating. This is not a failure scenario, though: further accumulation happens on the master side. Depending on the limits configured on master, master may kill the replica connection.

API changes in INFO output:
- `send_bulk_and_stream`: Indicates that full sync is still in progress for this replica. The replica is receiving the replication stream and the RDB in parallel.

Replica state changes in steps:
- `state=wait_bgsave`
- `state=send_bulk_and_stream`
- `state=online`

API changes in CLIENT LIST
In `client list` output, rdbchannel clients will have the 'C' flag in addition to the 'S' replica flag.

Config changes:
- `replica-full-sync-buffer-limit`: Controls how much replication data the replica can accumulate during rdbchannel replication. A value of 0 (not set) means the replica inherits the `client-output-buffer-limit <replica>` hard limit config to limit accumulated data.
- `repl-rdb-channel`: Added as a hidden config. This is mostly for testing, as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions, and because rdbchannel replication will not be enabled if `repl-diskless-sync` is not enabled). It affects both the master (which will not respond to rdb channel requests) and the replica (which will not declare the capability).

Internal API changes:
Changes that were introduced to Redis replication:
- `capa rdb-channel-repl`: Indicates that the replica is capable of rdb channel replication. The replica sends it, along with its other capabilities, when it connects to the master.
- `+RDBCHANNELSYNC <client-id>`: Sent in reply to the replica's PSYNC request.
- `rdb-channel 1`: Sent to let the master know this connection is the rdb channel. The replica also sends `main-ch-client-id <client-id>` as part of the replconf command so the master can associate the two channels.

Testing:
As rdbchannel replication is enabled by default, we run the whole test suite with it. However, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with the `repl-rdb-channel yes/no` config.

Replica state diagram
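The diagram itself is not reproduced in this text. As a partial stand-in, the sketch below models only the three replica states that the master reports in INFO output, in the order given under "Replica state changes in steps". The enum and function names are hypothetical and are not the identifiers used in the Redis source; the replica's own internal states are not covered.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum for illustration only; it mirrors the order of the
 * state= values listed in the PR description. */
typedef enum {
    SKETCH_WAIT_BGSAVE = 0,      /* state=wait_bgsave */
    SKETCH_SEND_BULK_AND_STREAM, /* state=send_bulk_and_stream */
    SKETCH_ONLINE,               /* state=online */
    SKETCH_STATE_COUNT
} sketch_replica_state;

static const char *const sketch_state_names[SKETCH_STATE_COUNT] = {
    "wait_bgsave", "send_bulk_and_stream", "online"
};

/* Returns the string shown after "state=" for the given state. */
const char *sketch_state_name(sketch_replica_state s) {
    assert((int)s >= 0 && (int)s < SKETCH_STATE_COUNT);
    return sketch_state_names[s];
}

/* Advances to the next state in the listed order; "online" is treated
 * as the final state and maps to itself. */
sketch_replica_state sketch_next_state(sketch_replica_state s) {
    assert((int)s >= 0 && (int)s < SKETCH_STATE_COUNT);
    return s == SKETCH_ONLINE ? SKETCH_ONLINE : (sketch_replica_state)(s + 1);
}
```

Failure paths (for example, the master dropping the replica connection) are intentionally left out, since the text above does not describe how they map to states.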
This PR also contains changes and ideas from:
valkey-io/valkey#837
valkey-io/valkey#1173
valkey-io/valkey#804
valkey-io/valkey#945
valkey-io/valkey#989
Co-authored-by: Yuan Wang wangyuancode@163.com
Co-authored-by: debing.sun debing.sun@redis.com
Co-authored-by: Moti Cohen moticless@gmail.com
Co-authored-by: naglera anagler123@gmail.com
Co-authored-by: Amit Nagler 58042354+naglera@users.noreply.github.com
Co-authored-by: Madelyn Olson madelyneolson@gmail.com
Co-authored-by: Binbin binloveplay1314@qq.com
Co-authored-by: Viktor Söderqvist viktor.soderqvist@est.tech
Co-authored-by: Ping Xie pingxie@outlook.com
Co-authored-by: Ran Shidlansik ranshid@amazon.com
Co-authored-by: ranshid 88133677+ranshid@users.noreply.github.com
Co-authored-by: xbasel 103044017+xbasel@users.noreply.github.com