Skip to content

Reuse RDB file as AOF preamble on replica after disk-based full sync#1901

Merged
zuiderkwast merged 2 commits into
valkey-io:unstablefrom
RayaCoo:use-rdb-start-appendonly
Mar 11, 2026
Merged

Reuse RDB file as AOF preamble on replica after disk-based full sync#1901
zuiderkwast merged 2 commits into
valkey-io:unstablefrom
RayaCoo:use-rdb-start-appendonly

Conversation

@RayaCoo

@RayaCoo RayaCoo commented Mar 31, 2025

Copy link
Copy Markdown
Contributor

Summary

When a replica finishes a disk-based full sync and both appendonly yes and
aof-use-rdb-preamble yes are enabled, this PR reuses the received dump.rdb
as the new AOF base file, instead of triggering BGREWRITEAOF.

This avoids generating an almost identical base snapshot and removes redundant CPU/IO work.

Before

  1. Replica receives RDB via disk-based full sync
  2. Replica loads RDB into memory
  3. Replica may delete synced RDB (rdb-del-sync-files)
  4. BGREWRITEAOF runs and generates a new base snapshot
  5. New snapshot becomes AOF base

After

  1. Replica receives RDB via disk-based full sync
  2. Replica loads RDB into memory
  3. Synced RDB is renamed to the new AOF base file
  4. A new incremental AOF is created
  5. No BGREWRITEAOF is needed

Tests

  • Disk-based sync + preamble on: RDB reused as AOF base
  • New writes after sync: incr AOF works correctly
  • Replica restart: reused AOF loads correctly
  • Disk-based sync + preamble off: fallback to BGREWRITEAOF
  • Diskless sync + preamble on: fallback to BGREWRITEAOF
  • Diskless sync with stale local dump.rdb: stale file is not reused

@codecov

codecov Bot commented Mar 31, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (2b0a1c3) to head (5370cde).
⚠️ Report is 55 commits behind head on unstable.

Additional details and impacted files
@@       Coverage Diff        @@
##   unstable   #1901   +/-   ##
================================
================================
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@arthurkiller

Copy link
Copy Markdown
Member

relying on slow failure-prone components like disks is not a good choice for a high-speed kv engine. For example, Redis 7.0 has been using diskless replication by default since then, which is exactly why they went with this approach

@RayaCoo RayaCoo force-pushed the use-rdb-start-appendonly branch from 96eaf5a to 0be9fb9 Compare March 31, 2025 05:33
@RayaCoo

RayaCoo commented Mar 31, 2025

Copy link
Copy Markdown
Contributor Author

relying on slow failure-prone components like disks is not a good choice for a high-speed kv engine. For example, Redis 7.0 has been using diskless replication by default since then, which is exactly why they went with this approach

I understand your point about diskless replication. However, this PR only optimizes a specific configuration where disk-based replication is used with aof-use-rdb-preamble enabled. In this scenario, we're currently deleting an RDB file only to recreate an identical one immediately after. This PR simply eliminates this redundancy without changing any default behaviors.

@RayaCoo RayaCoo force-pushed the use-rdb-start-appendonly branch from 5cf6232 to 0be9fb9 Compare March 31, 2025 05:57
@RayaCoo RayaCoo marked this pull request as draft March 31, 2025 05:59
@arthurkiller

Copy link
Copy Markdown
Member

relying on slow failure-prone components like disks is not a good choice for a high-speed kv engine. For example, Redis 7.0 has been using diskless replication by default since then, which is exactly why they went with this approach

I understand your point about diskless replication. However, this PR only optimizes a specific configuration where disk-based replication is used with aof-use-rdb-preamble enabled. In this scenario, we're currently deleting an RDB file only to recreate an identical one immediately after. This PR simply eliminates this redundancy without changing any default behaviors.

thx for your kind reply. I have checked the code and get what this pr want

@soloestoy soloestoy added enhancement New feature or request performance labels Mar 31, 2025

@soloestoy soloestoy left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally LGTM, just need some minor changes.

Comment thread src/server.h Outdated
Comment thread src/aof.c Outdated
@soloestoy

Copy link
Copy Markdown
Member

reuse the sync file as the base AOF file can save lots of resource usage, @valkey-io/core-team PTAL.

@RayaCoo RayaCoo marked this pull request as ready for review April 1, 2025 03:53
@RayaCoo RayaCoo changed the title use rdb to restart AOF after primary-replica full sync reuse rdb file to restart AOF after primary-replica full sync Apr 1, 2025
Comment thread src/aof.c Outdated
@RayaCoo RayaCoo force-pushed the use-rdb-start-appendonly branch from efa9a9f to 18f1bf9 Compare April 2, 2025 07:16
@madolson madolson moved this to Optional for next release candidate in Valkey 9.0 Jul 21, 2025
@madolson madolson moved this from Optional for next release candidate to Todo in Valkey 9.0 Aug 18, 2025
@madolson madolson moved this from Todo to In Progress in Valkey 9.0 Aug 18, 2025
@rjd15372

rjd15372 commented Sep 1, 2025

Copy link
Copy Markdown
Member

With this change the RDB file that is re-used as the base snapshot of the AOF file, will have it's aof-base info field set to 0. I don't see any code logic in the core that depends on that value except for a logging message, and therefore it should be OK, but what if some external tool depends on the aof-base field value for its own logic? Is that a problem?

@zuiderkwast

Copy link
Copy Markdown
Contributor

This looks like a good idea, but unfortunately it's too late now to make it to 9.0 so I'm punting it to 9.1.

@soloestoy do you plan to follow up on this review?

@zuiderkwast zuiderkwast added the needs-review Maintainer attention label Sep 23, 2025
murphyjacob4 added a commit that referenced this pull request Oct 7, 2025
…ration (#2635)

# Problem

In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.

# Design

The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import, and await for a future
`SYNCSLOTS FINISH` message that tells them the import is
successful/failed.

## Success Case

```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward snapshot ~~~~~~~~>|
       |----------- SYNCSLOTS SNAPSHOT-EOF ------------>|                                 |
       |                                                |                                 |
       |<----------- SYNCSLOTS REQUEST-PAUSE -----------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward changes ~~~~~~~~~>|
       |--------------- SYNCSLOTS PAUSED -------------->|                                 |
       |                                                |                                 |
       |<---------- SYNCSLOTS REQUEST-FAILOVER ---------|                                 |
       |                                                |                                 |
       |---------- SYNCSLOTS FAILOVER-GRANTED --------->|                                 |
       |                                                |                                 |
       |                                            (performs takeover &                  |
       |                                             propagates topology)                 |
       |                                                |                                 |
       |                                                |------- SYNCSLOTS FINISH ------->|
 (finds out about topology                              |                                 |
  change & marks migration done)                        |                                 |
       |                                                |                                 |
```

## Failure Case
```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
     ...                                              ...                               ...
       |                                                |                                 |
       |                                             <FAILURE>                            |
       |                                                |                                 |
       |                                      (performs cleanup)                          |
       |                                                | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
       |                                                |                                 |
       |                                                | ------ SYNCSLOTS FINISH ------->|
       |                                                |                                 |
```

## Full Sync, Partial Sync, and RDB

In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.

It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.

## AOF

Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.

However, considering there is work to change this (#59 #1901) this
design doesn't make any assumptions about this.

We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.

Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.

## Promotion & Demotion

Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS
FINISH`.

# Other Options Considered

I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.

This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
murphyjacob4 added a commit to murphyjacob4/valkey that referenced this pull request Oct 7, 2025
…ration (valkey-io#2635)

# Problem

In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.

# Design

The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import, and await for a future
`SYNCSLOTS FINISH` message that tells them the import is
successful/failed.

## Success Case

```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward snapshot ~~~~~~~~>|
       |----------- SYNCSLOTS SNAPSHOT-EOF ------------>|                                 |
       |                                                |                                 |
       |<----------- SYNCSLOTS REQUEST-PAUSE -----------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward changes ~~~~~~~~~>|
       |--------------- SYNCSLOTS PAUSED -------------->|                                 |
       |                                                |                                 |
       |<---------- SYNCSLOTS REQUEST-FAILOVER ---------|                                 |
       |                                                |                                 |
       |---------- SYNCSLOTS FAILOVER-GRANTED --------->|                                 |
       |                                                |                                 |
       |                                            (performs takeover &                  |
       |                                             propagates topology)                 |
       |                                                |                                 |
       |                                                |------- SYNCSLOTS FINISH ------->|
 (finds out about topology                              |                                 |
  change & marks migration done)                        |                                 |
       |                                                |                                 |
```

## Failure Case
```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
     ...                                              ...                               ...
       |                                                |                                 |
       |                                             <FAILURE>                            |
       |                                                |                                 |
       |                                      (performs cleanup)                          |
       |                                                | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
       |                                                |                                 |
       |                                                | ------ SYNCSLOTS FINISH ------->|
       |                                                |                                 |
```

## Full Sync, Partial Sync, and RDB

In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.

It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.

## AOF

Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.

However, considering there is work to change this (valkey-io#59 valkey-io#1901) this
design doesn't make any assumptions about this.

We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.

Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.

## Promotion & Demotion

Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS
FINISH`.

# Other Options Considered

I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.

This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
murphyjacob4 added a commit to murphyjacob4/valkey that referenced this pull request Oct 7, 2025
…ration (valkey-io#2635)

# Problem

In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.

# Design

The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import, and await for a future
`SYNCSLOTS FINISH` message that tells them the import is
successful/failed.

## Success Case

```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward snapshot ~~~~~~~~>|
       |----------- SYNCSLOTS SNAPSHOT-EOF ------------>|                                 |
       |                                                |                                 |
       |<----------- SYNCSLOTS REQUEST-PAUSE -----------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward changes ~~~~~~~~~>|
       |--------------- SYNCSLOTS PAUSED -------------->|                                 |
       |                                                |                                 |
       |<---------- SYNCSLOTS REQUEST-FAILOVER ---------|                                 |
       |                                                |                                 |
       |---------- SYNCSLOTS FAILOVER-GRANTED --------->|                                 |
       |                                                |                                 |
       |                                            (performs takeover &                  |
       |                                             propagates topology)                 |
       |                                                |                                 |
       |                                                |------- SYNCSLOTS FINISH ------->|
 (finds out about topology                              |                                 |
  change & marks migration done)                        |                                 |
       |                                                |                                 |
```

## Failure Case
```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
     ...                                              ...                               ...
       |                                                |                                 |
       |                                             <FAILURE>                            |
       |                                                |                                 |
       |                                      (performs cleanup)                          |
       |                                                | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
       |                                                |                                 |
       |                                                | ------ SYNCSLOTS FINISH ------->|
       |                                                |                                 |
```

## Full Sync, Partial Sync, and RDB

In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.

It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.

## AOF

Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.

However, considering there is work to change this (valkey-io#59 valkey-io#1901) this
design doesn't make any assumptions about this.

We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.

Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.

## Promotion & Demotion

Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS
FINISH`.

# Other Options Considered

I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.

This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
murphyjacob4 added a commit that referenced this pull request Oct 8, 2025
…ration (#2635)

# Problem

In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.

# Design

The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import, and await for a future
`SYNCSLOTS FINISH` message that tells them the import is
successful/failed.

## Success Case

```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward snapshot ~~~~~~~~>|
       |----------- SYNCSLOTS SNAPSHOT-EOF ------------>|                                 |
       |                                                |                                 |
       |<----------- SYNCSLOTS REQUEST-PAUSE -----------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward changes ~~~~~~~~~>|
       |--------------- SYNCSLOTS PAUSED -------------->|                                 |
       |                                                |                                 |
       |<---------- SYNCSLOTS REQUEST-FAILOVER ---------|                                 |
       |                                                |                                 |
       |---------- SYNCSLOTS FAILOVER-GRANTED --------->|                                 |
       |                                                |                                 |
       |                                            (performs takeover &                  |
       |                                             propagates topology)                 |
       |                                                |                                 |
       |                                                |------- SYNCSLOTS FINISH ------->|
 (finds out about topology                              |                                 |
  change & marks migration done)                        |                                 |
       |                                                |                                 |
```

## Failure Case
```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
     ...                                              ...                               ...
       |                                                |                                 |
       |                                             <FAILURE>                            |
       |                                                |                                 |
       |                                      (performs cleanup)                          |
       |                                                | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
       |                                                |                                 |
       |                                                | ------ SYNCSLOTS FINISH ------->|
       |                                                |                                 |
```

## Full Sync, Partial Sync, and RDB

In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.

It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.

## AOF

Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.

However, considering there is work to change this (#59 #1901) this
design doesn't make any assumptions about this.

We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.

Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.

## Promotion & Demotion

Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS
FINISH`.

# Other Options Considered

I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.

This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Oct 14, 2025
…ration (valkey-io#2635)

# Problem

In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.

# Design

The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import, and await for a future
`SYNCSLOTS FINISH` message that tells them the import is
successful/failed.

## Success Case

```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward snapshot ~~~~~~~~>|
       |----------- SYNCSLOTS SNAPSHOT-EOF ------------>|                                 |
       |                                                |                                 |
       |<----------- SYNCSLOTS REQUEST-PAUSE -----------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward changes ~~~~~~~~~>|
       |--------------- SYNCSLOTS PAUSED -------------->|                                 |
       |                                                |                                 |
       |<---------- SYNCSLOTS REQUEST-FAILOVER ---------|                                 |
       |                                                |                                 |
       |---------- SYNCSLOTS FAILOVER-GRANTED --------->|                                 |
       |                                                |                                 |
       |                                            (performs takeover &                  |
       |                                             propagates topology)                 |
       |                                                |                                 |
       |                                                |------- SYNCSLOTS FINISH ------->|
 (finds out about topology                              |                                 |
  change & marks migration done)                        |                                 |
       |                                                |                                 |
```

## Failure Case
```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
     ...                                              ...                               ...
       |                                                |                                 |
       |                                             <FAILURE>                            |
       |                                                |                                 |
       |                                      (performs cleanup)                          |
       |                                                | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
       |                                                |                                 |
       |                                                | ------ SYNCSLOTS FINISH ------->|
       |                                                |                                 |
```

## Full Sync, Partial Sync, and RDB

In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.

It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.

## AOF

Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.

However, considering there is work to change this (valkey-io#59 valkey-io#1901) this
design doesn't make any assumptions about this.

We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.

Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.

## Promotion & Demotion

Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS
FINISH`.

# Other Options Considered

I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.

This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
@zuiderkwast zuiderkwast added the stalled No activity for a long time label Dec 10, 2025
@madolson

Copy link
Copy Markdown
Member

@soloestoy @RayaCoo Any update on the progress of this?

@RayaCoo RayaCoo force-pushed the use-rdb-start-appendonly branch from 18f1bf9 to 2ecea1a Compare February 24, 2026 07:27
Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
@RayaCoo RayaCoo force-pushed the use-rdb-start-appendonly branch from 2ecea1a to ae1941d Compare February 24, 2026 07:41
@RayaCoo

RayaCoo commented Feb 24, 2026

Copy link
Copy Markdown
Contributor Author

With this change the RDB file that is re-used as the base snapshot of the AOF file, will have it's aof-base info field set to 0. I don't see any code logic in the core that depends on that value except for a logging message, and therefore it should be OK, but what if some external tool depends on the aof-base field value for its own logic? Is that a problem?

Good point.

In current core logic, aof-base is informational only (used for logging during RDB load),
and correctness decisions are driven by the AOF manifest (base/incr/history), not by this AUX field.

So the reuse path is safe for core behavior. If any external tool currently depends on aof-base,
I agree we should treat that as a compatibility concern and either:

  1. update that tool to rely on manifest metadata, or
  2. add a follow-up compatibility patch.

I'd prefer to keep it simple right now. What do you think?

@RayaCoo

RayaCoo commented Feb 24, 2026

Copy link
Copy Markdown
Contributor Author

@soloestoy @RayaCoo Any update on the progress of this?

Sorry for the long delay! I've rebased the PR onto the latest unstable branch and reworked the implementation to adapt to the upstream refactoring.

@soloestoy @arthurkiller could you please take another look?

Comment thread src/replication.c Outdated
Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
@RayaCoo RayaCoo force-pushed the use-rdb-start-appendonly branch from 7f4874e to 5370cde Compare March 2, 2026 08:25

@soloestoy soloestoy left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM, do you like to take a look? @zuiderkwast

Comment thread src/aof.c
@zuiderkwast zuiderkwast removed the stalled No activity for a long time label Mar 11, 2026
@zuiderkwast

Copy link
Copy Markdown
Contributor

mostly LGTM, do you like to take a look? @zuiderkwast

Not really. :) I have looked briefly but I don't know the AOF very well. I want to merge this so it can be included in 9.1.

@zuiderkwast zuiderkwast changed the title reuse rdb file to restart AOF after primary-replica full sync Reuse RDB file to restart AOF on replica after disk-based full sync Mar 11, 2026
@zuiderkwast zuiderkwast changed the title Reuse RDB file to restart AOF on replica after disk-based full sync Reuse RDB file as AOF preamble on replica after disk-based full sync Mar 11, 2026
@zuiderkwast zuiderkwast merged commit 4cbec4a into valkey-io:unstable Mar 11, 2026
56 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Valkey 9.1 Mar 11, 2026
@zuiderkwast zuiderkwast added release-notes This issue should get a line item in the release notes and removed needs-review Maintainer attention labels Mar 11, 2026
@RayaCoo RayaCoo deleted the use-rdb-start-appendonly branch March 12, 2026 03:34
madolson added a commit to madolson/valkey that referenced this pull request Mar 15, 2026
- Fix release date to Mon Mar 18 2026
- Fix typos: duplicate 'load', 'keyes' -> 'keys', duplicate 'INFO'
- Remove reverted contributor (arshidkv12, valkey-io#3137)
- Add 7 new release-notes entries from upstream/unstable merge:
  CLUSTERSCAN (valkey-io#2934), MSETEX (valkey-io#3121), availability-zone (valkey-io#3156),
  stream range optimization (valkey-io#3002), RDB as AOF preamble (valkey-io#1901),
  unsigned 64-bit module config (valkey-io#1546), fast_float -> ffc (valkey-io#3329)

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
JimB123 pushed a commit that referenced this pull request Mar 19, 2026
…1901)

When a replica finishes a **disk-based** full sync and both `appendonly
yes` and `aof-use-rdb-preamble yes` are enabled, this PR reuses the
received `dump.rdb` as the new AOF base file, instead of triggering
`BGREWRITEAOF`.

This avoids generating an almost identical base snapshot and removes
redundant CPU/IO work.

Before:

1. Replica receives RDB via disk-based full sync
2. Replica loads RDB into memory
3. Replica may delete synced RDB (`rdb-del-sync-files`)
4. `BGREWRITEAOF` runs and generates a new base snapshot
5. New snapshot becomes AOF base

After:

1. Replica receives RDB via disk-based full sync
2. Replica loads RDB into memory
3. Synced RDB is renamed to the new AOF base file
4. A new incremental AOF is created
5. No `BGREWRITEAOF` is needed

Tests:

- [x] Disk-based sync + preamble on: RDB reused as AOF base
- [x] New writes after sync: incr AOF works correctly
- [x] Replica restart: reused AOF loads correctly
- [x] Disk-based sync + preamble off: fallback to `BGREWRITEAOF`
- [x] Diskless sync + preamble on: fallback to `BGREWRITEAOF`
- [x] Diskless sync with stale local `dump.rdb`: stale file is not
       reused

---------

Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request performance release-notes This issue should get a line item in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants