Prevent exposure of importing keys on replicas during atomic slot migration#2635
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #2635 +/- ##
============================================
+ Coverage 72.18% 72.26% +0.07%
============================================
Files 128 128
Lines 71037 71237 +200
============================================
+ Hits 51277 51477 +200
Misses 19760 19760
🚀 New features to boost your workflow:
|
zuiderkwast
left a comment
There was a problem hiding this comment.
LGTM, but I'm not fully familiar with the ASM code in general. I didn't review the main ASM PR carefully enough.
|
@valkey-io/valkey-committers Another pair of eyes on this wouldn't hurt. If anyone has time. It's needed for 9.0.0 to be released ASAP. |
madolson
left a comment
There was a problem hiding this comment.
I'm mostly aligned with the high level details. I'd like us to trim down the data in the RDB. The implementation seems pretty clean otherwise.
|
Hey @madolson, thanks for the review, please take another look! |
2f5c997 to
26f9fb0
Compare
madolson
left a comment
There was a problem hiding this comment.
I think everything else looks fine outside some of the questions around the RDB format.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…rom aux Signed-off-by: Jacob Murphy <jkmurphy@google.com>
6625e3a to
7324270
Compare
|
@madolson we now don't track source node ID on the replica, and we use an RDB opcode to save the slot imports. PTAL! |
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…ration (valkey-io#2635) # Problem In the current slot migration design, replicas are completely unaware of the slot migration. Because of this, they do not know to hide importing keys, which results in exposure of these keys to commands like KEYS, SCAN, RANDOMKEY, and DBSIZE. # Design The main part of the design is that we will now listen for and process the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS ESTABLISH` command is received from the primary, we begin tracking a new slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. Replicas use this state to track the import, and await for a future `SYNCSLOTS FINISH` message that tells them the import is successful/failed. ## Success Case ``` Source Target Target Replica | | | |------------ SYNCSLOTS ESTABLISH -------------->| | | |----- SYNCSLOTS ESTABLISH ------>| |<-------------------- +OK ----------------------| | | | | |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>| | | |~~~~~~ forward snapshot ~~~~~~~~>| |----------- SYNCSLOTS SNAPSHOT-EOF ------------>| | | | | |<----------- SYNCSLOTS REQUEST-PAUSE -----------| | | | | |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>| | | |~~~~~~ forward changes ~~~~~~~~~>| |--------------- SYNCSLOTS PAUSED -------------->| | | | | |<---------- SYNCSLOTS REQUEST-FAILOVER ---------| | | | | |---------- SYNCSLOTS FAILOVER-GRANTED --------->| | | | | | (performs takeover & | | propagates topology) | | | | | |------- SYNCSLOTS FINISH ------->| (finds out about topology | | change & marks migration done) | | | | | ``` ## Failure Case ``` Source Target Target Replica | | | |------------ SYNCSLOTS ESTABLISH -------------->| | | |----- SYNCSLOTS ESTABLISH ------>| |<-------------------- +OK ----------------------| | ... ... ... | | | | <FAILURE> | | | | | (performs cleanup) | | | ~~~~~~ UNLINK <key> ... ~~~~~~~>| | | | | | ------ SYNCSLOTS FINISH ------->| | | | ``` ## Full Sync, Partial Sync, and RDB In order to ensure replicas that resync during the import are still aware of the import, the slot import is serialized to a new `cluster-slot-imports` aux field. The encoding includes the job name, the source node name, and the slot ranges being imported. Upon loading an RDB with the `cluster-slot-imports` aux field, replicas will add a new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. It's important to note that a previously saved RDB file can be used as the basis for partial sync with a primary. Because of this, whenever we load an RDB file with the `cluster-slot-imports` aux field, even from disk, we will still add a new migration to track the import. If after loading the RDB, the Valkey node is a primary, it will cancel the slot migration. Having this tracking state loaded on primaries will ensure that replicas partial syncing to a restarted primary still get their `SYNCSLOTS FINISH` message in the replication stream. ## AOF Since AOF cannot be used as the basis for a partial sync, we don't necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH` commands to the AOF. However, considering there is work to change this (valkey-io#59 valkey-io#1901) this design doesn't make any assumptions about this. We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and ensure that they can be properly replayed on AOF load to get to the right state. Similar to RDB, if there are any pending "ESTABLISH" commands that don't have a "FINISH" afterwards upon becoming primary, we will make sure to fail those in `verifyClusterConfigWithData`. Additionally, there was a bug in the existing slot migration where slot import clients were not having their commands persisted to AOF. This has been fixed by ensuring we still propagate to AOF even for slot import clients. ## Promotion & Demotion Since the primary is solely responsible for cleaning up unowned slots, primaries that are demoted will not clean up previously active slot imports. The promoted replica will be responsible for both cleaning up the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS FINISH`. # Other Options Considered I also considered tracking "dirty" slots rather than using the slot import state machine. In this setup, primaries and replicas would simply mark each slot's hashtable in the kvstore as dirty when something is written to it and we do not currently own that slot. This approach is simpler, but has a problem in that modules loaded on the replica would still not get slot migration start/end notifications. If the modules on the replica do not get such notifications, they will not be able to properly contain these dirty keys during slot migration events. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…ration (valkey-io#2635) # Problem In the current slot migration design, replicas are completely unaware of the slot migration. Because of this, they do not know to hide importing keys, which results in exposure of these keys to commands like KEYS, SCAN, RANDOMKEY, and DBSIZE. # Design The main part of the design is that we will now listen for and process the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS ESTABLISH` command is received from the primary, we begin tracking a new slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. Replicas use this state to track the import, and await for a future `SYNCSLOTS FINISH` message that tells them the import is successful/failed. ## Success Case ``` Source Target Target Replica | | | |------------ SYNCSLOTS ESTABLISH -------------->| | | |----- SYNCSLOTS ESTABLISH ------>| |<-------------------- +OK ----------------------| | | | | |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>| | | |~~~~~~ forward snapshot ~~~~~~~~>| |----------- SYNCSLOTS SNAPSHOT-EOF ------------>| | | | | |<----------- SYNCSLOTS REQUEST-PAUSE -----------| | | | | |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>| | | |~~~~~~ forward changes ~~~~~~~~~>| |--------------- SYNCSLOTS PAUSED -------------->| | | | | |<---------- SYNCSLOTS REQUEST-FAILOVER ---------| | | | | |---------- SYNCSLOTS FAILOVER-GRANTED --------->| | | | | | (performs takeover & | | propagates topology) | | | | | |------- SYNCSLOTS FINISH ------->| (finds out about topology | | change & marks migration done) | | | | | ``` ## Failure Case ``` Source Target Target Replica | | | |------------ SYNCSLOTS ESTABLISH -------------->| | | |----- SYNCSLOTS ESTABLISH ------>| |<-------------------- +OK ----------------------| | ... ... ... | | | | <FAILURE> | | | | | (performs cleanup) | | | ~~~~~~ UNLINK <key> ... ~~~~~~~>| | | | | | ------ SYNCSLOTS FINISH ------->| | | | ``` ## Full Sync, Partial Sync, and RDB In order to ensure replicas that resync during the import are still aware of the import, the slot import is serialized to a new `cluster-slot-imports` aux field. The encoding includes the job name, the source node name, and the slot ranges being imported. Upon loading an RDB with the `cluster-slot-imports` aux field, replicas will add a new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. It's important to note that a previously saved RDB file can be used as the basis for partial sync with a primary. Because of this, whenever we load an RDB file with the `cluster-slot-imports` aux field, even from disk, we will still add a new migration to track the import. If after loading the RDB, the Valkey node is a primary, it will cancel the slot migration. Having this tracking state loaded on primaries will ensure that replicas partial syncing to a restarted primary still get their `SYNCSLOTS FINISH` message in the replication stream. ## AOF Since AOF cannot be used as the basis for a partial sync, we don't necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH` commands to the AOF. However, considering there is work to change this (valkey-io#59 valkey-io#1901) this design doesn't make any assumptions about this. We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and ensure that they can be properly replayed on AOF load to get to the right state. Similar to RDB, if there are any pending "ESTABLISH" commands that don't have a "FINISH" afterwards upon becoming primary, we will make sure to fail those in `verifyClusterConfigWithData`. Additionally, there was a bug in the existing slot migration where slot import clients were not having their commands persisted to AOF. This has been fixed by ensuring we still propagate to AOF even for slot import clients. ## Promotion & Demotion Since the primary is solely responsible for cleaning up unowned slots, primaries that are demoted will not clean up previously active slot imports. The promoted replica will be responsible for both cleaning up the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS FINISH`. # Other Options Considered I also considered tracking "dirty" slots rather than using the slot import state machine. In this setup, primaries and replicas would simply mark each slot's hashtable in the kvstore as dirty when something is written to it and we do not currently own that slot. This approach is simpler, but has a problem in that modules loaded on the replica would still not get slot migration start/end notifications. If the modules on the replica do not get such notifications, they will not be able to properly contain these dirty keys during slot migration events. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…ration (#2635) # Problem In the current slot migration design, replicas are completely unaware of the slot migration. Because of this, they do not know to hide importing keys, which results in exposure of these keys to commands like KEYS, SCAN, RANDOMKEY, and DBSIZE. # Design The main part of the design is that we will now listen for and process the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS ESTABLISH` command is received from the primary, we begin tracking a new slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. Replicas use this state to track the import, and await for a future `SYNCSLOTS FINISH` message that tells them the import is successful/failed. ## Success Case ``` Source Target Target Replica | | | |------------ SYNCSLOTS ESTABLISH -------------->| | | |----- SYNCSLOTS ESTABLISH ------>| |<-------------------- +OK ----------------------| | | | | |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>| | | |~~~~~~ forward snapshot ~~~~~~~~>| |----------- SYNCSLOTS SNAPSHOT-EOF ------------>| | | | | |<----------- SYNCSLOTS REQUEST-PAUSE -----------| | | | | |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>| | | |~~~~~~ forward changes ~~~~~~~~~>| |--------------- SYNCSLOTS PAUSED -------------->| | | | | |<---------- SYNCSLOTS REQUEST-FAILOVER ---------| | | | | |---------- SYNCSLOTS FAILOVER-GRANTED --------->| | | | | | (performs takeover & | | propagates topology) | | | | | |------- SYNCSLOTS FINISH ------->| (finds out about topology | | change & marks migration done) | | | | | ``` ## Failure Case ``` Source Target Target Replica | | | |------------ SYNCSLOTS ESTABLISH -------------->| | | |----- SYNCSLOTS ESTABLISH ------>| |<-------------------- +OK ----------------------| | ... ... ... | | | | <FAILURE> | | | | | (performs cleanup) | | | ~~~~~~ UNLINK <key> ... ~~~~~~~>| | | | | | ------ SYNCSLOTS FINISH ------->| | | | ``` ## Full Sync, Partial Sync, and RDB In order to ensure replicas that resync during the import are still aware of the import, the slot import is serialized to a new `cluster-slot-imports` aux field. The encoding includes the job name, the source node name, and the slot ranges being imported. Upon loading an RDB with the `cluster-slot-imports` aux field, replicas will add a new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. It's important to note that a previously saved RDB file can be used as the basis for partial sync with a primary. Because of this, whenever we load an RDB file with the `cluster-slot-imports` aux field, even from disk, we will still add a new migration to track the import. If after loading the RDB, the Valkey node is a primary, it will cancel the slot migration. Having this tracking state loaded on primaries will ensure that replicas partial syncing to a restarted primary still get their `SYNCSLOTS FINISH` message in the replication stream. ## AOF Since AOF cannot be used as the basis for a partial sync, we don't necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH` commands to the AOF. However, considering there is work to change this (#59 #1901) this design doesn't make any assumptions about this. We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and ensure that they can be properly replayed on AOF load to get to the right state. Similar to RDB, if there are any pending "ESTABLISH" commands that don't have a "FINISH" afterwards upon becoming primary, we will make sure to fail those in `verifyClusterConfigWithData`. Additionally, there was a bug in the existing slot migration where slot import clients were not having their commands persisted to AOF. This has been fixed by ensuring we still propagate to AOF even for slot import clients. ## Promotion & Demotion Since the primary is solely responsible for cleaning up unowned slots, primaries that are demoted will not clean up previously active slot imports. The promoted replica will be responsible for both cleaning up the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS FINISH`. # Other Options Considered I also considered tracking "dirty" slots rather than using the slot import state machine. In this setup, primaries and replicas would simply mark each slot's hashtable in the kvstore as dirty when something is written to it and we do not currently own that slot. This approach is simpler, but has a problem in that modules loaded on the replica would still not get slot migration start/end notifications. If the modules on the replica do not get such notifications, they will not be able to properly contain these dirty keys during slot migration events. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…ration (valkey-io#2635) # Problem In the current slot migration design, replicas are completely unaware of the slot migration. Because of this, they do not know to hide importing keys, which results in exposure of these keys to commands like KEYS, SCAN, RANDOMKEY, and DBSIZE. # Design The main part of the design is that we will now listen for and process the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS ESTABLISH` command is received from the primary, we begin tracking a new slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. Replicas use this state to track the import, and await for a future `SYNCSLOTS FINISH` message that tells them the import is successful/failed. ## Success Case ``` Source Target Target Replica | | | |------------ SYNCSLOTS ESTABLISH -------------->| | | |----- SYNCSLOTS ESTABLISH ------>| |<-------------------- +OK ----------------------| | | | | |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>| | | |~~~~~~ forward snapshot ~~~~~~~~>| |----------- SYNCSLOTS SNAPSHOT-EOF ------------>| | | | | |<----------- SYNCSLOTS REQUEST-PAUSE -----------| | | | | |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>| | | |~~~~~~ forward changes ~~~~~~~~~>| |--------------- SYNCSLOTS PAUSED -------------->| | | | | |<---------- SYNCSLOTS REQUEST-FAILOVER ---------| | | | | |---------- SYNCSLOTS FAILOVER-GRANTED --------->| | | | | | (performs takeover & | | propagates topology) | | | | | |------- SYNCSLOTS FINISH ------->| (finds out about topology | | change & marks migration done) | | | | | ``` ## Failure Case ``` Source Target Target Replica | | | |------------ SYNCSLOTS ESTABLISH -------------->| | | |----- SYNCSLOTS ESTABLISH ------>| |<-------------------- +OK ----------------------| | ... ... ... | | | | <FAILURE> | | | | | (performs cleanup) | | | ~~~~~~ UNLINK <key> ... ~~~~~~~>| | | | | | ------ SYNCSLOTS FINISH ------->| | | | ``` ## Full Sync, Partial Sync, and RDB In order to ensure replicas that resync during the import are still aware of the import, the slot import is serialized to a new `cluster-slot-imports` aux field. The encoding includes the job name, the source node name, and the slot ranges being imported. Upon loading an RDB with the `cluster-slot-imports` aux field, replicas will add a new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. It's important to note that a previously saved RDB file can be used as the basis for partial sync with a primary. Because of this, whenever we load an RDB file with the `cluster-slot-imports` aux field, even from disk, we will still add a new migration to track the import. If after loading the RDB, the Valkey node is a primary, it will cancel the slot migration. Having this tracking state loaded on primaries will ensure that replicas partial syncing to a restarted primary still get their `SYNCSLOTS FINISH` message in the replication stream. ## AOF Since AOF cannot be used as the basis for a partial sync, we don't necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH` commands to the AOF. However, considering there is work to change this (valkey-io#59 valkey-io#1901) this design doesn't make any assumptions about this. We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and ensure that they can be properly replayed on AOF load to get to the right state. Similar to RDB, if there are any pending "ESTABLISH" commands that don't have a "FINISH" afterwards upon becoming primary, we will make sure to fail those in `verifyClusterConfigWithData`. Additionally, there was a bug in the existing slot migration where slot import clients were not having their commands persisted to AOF. This has been fixed by ensuring we still propagate to AOF even for slot import clients. ## Promotion & Demotion Since the primary is solely responsible for cleaning up unowned slots, primaries that are demoted will not clean up previously active slot imports. The promoted replica will be responsible for both cleaning up the slot (`verifyClusterConifgWithData`) and sending a `SYNCSLOTS FINISH`. # Other Options Considered I also considered tracking "dirty" slots rather than using the slot import state machine. In this setup, primaries and replicas would simply mark each slot's hashtable in the kvstore as dirty when something is written to it and we do not currently own that slot. This approach is simpler, but has a problem in that modules loaded on the replica would still not get slot migration start/end notifications. If the modules on the replica do not get such notifications, they will not be able to properly contain these dirty keys during slot migration events. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…on (#2749) When working on #2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…on (valkey-io#2749) When working on valkey-io#2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com> (cherry picked from commit 2a914aa)
…on (valkey-io#2749) When working on valkey-io#2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com> (cherry picked from commit 2a914aa) Signed-off-by: cherukum-amazon <cherukum@amazon.com>
…on (valkey-io#2749) When working on valkey-io#2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com> (cherry picked from commit 2a914aa) Signed-off-by: cherukum-amazon <cherukum@amazon.com>
…on (valkey-io#2749) When working on valkey-io#2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…on (#2749) When working on #2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com> (cherry picked from commit 2a914aa) Signed-off-by: cherukum-amazon <cherukum@amazon.com>
…on (valkey-io#2749) When working on valkey-io#2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…on (valkey-io#2749) When working on valkey-io#2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…on (valkey-io#2749) When working on valkey-io#2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <jkmurphy@google.com> Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Problem
In the current slot migration design, replicas are completely unaware of the slot migration. Because of this, they do not know to hide importing keys, which results in exposure of these keys to commands like KEYS, SCAN, RANDOMKEY, and DBSIZE.
Design
The main part of the design is that we will now listen for and process the
SYNCSLOTS ESTABLISHcommand on the replica. When aSYNCSLOTS ESTABLISHcommand is received from the primary, we begin tracking a new slot import in a specialSLOT_IMPORT_OCCURRING_ON_PRIMARYstate. Replicas use this state to track the import, and await for a futureSYNCSLOTS FINISHmessage that tells them the import is successful/failed.Success Case
Failure Case
Full Sync, Partial Sync, and RDB
In order to ensure replicas that resync during the import are still aware of the import, the slot import is serialized to a new
cluster-slot-importsaux field. The encoding includes the job name, the source node name, and the slot ranges being imported. Upon loading an RDB with thecluster-slot-importsaux field, replicas will add a new migration in theSLOT_IMPORT_OCCURRING_ON_PRIMARYstate.It's important to note that a previously saved RDB file can be used as the basis for partial sync with a primary. Because of this, whenever we load an RDB file with the
cluster-slot-importsaux field, even from disk, we will still add a new migration to track the import. If after loading the RDB, the Valkey node is a primary, it will cancel the slot migration. Having this tracking state loaded on primaries will ensure that replicas partial syncing to a restarted primary still get theirSYNCSLOTS FINISHmessage in the replication stream.AOF
Since AOF cannot be used as the basis for a partial sync, we don't necessarily need to persist the
SYNCSLOTS ESTABLISHandFINISHcommands to the AOF.However, considering there is work to change this (#59 #1901) this design doesn't make any assumptions about this.
We will propagate the
ESTABLISHandFINISHcommands to the AOF, and ensure that they can be properly replayed on AOF load to get to the right state. Similar to RDB, if there are any pending "ESTABLISH" commands that don't have a "FINISH" afterwards upon becoming primary, we will make sure to fail those inverifyClusterConfigWithData.Additionally, there was a bug in the existing slot migration where slot import clients were not having their commands persisted to AOF. This has been fixed by ensuring we still propagate to AOF even for slot import clients.
Promotion & Demotion
Since the primary is solely responsible for cleaning up unowned slots, primaries that are demoted will not clean up previously active slot imports. The promoted replica will be responsible for both cleaning up the slot (
verifyClusterConifgWithData) and sending aSYNCSLOTS FINISH.Other Options Considered
I also considered tracking "dirty" slots rather than using the slot import state machine. In this setup, primaries and replicas would simply mark each slot's hashtable in the kvstore as dirty when something is written to it and we do not currently own that slot.
This approach is simpler, but has a problem in that modules loaded on the replica would still not get slot migration start/end notifications. If the modules on the replica do not get such notifications, they will not be able to properly contain these dirty keys during slot migration events.