Fix: prevent NULL dereference crash in connectSlotExportJob when target node disappears#3596
Conversation
…appears
When a CLUSTER MIGRATESLOTS job is created with state
SLOT_EXPORT_CONNECTING, the subsequent clusterCron call invokes
connectSlotExportJob(). This function calls clusterLookupNode() to
find the target node but never checks whether the return value is NULL.
If the target node is removed from the cluster (e.g. via CLUSTER FORGET)
between the time the migration job is created and the time the cron
fires, clusterLookupNode() returns NULL. The subsequent call to
getNodeDefaultReplicationPort(n) then dereferences the NULL pointer,
crashing the Valkey process with SIGSEGV (DoS).
Fix 1: Add an explicit NULL check in connectSlotExportJob(). When the
target node is not found, log a WARNING and return C_ERR so the caller
can mark the job as failed gracefully.
Fix 2: Guard the connGetLastError(job->conn) call in
proceedWithSlotMigration() against a NULL job->conn. When
connectSlotExportJob() fails before creating the connection object,
job->conn remains NULL; calling connGetLastError(NULL) would be a
second NULL dereference in the error-handling path.
Reported-by: Manus AI security audit
Tested-by: Reproduced crash with patched binary + ASAN, confirmed fix
prevents crash while preserving graceful error handling.
Signed-off-by: chenshi5012 <chenshi5012@163.com>
53d2991 to
bca3cbd
Compare
murphyjacob4
left a comment
There was a problem hiding this comment.
Good catch! I guess also we would have the same issue after starting the migration:
- Start migration
- On source: CLUSTER FORGET target
- On source: Migration does not stop
- On target: claim slots
- On source: we never hear the slot claim (target is blocklisted by FORGET)
So the source would stay paused for a long time and eventually fail the migration. But the migration link is kept up
We should probably just terminate the slot export when the node we are sending it to is forgotten.
But this check makes sense on its own as well
|
The test failure ( Evidence:
Please re-run the failed job to confirm. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #3596 +/- ##
============================================
- Coverage 76.66% 76.61% -0.05%
============================================
Files 160 160
Lines 80472 80476 +4
============================================
- Hits 61690 61654 -36
- Misses 18782 18822 +40
🚀 New features to boost your workflow:
|
…et node disappears (valkey-io#3596) ### Summary This PR fixes a NULL pointer dereference (SIGSEGV) in `connectSlotExportJob()` (`src/cluster_migrateslots.c`) that can crash a Valkey cluster node, causing a denial-of-service condition. ### Root Cause When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with state `SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick, `proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks up the target node via `clusterLookupNode()`. `clusterLookupNode()` can legitimately return `NULL` — for example, if the target node is removed from the cluster (e.g. via `CLUSTER FORGET`) between the time the migration job is created and the time the cron fires. This is a realistic race condition in any cluster topology change scenario. The return value was **never checked**, so the subsequent call to `getNodeDefaultReplicationPort(n)` immediately dereferences the NULL pointer, crashing the process: ```c // Before fix — vulnerable clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN); int port = getNodeDefaultReplicationPort(n); // SIGSEGV if n == NULL serverLog(..., n->ip, port); // second dereference Signed-off-by: chenshi5012 <chenshi5012@163.com>
…et node disappears (#3596) ### Summary This PR fixes a NULL pointer dereference (SIGSEGV) in `connectSlotExportJob()` (`src/cluster_migrateslots.c`) that can crash a Valkey cluster node, causing a denial-of-service condition. ### Root Cause When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with state `SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick, `proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks up the target node via `clusterLookupNode()`. `clusterLookupNode()` can legitimately return `NULL` — for example, if the target node is removed from the cluster (e.g. via `CLUSTER FORGET`) between the time the migration job is created and the time the cron fires. This is a realistic race condition in any cluster topology change scenario. The return value was **never checked**, so the subsequent call to `getNodeDefaultReplicationPort(n)` immediately dereferences the NULL pointer, crashing the process: ```c // Before fix — vulnerable clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN); int port = getNodeDefaultReplicationPort(n); // SIGSEGV if n == NULL serverLog(..., n->ip, port); // second dereference Signed-off-by: chenshi5012 <chenshi5012@163.com>
…et node disappears (#3596) ### Summary This PR fixes a NULL pointer dereference (SIGSEGV) in `connectSlotExportJob()` (`src/cluster_migrateslots.c`) that can crash a Valkey cluster node, causing a denial-of-service condition. ### Root Cause When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with state `SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick, `proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks up the target node via `clusterLookupNode()`. `clusterLookupNode()` can legitimately return `NULL` — for example, if the target node is removed from the cluster (e.g. via `CLUSTER FORGET`) between the time the migration job is created and the time the cron fires. This is a realistic race condition in any cluster topology change scenario. The return value was **never checked**, so the subsequent call to `getNodeDefaultReplicationPort(n)` immediately dereferences the NULL pointer, crashing the process: ```c // Before fix — vulnerable clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN); int port = getNodeDefaultReplicationPort(n); // SIGSEGV if n == NULL serverLog(..., n->ip, port); // second dereference Signed-off-by: chenshi5012 <chenshi5012@163.com>
…et node disappears (#3596) ### Summary This PR fixes a NULL pointer dereference (SIGSEGV) in `connectSlotExportJob()` (`src/cluster_migrateslots.c`) that can crash a Valkey cluster node, causing a denial-of-service condition. ### Root Cause When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with state `SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick, `proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks up the target node via `clusterLookupNode()`. `clusterLookupNode()` can legitimately return `NULL` — for example, if the target node is removed from the cluster (e.g. via `CLUSTER FORGET`) between the time the migration job is created and the time the cron fires. This is a realistic race condition in any cluster topology change scenario. The return value was **never checked**, so the subsequent call to `getNodeDefaultReplicationPort(n)` immediately dereferences the NULL pointer, crashing the process: ```c // Before fix — vulnerable clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN); int port = getNodeDefaultReplicationPort(n); // SIGSEGV if n == NULL serverLog(..., n->ip, port); // second dereference Signed-off-by: chenshi5012 <chenshi5012@163.com>
…et node disappears (#3596) ### Summary This PR fixes a NULL pointer dereference (SIGSEGV) in `connectSlotExportJob()` (`src/cluster_migrateslots.c`) that can crash a Valkey cluster node, causing a denial-of-service condition. ### Root Cause When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with state `SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick, `proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks up the target node via `clusterLookupNode()`. `clusterLookupNode()` can legitimately return `NULL` — for example, if the target node is removed from the cluster (e.g. via `CLUSTER FORGET`) between the time the migration job is created and the time the cron fires. This is a realistic race condition in any cluster topology change scenario. The return value was **never checked**, so the subsequent call to `getNodeDefaultReplicationPort(n)` immediately dereferences the NULL pointer, crashing the process: ```c // Before fix — vulnerable clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN); int port = getNodeDefaultReplicationPort(n); // SIGSEGV if n == NULL serverLog(..., n->ip, port); // second dereference Signed-off-by: chenshi5012 <chenshi5012@163.com>
Summary
This PR fixes a NULL pointer dereference (SIGSEGV) in
connectSlotExportJob()(
src/cluster_migrateslots.c) that can crash a Valkey cluster node, causing adenial-of-service condition.
Root Cause
When
CLUSTER MIGRATESLOTSis issued, a migration job is created with stateSLOT_EXPORT_CONNECTING. On the nextclusterCron()tick,proceedWithSlotMigration()callsconnectSlotExportJob(), which looks up thetarget node via
clusterLookupNode().clusterLookupNode()can legitimately returnNULL— for example, if thetarget node is removed from the cluster (e.g. via
CLUSTER FORGET) between thetime the migration job is created and the time the cron fires. This is a
realistic race condition in any cluster topology change scenario.
The return value was never checked, so the subsequent call to
getNodeDefaultReplicationPort(n)immediately dereferences the NULL pointer,crashing the process: