Skip to content

Fix: prevent NULL dereference crash in connectSlotExportJob when target node disappears#3596

Merged
murphyjacob4 merged 1 commit into
valkey-io:unstablefrom
chenshi5012:fix/connect-slot-export-job-null-deref
Apr 30, 2026
Merged

Fix: prevent NULL dereference crash in connectSlotExportJob when target node disappears#3596
murphyjacob4 merged 1 commit into
valkey-io:unstablefrom
chenshi5012:fix/connect-slot-export-job-null-deref

Conversation

@chenshi5012

Copy link
Copy Markdown
Contributor

Summary

This PR fixes a NULL pointer dereference (SIGSEGV) in connectSlotExportJob()
(src/cluster_migrateslots.c) that can crash a Valkey cluster node, causing a
denial-of-service condition.

Root Cause

When CLUSTER MIGRATESLOTS is issued, a migration job is created with state
SLOT_EXPORT_CONNECTING. On the next clusterCron() tick,
proceedWithSlotMigration() calls connectSlotExportJob(), which looks up the
target node via clusterLookupNode().

clusterLookupNode() can legitimately return NULL — for example, if the
target node is removed from the cluster (e.g. via CLUSTER FORGET) between the
time the migration job is created and the time the cron fires. This is a
realistic race condition in any cluster topology change scenario.

The return value was never checked, so the subsequent call to
getNodeDefaultReplicationPort(n) immediately dereferences the NULL pointer,
crashing the process:

// Before fix — vulnerable
clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN);
int port = getNodeDefaultReplicationPort(n);  // SIGSEGV if n == NULL
serverLog(..., n->ip, port);                  // second dereference

…appears

When a CLUSTER MIGRATESLOTS job is created with state
SLOT_EXPORT_CONNECTING, the subsequent clusterCron call invokes
connectSlotExportJob(). This function calls clusterLookupNode() to
find the target node but never checks whether the return value is NULL.

If the target node is removed from the cluster (e.g. via CLUSTER FORGET)
between the time the migration job is created and the time the cron
fires, clusterLookupNode() returns NULL. The subsequent call to
getNodeDefaultReplicationPort(n) then dereferences the NULL pointer,
crashing the Valkey process with SIGSEGV (DoS).

Fix 1: Add an explicit NULL check in connectSlotExportJob(). When the
target node is not found, log a WARNING and return C_ERR so the caller
can mark the job as failed gracefully.

Fix 2: Guard the connGetLastError(job->conn) call in
proceedWithSlotMigration() against a NULL job->conn. When
connectSlotExportJob() fails before creating the connection object,
job->conn remains NULL; calling connGetLastError(NULL) would be a
second NULL dereference in the error-handling path.

Reported-by: Manus AI security audit
Tested-by: Reproduced crash with patched binary + ASAN, confirmed fix
           prevents crash while preserving graceful error handling.
Signed-off-by: chenshi5012 <chenshi5012@163.com>
@chenshi5012 chenshi5012 force-pushed the fix/connect-slot-export-job-null-deref branch from 53d2991 to bca3cbd Compare April 30, 2026 08:50

@murphyjacob4 murphyjacob4 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I guess also we would have the same issue after starting the migration:

  1. Start migration
  2. On source: CLUSTER FORGET target
  3. On source: Migration does not stop
  4. On target: claim slots
  5. On source: we never hear the slot claim (target is blocklisted by FORGET)

So the source would stay paused for a long time and eventually fail the migration. But the migration link is kept up

We should probably just terminate the slot export when the node we are sending it to is forgotten.

But this check makes sense on its own as well

@enjoy-binbin enjoy-binbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

@chenshi5012

Copy link
Copy Markdown
Contributor Author

@enjoy-binbin @murphyjacob4

The test failure (Import slot range with multiple slots) is a pre-existing flaky test
unrelated to this fix.

Evidence:

  • The failing test is wait_for_migration timing out after 10,000ms — the migration
    was started successfully but didn't complete in time due to CI resource contention.
  • The same test passes in 377ms on the upstream unstable branch CI run
    (run id 25147464160).
  • Our new NULL-check log message ("target node not found") does not appear anywhere
    in the CI log, confirming our code path was never triggered.
  • The CI was running 10+ heavy cluster tests in parallel
    (slave-stop-cond, resharding, replica-migration-slow, etc.)
    at the same time, causing significant resource contention.

Please re-run the failed job to confirm.

@codecov

codecov Bot commented Apr 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.61%. Comparing base (5b7ac66) to head (bca3cbd).
⚠️ Report is 4 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_migrateslots.c 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3596      +/-   ##
============================================
- Coverage     76.66%   76.61%   -0.05%     
============================================
  Files           160      160              
  Lines         80472    80476       +4     
============================================
- Hits          61690    61654      -36     
- Misses        18782    18822      +40     
Files with missing lines Coverage Δ
src/cluster_migrateslots.c 91.75% <50.00%> (-0.52%) ⬇️

... and 18 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@murphyjacob4 murphyjacob4 moved this to To be backported in Valkey 9.0 Apr 30, 2026
@murphyjacob4 murphyjacob4 moved this to To be backported in Valkey 9.1 Apr 30, 2026
@murphyjacob4 murphyjacob4 merged commit cba0510 into valkey-io:unstable Apr 30, 2026
59 of 60 checks passed
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request May 7, 2026
…et node disappears (valkey-io#3596)

### Summary

This PR fixes a NULL pointer dereference (SIGSEGV) in
`connectSlotExportJob()`
(`src/cluster_migrateslots.c`) that can crash a Valkey cluster node,
causing a
denial-of-service condition.

### Root Cause

When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with
state
`SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick,
`proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks
up the
target node via `clusterLookupNode()`.

`clusterLookupNode()` can legitimately return `NULL` — for example, if
the
target node is removed from the cluster (e.g. via `CLUSTER FORGET`)
between the
time the migration job is created and the time the cron fires. This is a
realistic race condition in any cluster topology change scenario.

The return value was **never checked**, so the subsequent call to
`getNodeDefaultReplicationPort(n)` immediately dereferences the NULL
pointer,
crashing the process:

```c
// Before fix — vulnerable
clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN);
int port = getNodeDefaultReplicationPort(n);  // SIGSEGV if n == NULL
serverLog(..., n->ip, port);                  // second dereference

Signed-off-by: chenshi5012 <chenshi5012@163.com>
lucasyonge pushed a commit that referenced this pull request May 11, 2026
…et node disappears (#3596)

### Summary

This PR fixes a NULL pointer dereference (SIGSEGV) in
`connectSlotExportJob()`
(`src/cluster_migrateslots.c`) that can crash a Valkey cluster node,
causing a
denial-of-service condition.

### Root Cause

When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with
state
`SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick,
`proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks
up the
target node via `clusterLookupNode()`.

`clusterLookupNode()` can legitimately return `NULL` — for example, if
the
target node is removed from the cluster (e.g. via `CLUSTER FORGET`)
between the
time the migration job is created and the time the cron fires. This is a
realistic race condition in any cluster topology change scenario.

The return value was **never checked**, so the subsequent call to
`getNodeDefaultReplicationPort(n)` immediately dereferences the NULL
pointer,
crashing the process:

```c
// Before fix — vulnerable
clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN);
int port = getNodeDefaultReplicationPort(n);  // SIGSEGV if n == NULL
serverLog(..., n->ip, port);                  // second dereference

Signed-off-by: chenshi5012 <chenshi5012@163.com>
lucasyonge pushed a commit that referenced this pull request May 12, 2026
…et node disappears (#3596)

### Summary

This PR fixes a NULL pointer dereference (SIGSEGV) in
`connectSlotExportJob()`
(`src/cluster_migrateslots.c`) that can crash a Valkey cluster node,
causing a
denial-of-service condition.

### Root Cause

When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with
state
`SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick,
`proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks
up the
target node via `clusterLookupNode()`.

`clusterLookupNode()` can legitimately return `NULL` — for example, if
the
target node is removed from the cluster (e.g. via `CLUSTER FORGET`)
between the
time the migration job is created and the time the cron fires. This is a
realistic race condition in any cluster topology change scenario.

The return value was **never checked**, so the subsequent call to
`getNodeDefaultReplicationPort(n)` immediately dereferences the NULL
pointer,
crashing the process:

```c
// Before fix — vulnerable
clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN);
int port = getNodeDefaultReplicationPort(n);  // SIGSEGV if n == NULL
serverLog(..., n->ip, port);                  // second dereference

Signed-off-by: chenshi5012 <chenshi5012@163.com>
@sarthakaggarwal97 sarthakaggarwal97 moved this from To be backported to Done in Valkey 9.1 May 16, 2026
valkeyrie-ops Bot pushed a commit that referenced this pull request May 20, 2026
…et node disappears (#3596)

### Summary

This PR fixes a NULL pointer dereference (SIGSEGV) in
`connectSlotExportJob()`
(`src/cluster_migrateslots.c`) that can crash a Valkey cluster node,
causing a
denial-of-service condition.

### Root Cause

When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with
state
`SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick,
`proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks
up the
target node via `clusterLookupNode()`.

`clusterLookupNode()` can legitimately return `NULL` — for example, if
the
target node is removed from the cluster (e.g. via `CLUSTER FORGET`)
between the
time the migration job is created and the time the cron fires. This is a
realistic race condition in any cluster topology change scenario.

The return value was **never checked**, so the subsequent call to
`getNodeDefaultReplicationPort(n)` immediately dereferences the NULL
pointer,
crashing the process:

```c
// Before fix — vulnerable
clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN);
int port = getNodeDefaultReplicationPort(n);  // SIGSEGV if n == NULL
serverLog(..., n->ip, port);                  // second dereference

Signed-off-by: chenshi5012 <chenshi5012@163.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 14, 2026
…et node disappears (#3596)

### Summary

This PR fixes a NULL pointer dereference (SIGSEGV) in
`connectSlotExportJob()`
(`src/cluster_migrateslots.c`) that can crash a Valkey cluster node,
causing a
denial-of-service condition.

### Root Cause

When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with
state
`SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick,
`proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks
up the
target node via `clusterLookupNode()`.

`clusterLookupNode()` can legitimately return `NULL` — for example, if
the
target node is removed from the cluster (e.g. via `CLUSTER FORGET`)
between the
time the migration job is created and the time the cron fires. This is a
realistic race condition in any cluster topology change scenario.

The return value was **never checked**, so the subsequent call to
`getNodeDefaultReplicationPort(n)` immediately dereferences the NULL
pointer,
crashing the process:

```c
// Before fix — vulnerable
clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN);
int port = getNodeDefaultReplicationPort(n);  // SIGSEGV if n == NULL
serverLog(..., n->ip, port);                  // second dereference

Signed-off-by: chenshi5012 <chenshi5012@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: To be backported
Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants