Skip to content

Fix potential memory leaks when a writable replica is promoted to primary after direct writes of keys with expiration#2953

Merged
ranshid merged 9 commits into
valkey-io:unstablefrom
li-benson:fix_replciaExpireDict
Jan 28, 2026
Merged

Fix potential memory leaks when a writable replica is promoted to primary after direct writes of keys with expiration#2953
ranshid merged 9 commits into
valkey-io:unstablefrom
li-benson:fix_replciaExpireDict

Conversation

@li-benson

Copy link
Copy Markdown
Contributor

The bug mentioned in this #2952 has been fixed.

@dvkashapov

dvkashapov commented Dec 19, 2025

Copy link
Copy Markdown
Member

Can't we somewhere in replica write logic call rememberreplicaKeyWithExpire()? That way we won't need to manually clean up anything or am I missing something?

Also, can you please signoff commits using git commit --signoff to pass DCO check?

@codecov

codecov Bot commented Dec 19, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.28%. Comparing base (0ee4234) to head (9b55702).
⚠️ Report is 47 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2953      +/-   ##
============================================
- Coverage     74.29%   74.28%   -0.01%     
============================================
  Files           129      129              
  Lines         70974    70977       +3     
============================================
- Hits          52731    52727       -4     
- Misses        18243    18250       +7     
Files with missing lines Coverage Δ
src/server.c 89.52% <100.00%> (+0.09%) ⬆️

... and 29 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@li-benson

Copy link
Copy Markdown
Contributor Author

Can't we somewhere in replica write logic call rememberreplicaKeyWithExpire()? That way we won't need to manually clean up anything or am I missing something?

Also, can you please signoff commits using git commit --signoff to pass DCO check?

Yes, currently, when a writable replica receives a write operation for a key with an expiration time, it invokes the rememberReplicaKeyWithExpire function. This function stores the key in the replicaKeysWithExpire dictionary, which is used to proactively and periodically clean up these expired keys on the replica. However, when the node is promoted to a primary, this dictionary is no longer required—and the current logic lacks any mechanism to clean it up, resulting in a memory leak.
Our application scenario is as follows: during server maintenance, we first switch all traffic to the replica, and then promote it to a primary node. During this process, we observed a memory leak on the newly promoted primary node, so a fix is required.

@li-benson li-benson force-pushed the fix_replciaExpireDict branch from 679a224 to abb7706 Compare December 19, 2025 07:31
@dvkashapov

dvkashapov commented Dec 19, 2025

Copy link
Copy Markdown
Member

during server maintenance, we first switch all traffic to the replica, and then promote it to a primary node.

Wow, that's unusual, don't you want to let your client to retry writes to primary after promotion?

But the need to expire those keys makes sense to me. If we plan to have repl_expire_clean_enabled always enabled, maybe we need that logic inside activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW); and add comment to signify that this is needed when writable replica is used?

But other than that - LGTM

@dvkashapov dvkashapov self-requested a review December 19, 2025 07:49
@li-benson

Copy link
Copy Markdown
Contributor Author

during server maintenance, we first switch all traffic to the replica, and then promote it to a primary node.

Wow, that's unusual, don't you want to let your client to retry writes to primary after promotion?

But the need to expire those keys makes sense to me. If we plan to have repl_expire_clean_enabled always enabled, maybe we need that logic inside activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW); and add comment to signify that this is needed when writable replica is used?

But other than that - LGTM

during server maintenance, we first switch all traffic to the replica, and then promote it to a primary node.

Wow, that's unusual, don't you want to let your client to retry writes to primary after promotion?

But the need to expire those keys makes sense to me. If we plan to have repl_expire_clean_enabled always enabled, maybe we need that logic inside activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW); and add comment to signify that this is needed when writable replica is used?

But other than that - LGTM

At first, I also considered whether to place this piece of logic into activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW). However, upon further reflection, I realized there are still differences between the two functionalities: activeExpireCycle impacts actual keys (e.g., deleting them), whereas flushReplicaKeysWithExpireList only cleans up useless data, frees leaked memory, and has no impact on keys whatsoever. I believe this functionality does not fall under the scope of active expiration and should not be governed by the active_expire_enabled parameter; instead, it ought to be controlled by a separate parameter repl_expire_clean_enabled. What do you think?

@dvkashapov

Copy link
Copy Markdown
Member

If we have some key that shouldn't be here then we need to delete, I don't think we need a parameter here to control that, thats why maybe it's better to put this check alongside activeExpireCycle?

@li-benson

Copy link
Copy Markdown
Contributor Author

If we have some key that shouldn't be here then we need to delete, I don't think we need a parameter here to control that, thats why maybe it's better to put this check alongside activeExpireCycle?

If we have some key that shouldn't be here then we need to delete, I don't think we need a parameter here to control that, thats why maybe it's better to put this check alongside activeExpireCycle?

On second thought, your consideration makes sense. We can remove this newly added parameter as well—it will cut down unnecessary redundancy. I’ll make the changes.

…pired keys to replica

Signed-off-by: li-benson <1260437731@qq.com>
@li-benson li-benson force-pushed the fix_replciaExpireDict branch from abb7706 to db85fed Compare December 19, 2025 08:54

@dvkashapov dvkashapov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but DCO is failing, need to signoff again

…pired keys to replica

Signed-off-by: li-benson <1260437731@qq.com>
@li-benson li-benson force-pushed the fix_replciaExpireDict branch from b473ef3 to d89c876 Compare December 19, 2025 09:02
@ranshid

ranshid commented Dec 23, 2025

Copy link
Copy Markdown
Member
  1. We are getting rid of the dictionary but the keys themselves (which were never replicated) are still in the main dictionary right? these keys could be replicated right after when the old replica will restart and sync with the new promoted primary and there might still be a leakage since there is nothing that will free these keys.
  2. So the way I know the writable replica feature, it is mainly used when RFR is used and there is a need to store some intermediate data (eg some side calculation like a set union or something). Doing this would basically mean that such operations would break in case a failover occurred and they are done outside of a script/transaction context (not sure how important this case is though).

IMO it is fine to keep the 2nd limitation (ie do NOT persist the replica written keys after failover) but #1 is something that needs attention IMO. maybe it is important to flush them and delete them when a replica is promoted, rather than to only delete the expiration dictionary?
It might also be possible to keep the keys and manage expiring them as part of the primary expiration cycle, but it would also require to avoid replicating these keys (which is not trivial at all)

@li-benson

Copy link
Copy Markdown
Contributor Author
  1. We are getting rid of the dictionary but the keys themselves (which were never replicated) are still in the main dictionary right? these keys could be replicated right after when the old replica will restart and sync with the new promoted primary and there might still be a leakage since there is nothing that will free these keys.
  2. So the way I know the writable replica feature, it is mainly used when RFR is used and there is a need to store some intermediate data (eg some side calculation like a set union or something). Doing this would basically mean that such operations would break in case a failover occurred and they are done outside of a script/transaction context (not sure how important this case is though).

IMO it is fine to keep the 2nd limitation (ie do NOT persist the replica written keys after failover) but #1 is something that needs attention IMO. maybe it is important to flush them and delete them when a replica is promoted, rather than to only delete the expiration dictionary? It might also be possible to keep the keys and manage expiring them as part of the primary expiration cycle, but it would also require to avoid replicating these keys (which is not trivial at all)

Yes, writing expired keys to a replica stores them in both the main dict and replica expiration dict (for proactive expiration). When the replica becomes primary, the replica expiration dict becomes obsolete and unused—undeniably a memory leak.

Are you saying that besides this metadata, the actual replica-written keys & values are also unused by the primary and should be cleaned up (and not replicated to new replicas)? This is reasonable in theory, but seems overly radical to me. We cannot guarantee the primary will never need this data. In our failover scenario: traffic is switched to the replica first, then the replica is promoted to primary—we need this data post-promotion, so deleting it would contradict expectations.

Maybe we can be conservative for the time being?

@ranshid

ranshid commented Dec 23, 2025

Copy link
Copy Markdown
Member
  1. We are getting rid of the dictionary but the keys themselves (which were never replicated) are still in the main dictionary right? these keys could be replicated right after when the old replica will restart and sync with the new promoted primary and there might still be a leakage since there is nothing that will free these keys.
  2. So the way I know the writable replica feature, it is mainly used when RFR is used and there is a need to store some intermediate data (eg some side calculation like a set union or something). Doing this would basically mean that such operations would break in case a failover occurred and they are done outside of a script/transaction context (not sure how important this case is though).

IMO it is fine to keep the 2nd limitation (ie do NOT persist the replica written keys after failover) but #1 is something that needs attention IMO. maybe it is important to flush them and delete them when a replica is promoted, rather than to only delete the expiration dictionary? It might also be possible to keep the keys and manage expiring them as part of the primary expiration cycle, but it would also require to avoid replicating these keys (which is not trivial at all)

Yes, writing expired keys to a replica stores them in both the main dict and replica expiration dict (for proactive expiration). When the replica becomes primary, the replica expiration dict becomes obsolete and unused—undeniably a memory leak.

Are you saying that besides this metadata, the actual replica-written keys & values are also unused by the primary and should be cleaned up (and not replicated to new replicas)? This is reasonable in theory, but seems overly radical to me. We cannot guarantee the primary will never need this data. In our failover scenario: traffic is switched to the replica first, then the replica is promoted to primary—we need this data post-promotion, so deleting it would contradict expectations.

Maybe we can be conservative for the time being?

I am saying that the "leak" is not really being fixed. not only the keys are kept on the primary follow the failover, they are also potentially replicated to the replicas which are performing full sync follow this failover.
In general there are many holes in the writable replicas design:

  1. Not all written keys expiration is managed only those written to dbid < 64)
  2. these keys persist after failover, but only on full synced replicas. this means that the cluster data would be out of sync following failover and this would not show on the repl offset.

I think there is no "good" solution in the existing implementation. Probably better solution should be that each key written on a replica is also recorded in the replicaKeysWithExpire, even when it has NO expiration (meaning it should be remembered in dbAddInternal). We can maintain separation of these keys by verify and fail the replication or delete the key when a key is attempted to be written to by the replica client and it exists in the "remember dict". We can also flush all these keys right when we become primary (and not via background active expiration job) - maybe in replicationUnsetPrimary?

Maybe we can be conservative for the time being?

We can. I think that for keys with expiration, they should probably get expired eventually, so I guess this will not introduce any degradation.

@ranshid ranshid left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks simple enough. Left some comments.
One thing I do think about is if this should have been an info field or not?
I know some prefer avoid adding more and more info fields, but IMO it makes it much easier to identify issues and help debug cases when it is exposed in info.
@madolson (IIRC you said you prefer to avoid adding unnecessary info fields) WDYT?

Comment thread tests/unit/expire.tcl Outdated
Comment thread tests/unit/expire.tcl Outdated
Comment thread tests/unit/expire.tcl Outdated
Comment thread tests/unit/expire.tcl Outdated
Comment thread tests/unit/expire.tcl Outdated
Comment thread tests/unit/expire.tcl Outdated
Comment thread src/server.c

@ranshid ranshid left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, accept the indication via DEBUG command rather than an info field.
I know actually some other maintainers PREFER to avoid adding info fields, so will wait to hear their opinions

Comment thread tests/unit/expire.tcl
@ranshid

ranshid commented Jan 6, 2026

Copy link
Copy Markdown
Member

@li-benson IMO this can be placed in the Memory, Debug or Replication sections of the INFO output. Debug commands are mostly used in order to allow tests to run, but I find this information something that applications can still use in some ways. Personally I feel it can be part of the Memory section, since this is probably the main purpose of this data.

…pired keys to replica

Signed-off-by: li-benson <1260437731@qq.com>
@li-benson

Copy link
Copy Markdown
Contributor Author

@li-benson IMO this can be placed in the Memory, Debug or Replication sections of the INFO output. Debug commands are mostly used in order to allow tests to run, but I find this information something that applications can still use in some ways. Personally I feel it can be part of the Memory section, since this is probably the main purpose of this data.

I just found this information in info Stats: slave_expires_tracked_keys. It's redundant for me to use the debug command to get it, so I'll make a modification.

…pired keys to replica

Signed-off-by: li-benson <1260437731@qq.com>
…pired keys to replica

Signed-off-by: li-benson <1260437731@qq.com>
@ranshid ranshid merged commit f5d7dc8 into valkey-io:unstable Jan 28, 2026
24 checks passed
harrylin98 pushed a commit to harrylin98/valkey_forked that referenced this pull request Feb 19, 2026
…mary after direct writes of keys with expiration (valkey-io#2953)

The bug mentioned in this
valkey-io#2952 has been fixed.

---------

Signed-off-by: li-benson <1260437731@qq.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Mar 5, 2026
…mary after direct writes of keys with expiration (valkey-io#2953)

The bug mentioned in this
valkey-io#2952 has been fixed.

---------

Signed-off-by: li-benson <1260437731@qq.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants