-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Fix crash during SLAVEOF when clients are blocked on lazyfree #13853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
CI with smoking test: https://github.com/sundb/redis/actions/runs/13790169463 |
8d641af to
8d32fe8
Compare
59c8a73 to
481e9cc
Compare
oranagra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but i'll take this opportunity to note something else i recently noticed which might be related.
in unblockClient we're calling resetClient, but for some flows (notably flushallSyncBgDone, we're calling both unblockClient and commandProcessed (which calls resetClient too).
currently, there's no harm for calling resetClient twice, but there could be (in my POC, there was).
i ended up adding an exception in unblockClient (like we have for BLOCKED_SHUTDOWN), but this is probably wrong, and we should better define who's responsible of clearing both of these (resetClient, and updateStatsOnUnblock in each flow, avoid calling them twice, or not calling them at all).
Co-authored-by: oranagra <oran@redislabs.com>
|
changes LGTM, from my perspective, unless you want to proceed with the cleanup around |
@oranagra since this PR needs to backport to 7.4, let's put some other PR to handle it? |
Fix #13853 (review) This PR ensures that the client's current command is not reset by unblockClient(), while still needing to be handled after `unblockclient()`. The FLUSH command still requires reprocessing (update the replication offset) after unblockClient(). Therefore, we mark such blocked clients with the CLIENT_PENDING_COMMAND flag to prevent the command from being reset during unblockClient().
This test was introduced by #13853 We determine if the client is in blocked status, but if async flushdb is completed before checking the blocked status, the test will fail. So modify the test to only determine if `lazyfree_pending_objects` is correct to ensure that flushdb is async, that is, the client must be blocked.
…13853) After redis#13167, when a client calls `FLUSHDB` command, we still async empty database, and the client was blocked until the lazyfree completes. 1) If another client calls `SLAVEOF` command during this time, the server will unblock all blocked clients, including those blocked by the lazyfree. However, when unblocking a lazyfree blocked client, we forgot to call `updateStatsOnUnblock()`, which ultimately triggered the following assertion. 2) If a client blocked by Lazyfree is unblocked midway, and at this point the `bio_comp_list` has already received the completion notification for the bio, we might end up processing a client that has already been unblocked in `flushallSyncBgDone()`. Therefore, we need to filter it out. --------- Co-authored-by: oranagra <oran@redislabs.com>
…13853) After redis#13167, when a client calls `FLUSHDB` command, we still async empty database, and the client was blocked until the lazyfree completes. 1) If another client calls `SLAVEOF` command during this time, the server will unblock all blocked clients, including those blocked by the lazyfree. However, when unblocking a lazyfree blocked client, we forgot to call `updateStatsOnUnblock()`, which ultimately triggered the following assertion. 2) If a client blocked by Lazyfree is unblocked midway, and at this point the `bio_comp_list` has already received the completion notification for the bio, we might end up processing a client that has already been unblocked in `flushallSyncBgDone()`. Therefore, we need to filter it out. --------- Co-authored-by: oranagra <oran@redislabs.com>
Fix redis#13853 (review) This PR ensures that the client's current command is not reset by unblockClient(), while still needing to be handled after `unblockclient()`. The FLUSH command still requires reprocessing (update the replication offset) after unblockClient(). Therefore, we mark such blocked clients with the CLIENT_PENDING_COMMAND flag to prevent the command from being reset during unblockClient().
This test was introduced by redis#13853 We determine if the client is in blocked status, but if async flushdb is completed before checking the blocked status, the test will fail. So modify the test to only determine if `lazyfree_pending_objects` is correct to ensure that flushdb is async, that is, the client must be blocked.
After #13167, when a client calls
FLUSHDBcomand, we still async empty database, and the client was blocked until the lazyfree completes.SLAVEOFcommand during this time, the server will unblock all blocked clients, including those blocked by the lazyfree. However, when unblocking a lazyfree blocked client, we forgot to callupdateStatsOnUnblock(), which ultimately triggered the following assertion.bio_comp_listhas already received the completion notification for the bio, we might end up processing a client that has already been unblocked influshallSyncBgDone(). Therefore, we need to filter it out.failed CI: https://github.com/sundb/redis/actions/runs/13802290567/job/38606761833