Fix crash during SLAVEOF when clients are blocked on lazyfree #13853

sundb · 2025-03-11T14:08:59Z

After #13167, when a client calls FLUSHDB comand, we still async empty database, and the client was blocked until the lazyfree completes.

If another client calls SLAVEOF command during this time, the server will unblock all blocked clients, including those blocked by the lazyfree. However, when unblocking a lazyfree blocked client, we forgot to call updateStatsOnUnblock(), which ultimately triggered the following assertion.

Logged crash report (pid 281628):
=== REDIS BUG REPORT START: Cut & paste starting from here ===
281628:M 11 Mar 2025 22:05:26.489 # === ASSERTION FAILED ===
281628:M 11 Mar 2025 22:05:26.489 # ==> networking.c:2244 'c->duration == 0' is not true

------ STACK TRACE ------

281632 bio_lazy_free
/lib/x86_64-linux-gnu/libc.so.6(clock_nanosleep+0xbf)[0x7f3e9aeecadf]
/lib/x86_64-linux-gnu/libc.so.6(nanosleep+0x17)[0x7f3e9aef9a27]
/lib/x86_64-linux-gnu/libc.so.6(sleep+0x43)[0x7f3e9af0ec63]
src/redis-server 127.0.0.1:21111(bioProcessBackgroundJobs+0x48c)[0x55b023c3030c]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f3e9ae9caa4]
/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x7f3e9af29c3c]

281628 redis-server *
src/redis-server 127.0.0.1:21111(unblockClient+0x400)[0x55b023c53c30]
src/redis-server 127.0.0.1:21111(disconnectAllBlockedClients+0xcf)[0x55b023c53fbf]
src/redis-server 127.0.0.1:21111(replicationSetMaster+0x4c)[0x55b023ba15dc]
src/redis-server 127.0.0.1:21111(replicaofCommand+0xd0)[0x55b023ba21d0]
src/redis-server 127.0.0.1:21111(call+0x71c)[0x55b023b581cc]
src/redis-server 127.0.0.1:21111(processCommand+0xae5)[0x55b023b59005]
src/redis-server 127.0.0.1:21111(processInputBuffer+0xe9)[0x55b023b7bae9]
src/redis-server 127.0.0.1:21111(readQueryFromClient+0x368)[0x55b023b808d8]
src/redis-server 127.0.0.1:21111(+0x20a9c4)[0x55b023cb69c4]
src/redis-server 127.0.0.1:21111(aeMain+0xf9)[0x55b023b42029]
src/redis-server 127.0.0.1:21111(main+0x46d)[0x55b023b3556d]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7f3e9ae2a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7f3e9ae2a28b]
src/redis-server 127.0.0.1:21111(_start+0x25)[0x55b023b36e05]

281630 bio_close_file
/lib/x86_64-linux-gnu/libc.so.6(+0x98d71)[0x7f3e9ae98d71]
/lib/x86_64-linux-gnu/libc.so.6(pthread_cond_wait+0x20d)[0x7f3e9ae9b7ed]
src/redis-server 127.0.0.1:21111(bioProcessBackgroundJobs+0x1ff)[0x55b023c3007f]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f3e9ae9caa4]
/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x7f3e9af29c3c]

281631 bio_aof
/lib/x86_64-linux-gnu/libc.so.6(+0x98d71)[0x7f3e9ae98d71]
/lib/x86_64-linux-gnu/libc.so.6(pthread_cond_wait+0x20d)[0x7f3e9ae9b7ed]
src/redis-server 127.0.0.1:21111(bioProcessBackgroundJobs+0x1ff)[0x55b023c3007f]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f3e9ae9caa4]
/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x7f3e9af29c3c]

If a client blocked by Lazyfree is unblocked midway, and at this point the bio_comp_list has already received the completion notification for the bio, we might end up processing a client that has already been unblocked in flushallSyncBgDone(). Therefore, we need to filter it out.

failed CI: https://github.com/sundb/redis/actions/runs/13802290567/job/38606761833

sundb · 2025-03-11T14:09:19Z

CI with smoking test: https://github.com/sundb/redis/actions/runs/13790169463

oranagra

LGTM, but i'll take this opportunity to note something else i recently noticed which might be related.

in unblockClient we're calling resetClient, but for some flows (notably flushallSyncBgDone, we're calling both unblockClient and commandProcessed (which calls resetClient too).
currently, there's no harm for calling resetClient twice, but there could be (in my POC, there was).
i ended up adding an exception in unblockClient (like we have for BLOCKED_SHUTDOWN), but this is probably wrong, and we should better define who's responsible of clearing both of these (resetClient, and updateStatsOnUnblock in each flow, avoid calling them twice, or not calling them at all).

tests/unit/lazyfree.tcl

Co-authored-by: oranagra <oran@redislabs.com>

oranagra · 2025-03-12T09:32:24Z

changes LGTM, from my perspective, unless you want to proceed with the cleanup around unblockClient and commandProcessed that i pointed out, you can merge it.

sundb · 2025-03-12T09:38:54Z

changes LGTM, from my perspective, unless you want to proceed with the cleanup around unblockClient and commandProcessed that i pointed out, you can merge it.

@oranagra since this PR needs to backport to 7.4, let's put some other PR to handle it?

Fix #13853 (review) This PR ensures that the client's current command is not reset by unblockClient(), while still needing to be handled after `unblockclient()`. The FLUSH command still requires reprocessing (update the replication offset) after unblockClient(). Therefore, we mark such blocked clients with the CLIENT_PENDING_COMMAND flag to prevent the command from being reset during unblockClient().

This test was introduced by #13853 We determine if the client is in blocked status, but if async flushdb is completed before checking the blocked status, the test will fail. So modify the test to only determine if `lazyfree_pending_objects` is correct to ensure that flushdb is async, that is, the client must be blocked.

…13853) After redis#13167, when a client calls `FLUSHDB` command, we still async empty database, and the client was blocked until the lazyfree completes. 1) If another client calls `SLAVEOF` command during this time, the server will unblock all blocked clients, including those blocked by the lazyfree. However, when unblocking a lazyfree blocked client, we forgot to call `updateStatsOnUnblock()`, which ultimately triggered the following assertion. 2) If a client blocked by Lazyfree is unblocked midway, and at this point the `bio_comp_list` has already received the completion notification for the bio, we might end up processing a client that has already been unblocked in `flushallSyncBgDone()`. Therefore, we need to filter it out. --------- Co-authored-by: oranagra <oran@redislabs.com>

Fix redis#13853 (review) This PR ensures that the client's current command is not reset by unblockClient(), while still needing to be handled after `unblockclient()`. The FLUSH command still requires reprocessing (update the replication offset) after unblockClient(). Therefore, we mark such blocked clients with the CLIENT_PENDING_COMMAND flag to prevent the command from being reset during unblockClient().

This test was introduced by redis#13853 We determine if the client is in blocked status, but if async flushdb is completed before checking the blocked status, the test will fail. So modify the test to only determine if `lazyfree_pending_objects` is correct to ensure that flushdb is async, that is, the client must be blocked.

sundb requested a review from moticless March 11, 2025 14:09

Smoking test

8d32fe8

sundb force-pushed the fix_duration_assertion branch from 8d641af to 8d32fe8 Compare March 11, 2025 14:18

Fix crash during SLAVEOF when clients are blocked on lazyfree

76f2bca

sundb requested a review from oranagra March 11, 2025 14:19

Add needs:debug

481e9cc

sundb force-pushed the fix_duration_assertion branch from 59c8a73 to 481e9cc Compare March 11, 2025 14:26

Add external:skip

1f0d753

oranagra approved these changes Mar 11, 2025

View reviewed changes

oranagra reviewed Mar 11, 2025

View reviewed changes

tests/unit/lazyfree.tcl Outdated Show resolved Hide resolved

sundb and others added 3 commits March 12, 2025 10:01

Use populate instead of debug populate

da84fda

Co-authored-by: oranagra <oran@redislabs.com>

Verfify the client is being blocked in flushallSyncBgDone()

b2f37e4

Add comment for test

0025781

sundb added this to Redis 8.2 Mar 12, 2025

github-project-automation bot moved this to Todo in Redis 8.2 Mar 12, 2025

sundb added this to Redis 7.4 Backport Mar 12, 2025

github-project-automation bot moved this to Todo in Redis 7.4 Backport Mar 12, 2025

sundb mentioned this pull request Mar 15, 2025

Fix messed-up unblocked clients in flush command #13865

Merged

sundb merged commit a5a3afd into redis:unstable Mar 17, 2025
18 checks passed

github-project-automation bot moved this from Todo to Done in Redis 8.2 Mar 17, 2025

sundb added the release-notes indication that this issue needs to be mentioned in the release notes label Mar 21, 2025

sundb mentioned this pull request Apr 7, 2025

Fix timing issue in lazyfree test #13926

Merged

YaacovHazan moved this from Todo to Done in Redis 7.4 Backport Apr 29, 2025

sundb added this to Redis 8.0 Aug 15, 2025

sundb removed this from Redis 8.2 Aug 15, 2025

github-project-automation bot moved this to Todo in Redis 8.0 Aug 15, 2025

sundb moved this from Todo to Done in Redis 8.0 Aug 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix crash during SLAVEOF when clients are blocked on lazyfree #13853

Fix crash during SLAVEOF when clients are blocked on lazyfree #13853

Uh oh!

sundb commented Mar 11, 2025 •

edited

Loading

Uh oh!

sundb commented Mar 11, 2025 •

edited

Loading

Uh oh!

oranagra left a comment

Uh oh!

Uh oh!

oranagra commented Mar 12, 2025

Uh oh!

sundb commented Mar 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix crash during SLAVEOF when clients are blocked on lazyfree #13853

Fix crash during SLAVEOF when clients are blocked on lazyfree #13853

Uh oh!

Conversation

sundb commented Mar 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sundb commented Mar 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

oranagra commented Mar 12, 2025

Uh oh!

sundb commented Mar 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sundb commented Mar 11, 2025 •

edited

Loading

sundb commented Mar 11, 2025 •

edited

Loading