Skip to content

Conversation

@sundb
Copy link
Collaborator

@sundb sundb commented Mar 11, 2025

After #13167, when a client calls FLUSHDB comand, we still async empty database, and the client was blocked until the lazyfree completes.

  1. If another client calls SLAVEOF command during this time, the server will unblock all blocked clients, including those blocked by the lazyfree. However, when unblocking a lazyfree blocked client, we forgot to call updateStatsOnUnblock(), which ultimately triggered the following assertion.
Logged crash report (pid 281628):
=== REDIS BUG REPORT START: Cut & paste starting from here ===
281628:M 11 Mar 2025 22:05:26.489 # === ASSERTION FAILED ===
281628:M 11 Mar 2025 22:05:26.489 # ==> networking.c:2244 'c->duration == 0' is not true

------ STACK TRACE ------

281632 bio_lazy_free
/lib/x86_64-linux-gnu/libc.so.6(clock_nanosleep+0xbf)[0x7f3e9aeecadf]
/lib/x86_64-linux-gnu/libc.so.6(nanosleep+0x17)[0x7f3e9aef9a27]
/lib/x86_64-linux-gnu/libc.so.6(sleep+0x43)[0x7f3e9af0ec63]
src/redis-server 127.0.0.1:21111(bioProcessBackgroundJobs+0x48c)[0x55b023c3030c]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f3e9ae9caa4]
/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x7f3e9af29c3c]

281628 redis-server *
src/redis-server 127.0.0.1:21111(unblockClient+0x400)[0x55b023c53c30]
src/redis-server 127.0.0.1:21111(disconnectAllBlockedClients+0xcf)[0x55b023c53fbf]
src/redis-server 127.0.0.1:21111(replicationSetMaster+0x4c)[0x55b023ba15dc]
src/redis-server 127.0.0.1:21111(replicaofCommand+0xd0)[0x55b023ba21d0]
src/redis-server 127.0.0.1:21111(call+0x71c)[0x55b023b581cc]
src/redis-server 127.0.0.1:21111(processCommand+0xae5)[0x55b023b59005]
src/redis-server 127.0.0.1:21111(processInputBuffer+0xe9)[0x55b023b7bae9]
src/redis-server 127.0.0.1:21111(readQueryFromClient+0x368)[0x55b023b808d8]
src/redis-server 127.0.0.1:21111(+0x20a9c4)[0x55b023cb69c4]
src/redis-server 127.0.0.1:21111(aeMain+0xf9)[0x55b023b42029]
src/redis-server 127.0.0.1:21111(main+0x46d)[0x55b023b3556d]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7f3e9ae2a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7f3e9ae2a28b]
src/redis-server 127.0.0.1:21111(_start+0x25)[0x55b023b36e05]

281630 bio_close_file
/lib/x86_64-linux-gnu/libc.so.6(+0x98d71)[0x7f3e9ae98d71]
/lib/x86_64-linux-gnu/libc.so.6(pthread_cond_wait+0x20d)[0x7f3e9ae9b7ed]
src/redis-server 127.0.0.1:21111(bioProcessBackgroundJobs+0x1ff)[0x55b023c3007f]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f3e9ae9caa4]
/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x7f3e9af29c3c]

281631 bio_aof
/lib/x86_64-linux-gnu/libc.so.6(+0x98d71)[0x7f3e9ae98d71]
/lib/x86_64-linux-gnu/libc.so.6(pthread_cond_wait+0x20d)[0x7f3e9ae9b7ed]
src/redis-server 127.0.0.1:21111(bioProcessBackgroundJobs+0x1ff)[0x55b023c3007f]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f3e9ae9caa4]
/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x7f3e9af29c3c]
  1. If a client blocked by Lazyfree is unblocked midway, and at this point the bio_comp_list has already received the completion notification for the bio, we might end up processing a client that has already been unblocked in flushallSyncBgDone(). Therefore, we need to filter it out.

failed CI: https://github.com/sundb/redis/actions/runs/13802290567/job/38606761833

@sundb
Copy link
Collaborator Author

sundb commented Mar 11, 2025

CI with smoking test: https://github.com/sundb/redis/actions/runs/13790169463

@sundb sundb requested a review from moticless March 11, 2025 14:09
@sundb sundb force-pushed the fix_duration_assertion branch from 8d641af to 8d32fe8 Compare March 11, 2025 14:18
@sundb sundb requested a review from oranagra March 11, 2025 14:19
@sundb sundb force-pushed the fix_duration_assertion branch from 59c8a73 to 481e9cc Compare March 11, 2025 14:26
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but i'll take this opportunity to note something else i recently noticed which might be related.

in unblockClient we're calling resetClient, but for some flows (notably flushallSyncBgDone, we're calling both unblockClient and commandProcessed (which calls resetClient too).
currently, there's no harm for calling resetClient twice, but there could be (in my POC, there was).
i ended up adding an exception in unblockClient (like we have for BLOCKED_SHUTDOWN), but this is probably wrong, and we should better define who's responsible of clearing both of these (resetClient, and updateStatsOnUnblock in each flow, avoid calling them twice, or not calling them at all).

@oranagra
Copy link
Member

changes LGTM, from my perspective, unless you want to proceed with the cleanup around unblockClient and commandProcessed that i pointed out, you can merge it.

@sundb
Copy link
Collaborator Author

sundb commented Mar 12, 2025

changes LGTM, from my perspective, unless you want to proceed with the cleanup around unblockClient and commandProcessed that i pointed out, you can merge it.

@oranagra since this PR needs to backport to 7.4, let's put some other PR to handle it?

@sundb sundb merged commit a5a3afd into redis:unstable Mar 17, 2025
18 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in Redis 8.2 Mar 17, 2025
sundb added a commit that referenced this pull request Mar 19, 2025
Fix #13853 (review)

This PR ensures that the client's current command is not reset by
unblockClient(), while still needing to be handled after `unblockclient()`.
The FLUSH command still requires reprocessing (update the replication
offset) after unblockClient(). Therefore, we mark such blocked clients
with the CLIENT_PENDING_COMMAND flag to prevent the command from being
reset during unblockClient().
@sundb sundb added the release-notes indication that this issue needs to be mentioned in the release notes label Mar 21, 2025
sundb added a commit that referenced this pull request Apr 13, 2025
This test was introduced by #13853
We determine if the client is in blocked status, but if async flushdb is
completed before checking the blocked status, the test will fail.
So modify the test to only determine if `lazyfree_pending_objects` is
correct to ensure that flushdb is async, that is, the client must be
blocked.
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Apr 22, 2025
…13853)

After redis#13167, when a client calls
`FLUSHDB` command, we still async empty database, and the client was
blocked until the lazyfree completes.

1) If another client calls `SLAVEOF` command during this time, the
server will unblock all blocked clients, including those blocked by the
lazyfree. However, when unblocking a lazyfree blocked client, we forgot
to call `updateStatsOnUnblock()`, which ultimately triggered the
following assertion.

2) If a client blocked by Lazyfree is unblocked midway, and at this
point the `bio_comp_list` has already received the completion
notification for the bio, we might end up processing a client that has
already been unblocked in `flushallSyncBgDone()`. Therefore, we need to
filter it out.

---------

Co-authored-by: oranagra <oran@redislabs.com>
@YaacovHazan YaacovHazan moved this from Todo to Done in Redis 7.4 Backport Apr 29, 2025
@sundb sundb added this to Redis 8.0 Aug 15, 2025
@sundb sundb removed this from Redis 8.2 Aug 15, 2025
@sundb sundb moved this from Todo to Done in Redis 8.0 Aug 15, 2025
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…13853)

After redis#13167, when a client calls
`FLUSHDB` command, we still async empty database, and the client was
blocked until the lazyfree completes.

1) If another client calls `SLAVEOF` command during this time, the
server will unblock all blocked clients, including those blocked by the
lazyfree. However, when unblocking a lazyfree blocked client, we forgot
to call `updateStatsOnUnblock()`, which ultimately triggered the
following assertion.

2) If a client blocked by Lazyfree is unblocked midway, and at this
point the `bio_comp_list` has already received the completion
notification for the bio, we might end up processing a client that has
already been unblocked in `flushallSyncBgDone()`. Therefore, we need to
filter it out.

---------

Co-authored-by: oranagra <oran@redislabs.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
Fix redis#13853 (review)

This PR ensures that the client's current command is not reset by
unblockClient(), while still needing to be handled after `unblockclient()`.
The FLUSH command still requires reprocessing (update the replication
offset) after unblockClient(). Therefore, we mark such blocked clients
with the CLIENT_PENDING_COMMAND flag to prevent the command from being
reset during unblockClient().
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
This test was introduced by redis#13853
We determine if the client is in blocked status, but if async flushdb is
completed before checking the blocked status, the test will fail.
So modify the test to only determine if `lazyfree_pending_objects` is
correct to ensure that flushdb is async, that is, the client must be
blocked.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants