Fix XREADGROUP BLOCK stuck in endless loop #12301
Conversation
For the XREADGROUP BLOCK > scenario, there is an endless loop. Due to #11012, it keeps going through the cycle: reprocess command -> blockForKeys -> reprocess command.

The right fix is to avoid the endless loop in handleClientsBlockedOnKey and handleClientsBlockedOnKeys. It looks like there was some attempt at this in handleClientsBlockedOnKeys, but maybe not a sufficiently good one, and using a similar trick in handleClientsBlockedOnKey is complicated: stashing the list on the stack and iterating on it after creating a fresh one for future use is problematic, since the code keeps accessing the global list.

The fix is proposed by oranagra.

Fixes #12290

Co-authored-by: Oran Agra <oran@redislabs.com>
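For illustration, here is a minimal sketch of one way to avoid re-serving a client that re-blocks within the same pass: bound the loop by the number of clients that were blocked when the pass started. This is a sketch, not the PR's actual diff; `clients`, `serveBlockedClient`, and the function name are hypothetical stand-ins for the real blocked.c logic.

```c
/* Sketch only: serve each client blocked on one key at most once per
 * pass. `clients` is the list of clients blocked on the key; a client
 * that re-blocks is appended back to the tail, so bounding the loop by
 * the initial length leaves it for a later pass instead of looping
 * forever on it. */
void handleClientsBlockedOnKeySketch(list *clients) {
    long count = listLength(clients);
    while (count--) {
        listNode *ln = listFirst(clients);
        if (ln == NULL) break;       /* list drained early */
        client *c = listNodeValue(ln);
        /* serveBlockedClient() is hypothetical: it either replies and
         * unblocks the client (removing it from `clients`), or
         * re-blocks it, appending it to the tail of `clients`. */
        serveBlockedClient(c);
    }
}
```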
daily with the fix: https://github.com/enjoy-binbin/redis/actions/runs/5238908365
oranagra left a comment:
did you confirm that the test reproduces the problem?
yes, it reproduces stably locally
@guybe7 we get the following assertion in the schema validator: i suppose it's because a command was re-processed and blocked again without sending any reply. please advise
i did not see the assertion before (it happened in module-api, so I ignored it). it failed on
ohh, i looked at the assertion code in logreqres.c and wrongly concluded it's because of the new test. |
Add command being unblocked cause another command to get unblocked execution order test (#12324)

In #12301, we observed that if the `while(listLength(server.ready_keys) != 0)` in handleClientsBlockedOnKeys is changed to `if(listLength(server.ready_keys) != 0)`, the order of command execution changes. It is wrong to change that: if a command being unblocked causes another command to get unblocked (like a BLMOVE would do), the newly unblocked command would wait to be processed later rather than right away. Changing it would have no real implication, since we call handleClientsBlockedOnKeys again in beforeSleep and redis would still behave correctly, but we keep the `while`.

An example:
1. $rd1 blmove src{t} dst{t} left right 0
2. $rd2 blmove dst{t} src{t} right left 0
3. $rd3 set key1{t}, $rd3 lpush src{t}, $rd3 set key2{t} in a pipeline

The correct order would be:
1. set key1{t}
2. lpush src{t}
3. lmove src{t} dst{t} left right
4. lmove dst{t} src{t} right left
5. set key2{t}

The wrong order would be:
1. set key1{t}
2. lpush src{t}
3. lmove src{t} dst{t} left right
4. set key2{t}
5. lmove dst{t} src{t} right left

This PR adds a corresponding test to cover it.

* Add comment near while(listLength(server.ready_keys) != 0)
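For context, the `while` in question drains server.ready_keys within the same pass: serving one blocked client can make another key ready and append it to the list. A simplified sketch of that drain pattern follows (an approximation, not the exact redis source):

```c
/* Sketch of the drain loop in handleClientsBlockedOnKeys. Serving a
 * blocked client may make another key ready (e.g. a BLMOVE pushing to
 * a second list), which appends to server.ready_keys; the outer
 * `while` keeps draining so those nested wakeups are served in the
 * same pass, preserving the execution order described above. */
while (listLength(server.ready_keys) != 0) {
    list *l = server.ready_keys;
    /* Swap in a fresh list so wakeups produced while serving `l`
     * are collected for the next round of the loop. */
    server.ready_keys = listCreate();
    /* ... serve every client blocked on the keys recorded in `l`,
     * which may append entries to the new server.ready_keys ... */
    listRelease(l);
}
```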
Apart from adding the missing coverage, this PR also adds `blockedBeforeSleep`, which gathers all block-related functions from `beforeSleep`.

The order inside `blockedBeforeSleep` is different: now `handleClientsBlockedOnKeys` (which may unblock clients) is called before `processUnblockedClients` (which handles unblocked clients). It makes sense to have this order. There are no visible effects of the wrong ordering, except that some cleanups of the now-unblocked client would have happened in the next `beforeSleep` (they will now happen in the current one).

The reason we even got into it is that it triggers an assertion in logreqres.c (breaking the assumption that `unblockClient` is called **before** actually flushing the reply to the socket): `handleClientsBlockedOnKeys` is called, then it calls `moduleUnblockClientOnKey`, which calls `moduleUnblockClient`, which adds the client to `moduleUnblockedClients`. Back in `beforeSleep`, we call `handleClientsWithPendingWritesUsingThreads`, which writes the data of buf to the client, so `client->bufpos` becomes 0. On the next `beforeSleep`, we call `moduleHandleBlockedClients`, which calls `unblockClient`, which calls `reqresAppendResponse`, triggering the assert (because `bufpos` is 0) - see #12301 (comment)
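A minimal sketch of the resulting ordering (simplified: the real `blockedBeforeSleep` gathers more block-related calls, e.g. timeout handling, than shown here):

```c
/* Sketch: unblocking runs before unblocked clients are processed, so
 * a client unblocked by ready keys is handled and cleaned up in the
 * same beforeSleep instead of lingering until the next one. */
void blockedBeforeSleep(void) {
    /* May unblock clients whose keys became ready. */
    handleClientsBlockedOnKeys();
    /* Other block-related steps are elided in this sketch. */
    /* Process the commands of clients that were just unblocked. */
    processUnblockedClients();
}
```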