Fix XREADGROUP BLOCK stuck in endless loop #12301
Conversation
For the XREADGROUP BLOCK > scenario, there is an endless loop. Due to #11012, it keeps going through the cycle: reprocess command -> blockForKeys -> reprocess command.

The right fix is to avoid the endless loop in handleClientsBlockedOnKey and handleClientsBlockedOnKeys. It looks like there was some attempt at this in handleClientsBlockedOnKeys, but maybe not a sufficiently good one, and using a similar trick in handleClientsBlockedOnKey is complicated: stashing the list on the stack and iterating on it after creating a fresh one for future use is problematic, since the code keeps accessing the global list.

The fix is proposed by oranagra.

Fixes #12290

Co-authored-by: Oran Agra <oran@redislabs.com>
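For illustration, here is a minimal sketch of one way to avoid re-serving a client that re-blocks within the same pass: bound the loop by the number of clients that were blocked when the pass started. This is a sketch, not the PR's actual diff; `clients`, `serveBlockedClient`, and the function name are hypothetical stand-ins for the real blocked.c logic.

```c
/* Sketch only: serve each client blocked on one key at most once per
 * pass. `clients` is the list of clients blocked on the key; a client
 * that re-blocks is appended back to the tail, so bounding the loop by
 * the initial length leaves it for a later pass instead of looping
 * forever on it. */
void handleClientsBlockedOnKeySketch(list *clients) {
    long count = listLength(clients);
    while (count--) {
        listNode *ln = listFirst(clients);
        if (ln == NULL) break;       /* list drained early */
        client *c = listNodeValue(ln);
        /* serveBlockedClient() is hypothetical: it either replies and
         * unblocks the client (removing it from `clients`), or
         * re-blocks it, appending it to the tail of `clients`. */
        serveBlockedClient(c);
    }
}
```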
daily with the fix: https://github.com/enjoy-binbin/redis/actions/runs/5238908365
oranagra left a comment:
did you confirm that the test reproduces the problem?
yes, it reproduces stably locally
@guybe7 we get the following assertion in the schema validator: i suppose it's because a command was re-processed and blocked again without sending any reply. please advise
i did not see the assertion before (it happened in module-api, so I ignored it). it failed on
ohh, i looked at the assertion code in logreqres.c and wrongly concluded it's because of the new test. |
Add command being unblocked cause another command to get unblocked execution order test (#12324)

In #12301, we observed that if the `while(listLength(server.ready_keys) != 0)` in handleClientsBlockedOnKeys is changed to `if(listLength(server.ready_keys) != 0)`, the order of command execution changes. It is wrong to change that: if a command being unblocked causes another command to get unblocked (like a BLMOVE would do), the newly unblocked command would wait to be processed later rather than right away. Changing it would have no real implication, since we call handleClientsBlockedOnKeys again in beforeSleep and redis would still behave correctly, but we keep the `while`.

An example:
1. $rd1 blmove src{t} dst{t} left right 0
2. $rd2 blmove dst{t} src{t} right left 0
3. $rd3 set key1{t}, $rd3 lpush src{t}, $rd3 set key2{t} in a pipeline

The correct order would be:
1. set key1{t}
2. lpush src{t}
3. lmove src{t} dst{t} left right
4. lmove dst{t} src{t} right left
5. set key2{t}

The wrong order would be:
1. set key1{t}
2. lpush src{t}
3. lmove src{t} dst{t} left right
4. set key2{t}
5. lmove dst{t} src{t} right left

This PR adds a corresponding test to cover it.

* Add comment near while(listLength(server.ready_keys) != 0)
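For context, the `while` in question drains server.ready_keys within the same pass: serving one blocked client can make another key ready and append it to the list. A simplified sketch of that drain pattern follows (an approximation, not the exact redis source):

```c
/* Sketch of the drain loop in handleClientsBlockedOnKeys. Serving a
 * blocked client may make another key ready (e.g. a BLMOVE pushing to
 * a second list), which appends to server.ready_keys; the outer
 * `while` keeps draining so those nested wakeups are served in the
 * same pass, preserving the execution order described above. */
while (listLength(server.ready_keys) != 0) {
    list *l = server.ready_keys;
    /* Swap in a fresh list so wakeups produced while serving `l`
     * are collected for the next round of the loop. */
    server.ready_keys = listCreate();
    /* ... serve every client blocked on the keys recorded in `l`,
     * which may append entries to the new server.ready_keys ... */
    listRelease(l);
}
```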
Apart from adding the missing coverage, this PR also adds `blockedBeforeSleep`, which gathers all block-related functions from `beforeSleep`.

The order inside `blockedBeforeSleep` is different: now `handleClientsBlockedOnKeys` (which may unblock clients) is called before `processUnblockedClients` (which handles unblocked clients). It makes sense to have this order. There are no visible effects of the wrong ordering, except that some cleanups of the now-unblocked client would have happened in the next `beforeSleep` (they will now happen in the current one).

The reason we even got into it is that it triggers an assertion in logreqres.c (breaking the assumption that `unblockClient` is called **before** actually flushing the reply to the socket): `handleClientsBlockedOnKeys` is called, then it calls `moduleUnblockClientOnKey`, which calls `moduleUnblockClient`, which adds the client to `moduleUnblockedClients`. Back in `beforeSleep`, we call `handleClientsWithPendingWritesUsingThreads`, which writes the data of buf to the client, so `client->bufpos` becomes 0. On the next `beforeSleep`, we call `moduleHandleBlockedClients`, which calls `unblockClient`, which calls `reqresAppendResponse`, triggering the assert (because `bufpos` is 0) - see #12301 (comment)
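A minimal sketch of the resulting ordering (simplified: the real `blockedBeforeSleep` gathers more block-related calls, e.g. timeout handling, than shown here):

```c
/* Sketch: unblocking runs before unblocked clients are processed, so
 * a client unblocked by ready keys is handled and cleaned up in the
 * same beforeSleep instead of lingering until the next one. */
void blockedBeforeSleep(void) {
    /* May unblock clients whose keys became ready. */
    handleClientsBlockedOnKeys();
    /* Other block-related steps are elided in this sketch. */
    /* Process the commands of clients that were just unblocked. */
    processUnblockedClients();
}
```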