-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Wait for replicas when shutting down #9872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bjosv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
The SHUTDOWN command is made a blocking command, unblocked only by a failure to shutdown (e.g. failure to write the rdb file).
fc0d040 to
77d4add
Compare
|
@zuiderkwast it occurred to me that there's probably an opportunity for some semi-related cleanup. while we are no the subject, i did notice that we're terminating the AOF child only if AOF is enabled, but currently it is also possible to do BGREWRITEAOF when it's disabled (see #9794), so i think the child should be stopped. |
This reverts commit ee6ee3a. The reverted commit broke the following test cases: *** [err]: diskless slow replicas drop during rdb pipe in tests/integration/replication.tcl log message of '"*Diskless rdb transfer, done reading from pipe, 1 replicas still up*"' not found in ./tests/tmp/server.2561.79/stdout after line: 60 till line: 80 *** [err]: diskless all replicas drop during rdb pipe in tests/integration/replication.tcl log message of '"*Diskless rdb transfer, last replica dropped, killing fork child*"' not found in ./tests/tmp/server.2561.79/stdout after line: 110 till line: 131
|
@redis/core-team May I have your attention again? Two more things added in this PR:
|
|
@zuiderkwast thank you. |
|
@oranagra Thank YOU! Yes, I have updated the page about signal handling. It's merged already in redis/redis-doc#1711. |
…cked (#10440) fix #10439. see #9872 When executing SHUTDOWN we pause the client so we can un-pause it if the shutdown fails. this could happen during the timeout, if the shutdown is aborted, but could also happen from withing the initial `call()` to shutdown, if the rdb save fails. in that case when we return to `call()`, we'll crash if `c->cmd` has been set to NULL. The call stack is: ``` unblockClient(c) replyToClientsBlockedOnShutdown() cancelShutdown() finishShutdown() prepareForShutdown() shutdownCommand() ``` what's special about SHUTDOWN in that respect is that it can be paused, and then un-paused before the original `call()` returns. tests where added for both failed shutdown, and a followup successful one.
To avoid data loss, this commit adds a grace period for lagging replicas to
catch up the replication offset.
Done:
Wait for replicas when shutdown is triggered by SIGTERM and SIGINT.
Wait for replicas when shutdown is triggered by the SHUTDOWN command. A new
blocked client type BLOCKED_SHUTDOWN is introduced, allowing multiple clients
to call SHUTDOWN in parallel.
Note that they don't expect a response unless an error happens and shutdown is aborted.
Log warning for each replica lagging behind when finishing shutdown.
CLIENT_PAUSE_WRITE while waiting for replicas.
Configurable grace period 'shutdown-timeout' in seconds (default 10).
New flags for the SHUTDOWN command:
NOW disables the grace period for lagging replicas.
FORCE ignores errors writing the RDB or AOF files which would normally
prevent a shutdown.
ABORT cancels ongoing shutdown. Can't be combined with other flags.
New field in the output of the INFO command: 'shutdown_in_milliseconds'. The
value is the remaining maximum time to wait for lagging replicas before
finishing the shutdown. This field is present in the Server section only
during shutdown.
Not directly related:
When shutting down, if there is an AOF saving child, it is killed even if AOF
is disabled. This can happen if BGREWRITEAOF is used when AOF is off.
Client pause now has end time and type (WRITE or ALL) per purpose. The
different pause purposes are CLIENT PAUSE command, failover and
shutdown. If clients are unpaused for one purpose, it doesn't affect client
pause for other purposes. For example, the CLIENT UNPAUSE command doesn't
affect client pause initiated by the failover or shutdown procedures. A completed
failover or a failed shutdown doesn't unpause clients paused by the CLIENT
PAUSE command.
Notes:
DEBUG RESTART doesn't wait for replicas.
We already have a warning logged when a replica disconnects. This means that
if any replica connection is lost during the shutdown, it is either logged as
disconnected or as lagging at the time of exit.
Fixes #9693