
Conversation


@zuiderkwast zuiderkwast commented Nov 30, 2021

To avoid data loss, this commit adds a grace period during shutdown for lagging
replicas to catch up with the master's replication offset.

Done:

  • Wait for replicas when shutdown is triggered by SIGTERM and SIGINT.

  • Wait for replicas when shutdown is triggered by the SHUTDOWN command. A new
    blocked client type BLOCKED_SHUTDOWN is introduced, allowing multiple clients
    to call SHUTDOWN in parallel.
    Note that these clients don't receive a reply unless an error happens and the shutdown is aborted.

  • Log warning for each replica lagging behind when finishing shutdown.

  • Pause write commands (CLIENT_PAUSE_WRITE) while waiting for replicas.

  • Configurable grace period 'shutdown-timeout' in seconds (default 10).

  • New flags for the SHUTDOWN command:

    • NOW disables the grace period for lagging replicas.

    • FORCE ignores errors writing the RDB or AOF files which would normally
      prevent a shutdown.

    • ABORT cancels ongoing shutdown. Can't be combined with other flags.

  • New field in the output of the INFO command: 'shutdown_in_milliseconds'. The
    value is the remaining maximum time to wait for lagging replicas before
    finishing the shutdown. This field is present in the Server section only
    during shutdown.

Not directly related:

  • When shutting down, if there is an AOF saving child, it is killed even if AOF
    is disabled. This can happen if BGREWRITEAOF is used when AOF is off.

  • Client pause now has an end time and a type (WRITE or ALL) per purpose. The
    different pause purposes are the CLIENT PAUSE command, failover and
    shutdown. Unpausing clients for one purpose doesn't affect the client
    pause for other purposes. For example, the CLIENT UNPAUSE command doesn't
    affect a client pause initiated by the failover or shutdown procedures, and a
    completed failover or a failed shutdown doesn't unpause clients paused by the
    CLIENT PAUSE command.

Notes:

  • DEBUG RESTART doesn't wait for replicas.

  • We already have a warning logged when a replica disconnects. This means that
    if any replica connection is lost during the shutdown, it is either logged as
    disconnected or as lagging at the time of exit.

Fixes #9693

@zuiderkwast zuiderkwast marked this pull request as ready for review November 30, 2021 18:20
@zuiderkwast zuiderkwast requested a review from yossigo December 1, 2021 11:24
@bjosv bjosv left a comment

Nice!


oranagra commented Dec 9, 2021

@zuiderkwast it occurred to me that there's probably an opportunity for some semi-related cleanup.
When we get SIGTERM during loading, we immediately exit.
IIRC this code was written before prepareForShutdown had the NOSAVE option (which was added for the SHUTDOWN command).
Now that it has it, I think it's better to go through prepareForShutdown and pass the appropriate flags (NOSAVE / FORCE).
There should be no fork children at that time, but it's still a good idea to go through the other cleanup steps (e.g. modules).

While we are on the subject, I noticed that we terminate the AOF child only if AOF is enabled, but currently it is also possible to run BGREWRITEAOF when it's disabled (see #9794), so I think that child should be stopped too.
So, if you're already working in that area, we can make these additional cleanups.

This reverts commit ee6ee3a.

The reverted commit broke the following test cases:

```
*** [err]: diskless slow replicas drop during rdb pipe in tests/integration/replication.tcl
log message of '"*Diskless rdb transfer, done reading from pipe, 1 replicas still up*"' not found in ./tests/tmp/server.2561.79/stdout after line: 60 till line: 80
*** [err]: diskless all replicas drop during rdb pipe in tests/integration/replication.tcl
log message of '"*Diskless rdb transfer, last replica dropped, killing fork child*"' not found in ./tests/tmp/server.2561.79/stdout after line: 110 till line: 131
```
@zuiderkwast

@redis/core-team May I have your attention again? Two more things added in this PR:

  • Client pause per purpose (failover, shutdown, client pause command)
  • SHUTDOWN ABORT

@oranagra oranagra merged commit 45a155b into redis:unstable Jan 2, 2022

oranagra commented Jan 2, 2022

@zuiderkwast thank you.
I suppose there are some docs that would benefit from an update about this change:
obviously the SHUTDOWN command (which must mention the new config and the new arguments), but maybe some other places too?

@zuiderkwast

@oranagra Thank YOU! Yes, I have updated the page about signal handling. It's merged already in redis/redis-doc#1711.

@zuiderkwast zuiderkwast deleted the graceful-shutdown branch March 17, 2022 12:33
oranagra pushed a commit that referenced this pull request Mar 20, 2022
…cked (#10440)

fix #10439. see #9872
When executing SHUTDOWN we pause the client so we can un-pause it
if the shutdown fails.
This could happen during the timeout, if the shutdown is aborted, but it could
also happen from within the initial `call()` to shutdown, if the RDB save fails.
In that case, when we return to `call()`, we'll crash if `c->cmd` has been set to NULL.

The call stack is:
```
unblockClient(c)
replyToClientsBlockedOnShutdown()
cancelShutdown()
finishShutdown()
prepareForShutdown()
shutdownCommand()
```

What's special about SHUTDOWN in this respect is that the client can be paused,
and then un-paused before the original `call()` returns.
Tests were added for both a failed shutdown and a follow-up successful one.

Labels

  • approval-needed: Waiting for core team approval to be merged
  • release-notes: indication that this issue needs to be mentioned in the release notes
  • state:major-decision: Requires core team consensus
  • state:needs-doc-pr: requires a PR to the redis-doc repository

Development

Successfully merging this pull request may close these issues.

[BUG] Data loss when performing a graceful master shutdown under high load

8 participants