Request
Currently, our stop package scripts stop every container running on a given machine in dependency order using docker stop, which by default sends SIGTERM to the main process inside the container, followed by SIGKILL after a 10-second grace period.
Unfortunately, the compression scheduler currently has no signal handler to initiate a graceful shutdown on SIGTERM, and a 10-second grace period is insufficient for a graceful shutdown in our system. As a result, a shutdown that occurs while compression jobs are running can leave jobs in a bad state, or potentially duplicate or lose data.
Possible implementation
To ensure that running compression jobs complete before shutdown, the following two conditions should be sufficient, even in a distributed setup:
- Upon receiving SIGTERM, a compression worker finishes any currently running job and shuts down without fetching a new job.
- Upon receiving SIGTERM, the compression scheduler stops dispatching new jobs and waits for currently running jobs to finish before terminating.
Since we cannot guarantee liveness, we must also support forceful termination after some grace period (implemented by sending SIGKILL or otherwise).
Celery workers already register signal handlers that do most of what we want by default (except they seem to use SIGTERM/SIGINT/SIGQUIT).
We would also need to write a signal handler for the compression scheduler that handles graceful shutdown as described.
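As a rough illustration of the scheduler-side behavior, here is a minimal Python sketch. It assumes the scheduler is a polling loop; the names `run_scheduler`, `dispatch`, `is_finished`, and `install_handlers` are hypothetical and do not come from the existing code. The handler only sets a flag, and the loop checks that flag before dispatching anything new.

```python
import signal
import threading
import time

# Set by the signal handler; checked by the scheduling loop.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Keep the handler trivial: record the request and return.
    stop_requested.set()


def install_handlers():
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_scheduler(pending_jobs, dispatch, is_finished, poll_interval=1.0):
    """Hypothetical scheduling loop.

    dispatch(job) hands a job to a worker; is_finished(job) reports
    whether a dispatched job has completed. Both are placeholders for
    whatever the real scheduler uses. Returns the jobs never dispatched.
    """
    pending = list(pending_jobs)
    running = []
    while True:
        # Dispatch new work only while no shutdown has been requested.
        if not stop_requested.is_set() and pending:
            job = pending.pop(0)
            dispatch(job)
            running.append(job)

        # Drop jobs that have completed since the last poll.
        running = [job for job in running if not is_finished(job)]

        # After SIGTERM, exit once every dispatched job has finished.
        if stop_requested.is_set() and not running:
            return pending

        time.sleep(poll_interval)
```

In this sketch the process never exits on its own while jobs are still running, so the forceful termination after a grace period described above would still have to come from outside (for example, SIGKILL from the stop script).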
After that, it would just be a matter of modifying our stop script to send the compression worker and scheduler a "soft" kill signal, followed by a "hard" kill signal after a reasonable timeout (e.g. 5 minutes). The "hard" timeout could be made configurable via a command-line argument on the package stop script.
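One possible way to get the soft-then-hard behavior is to pass a longer grace period to docker stop through its `-t` option, which sets how many seconds Docker waits after the stop signal before sending SIGKILL. The Python sketch below assumes that approach; the `--timeout` argument name, the positional container list, and the helper names are hypothetical and not part of the current stop script.

```python
import argparse
import subprocess

# Default "hard" timeout in seconds (the 5-minute example above).
DEFAULT_GRACE_SECONDS = 300


def build_stop_command(container, grace_seconds):
    # docker stop sends the stop signal (SIGTERM by default), waits up
    # to grace_seconds, and then sends SIGKILL.
    return ["docker", "stop", "-t", str(grace_seconds), container]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Stop package containers")
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_GRACE_SECONDS,
        help="seconds to wait for a graceful shutdown before SIGKILL",
    )
    parser.add_argument("containers", nargs="+", help="containers to stop, in order")
    return parser.parse_args(argv)


def stop_containers(containers, grace_seconds, run=subprocess.run):
    # Stop containers one at a time, in the order given.
    for container in containers:
        run(build_stop_command(container, grace_seconds), check=True)
```

The `run` parameter exists only so the command construction can be exercised without Docker installed. A real stop script would call `stop_containers(args.containers, args.timeout)` after `args = parse_args()`, and might want the longer timeout only for the compression worker and scheduler containers.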