Request
When the compression scheduler restarts, some jobs in the database may already be in the RUNNING state because:
- The compression scheduler crashed/terminated and lost track of a job that could either be completed or still running
- The compression scheduler marked a job as RUNNING and subsequently restarted, but no workers attempted to run the task (e.g. because they themselves were shut down)
The second scenario is very likely to happen after #1037 is implemented if the compression scheduler and compression workers are on the same machine, because we first shut down the workers before attempting to shut down the compression scheduler. This makes it likely that for some period of time the compression scheduler continues to issue jobs after the compression workers stop accepting new jobs.
Possible implementation
This problem can be solved in two parts:
- On restart, dispatch any jobs in RUNNING state to the compression workers
  - This allows us to retry jobs that either finished after the compression scheduler crashed/terminated, or never started because workers were not accepting jobs
  - This wouldn't address the scenario where a job is dispatched before the compression scheduler crashes and only finishes after the compression scheduler restarts, but handling this case well would likely complicate state management
- Attempt to gracefully terminate the compression scheduler and compression workers at the same time to minimize the period where the compression scheduler continues dispatching jobs while the compression workers no longer accept jobs.