
Fail in-progress jobs when the worker running them exits abnormally #277

Merged
rosa merged 5 commits into main from fail-jobs-when-worker-is-killed on Aug 21, 2024

@rosa rosa commented Aug 12, 2024

This applies to:

  • Killed workers that the supervisor detects as dead.
  • Reaped workers without a clear exit status.
  • Orphaned executions that somehow lost their worker.
  • Workers whose heartbeat expired.

To make this easy, we need a new unique identifier that links the supervisor with its configured processes. For efficiency, the supervisor doesn't register all workers itself, and since registration happens after forking, the supervisor doesn't know the registered process IDs of its supervised processes. The unique identifier is a name that gets randomly generated when the process is instantiated.

This made me realise I was reusing the configured process objects to start new processes, which is quite prone to issues with already created thread pools and stuff like that 😬 Because of this, this PR also changes the approach: the Configuration object now returns configured processes that need to be instantiated before starting, so a new object is created each time.
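To illustrate the "instantiate a new object each time" idea, here is a minimal sketch. It is not the code in this PR: `SketchWorker`, `ConfiguredProcess`, and the name format are hypothetical names picked for the example. The point is that the configuration only holds a recipe (class plus attributes), and every start builds a fresh instance with its own random name.

```ruby
require "securerandom"

# Hypothetical stand-in for a supervised process (worker or dispatcher).
class SketchWorker
  attr_reader :name, :options

  def initialize(**options)
    @options = options
    # Random name generated when the process object is instantiated,
    # so every instance gets its own identifier.
    @name = "worker-#{SecureRandom.hex(10)}"
  end
end

# Hypothetical object returned by the configuration: it stores how to
# build a process, never a live process instance.
class ConfiguredProcess
  def initialize(process_class, attributes)
    @process_class = process_class
    @attributes = attributes
  end

  # Builds a brand new process object on every call, so no thread pool
  # or other state is carried over from a previous run.
  def instantiate
    @process_class.new(**@attributes)
  end
end

configured = ConfiguredProcess.new(SketchWorker, { queues: "default", threads: 3 })

first  = configured.instantiate
second = configured.instantiate
```

Under these assumptions, restarting a dead worker means calling `instantiate` again rather than restarting the old object, and the supervisor can remember the new instance's `name` before forking.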

@rosa rosa force-pushed the fail-jobs-when-worker-is-killed branch 2 times, most recently from 9848dae to 80dbef5 Compare August 12, 2024 18:09
rosa added 4 commits August 21, 2024 15:36
So we can uniquely identify processes by supervisor and name, without
having to rely on the PID, which can be duplicated across processes.
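A small sketch of why the PID alone is ambiguous, for example when processes run on different hosts or in different containers. The record fields and values below are assumptions made up for the illustration, not the actual schema.

```ruby
# Hypothetical registry row; field names are illustrative only.
ProcessRecord = Struct.new(:supervisor_id, :name, :pid, :hostname, keyword_init: true)

registry = [
  ProcessRecord.new(supervisor_id: 1, name: "worker-a1", pid: 42, hostname: "host-1"),
  ProcessRecord.new(supervisor_id: 2, name: "worker-b2", pid: 42, hostname: "host-2")
]

# Looking up by PID matches both rows, since each host numbers its
# processes independently.
by_pid = registry.select { |record| record.pid == 42 }

# Looking up by supervisor and name matches exactly one row.
by_supervisor_and_name = registry.select do |record|
  record.supervisor_id == 1 && record.name == "worker-a1"
end
```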
We were reusing the instances of Worker and Dispatcher from the initial
configuration all the time, which could bring some problems with stopped
pools. Now that we need a name to be generated and be unique per process
instance, we really need to instantiate new processes every time they're
started.
This applies to:
- Killed workers that the supervisor detects as dead.
- Reaped workers without a clear exit status.
- Orphaned executions that somehow lost their worker.
- Workers whose heartbeat expired.
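As a rough sketch of the first two cases (not the implementation in this PR), a supervisor can inspect the exit status of a reaped child and fail whatever that worker still had claimed. The in-memory `claimed` hash and the worker name are hypothetical stand-ins for the real records. The example uses `fork`, so it assumes a Unix-like platform.

```ruby
# Hypothetical in-memory stand-in for claimed (in-progress) executions,
# keyed by the name of the worker that claimed them.
claimed = { "worker-a1" => [:job_1, :job_2] }
failed  = []

worker_name = "worker-a1"

pid = fork do
  # Simulate a worker that gets killed in the middle of its work.
  Process.kill("KILL", Process.pid)
  sleep # never reached in practice; guards against the child running on
end

# Reap the child and get its Process::Status.
_, status = Process.wait2(pid)

# Treat only a normal exit with status 0 as clean. A signaled process
# reports exited? as false, so it is handled as an abnormal exit.
clean_exit = status.exited? && status.success?

unless clean_exit
  # Fail the jobs the dead worker still had in progress.
  failed.concat(claimed.delete(worker_name) || [])
end
```

Orphaned executions and expired heartbeats cannot be seen from an exit status, so they would need a separate check that is not shown here.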
@rosa rosa force-pushed the fail-jobs-when-worker-is-killed branch 2 times, most recently from 69f30b4 to 3945042 Compare August 21, 2024 13:39
As it won't be possible to start new processes after the column
is made NOT NULL and before deploying the code that uses that column.
@rosa rosa force-pushed the fail-jobs-when-worker-is-killed branch from 3945042 to 76d2c0f Compare August 21, 2024 13:45
@rosa rosa merged commit 89d30c7 into main Aug 21, 2024
@rosa rosa deleted the fail-jobs-when-worker-is-killed branch August 21, 2024 14:21
rosa added a commit that referenced this pull request Nov 27, 2024
Closes #422. Thanks to @salmonsteak1 for spotting this.