Send SIGKILL after SIGTERM when passing 95% memory#6419
Send SIGKILL after SIGTERM when passing 95% memory#6419crusaderky merged 2 commits intodask:mainfrom
Conversation
| self._last_terminated_pid = -1 | ||
|
|
||
| if memory / self.memory_limit > self.memory_terminate_fraction: | ||
| if self._last_terminated_pid != process.pid: |
There was a problem hiding this comment.
The default interval is 100ms. Isn't this a bit short for us to escalate to kill?
There was a problem hiding this comment.
Not really - that's 100ms worth of extra leaking from any task that is currently running.
Note that, to the best of my understanding, no cleanup code whatsoever runs on SIGTERM by default. So it should be immediate, unless the user tampered with the signal handlers - which they could legitimately do for their own cleanup code. A realistic use case is a database library that installs a SIGTERM handler to cleanly shut down its sockets.
In this case, I argue that prompt termination for a process that is close to go beyond 100% is more important the clean teardown.
There was a problem hiding this comment.
We do install our own signal handlers and close workers/nannies, e.g.
distributed/distributed/_signals.py
Line 11 in 7665eaa
distributed/distributed/cli/dask_worker.py
Lines 479 to 487 in 7665eaa
There was a problem hiding this comment.
I know. This doesn't exclude third party handlers.
Typical pattern:
def my_handler(signo, frame):
# TODO do your thing
prev(signo, frame)
prev = signal.signal(signal.SIGTERM, my_handler)There was a problem hiding this comment.
From what I understand, this should not be a problem since we install the handler on the parent process (i.e., the nanny itself), so it shouldn't be triggered by a SIGTERM to the worker child process.
Unit Test Results 15 files ± 0 15 suites ±0 6h 58m 30s ⏱️ + 14m 13s For more details on these failures, see this check. Results for commit 1042953. ± Comparison against base commit 97a7eb6. ♻️ This comment has been updated with latest results. |
|
Ready for review and merge |
| self._last_terminated_pid = -1 | ||
|
|
||
| if memory / self.memory_limit > self.memory_terminate_fraction: | ||
| if self._last_terminated_pid != process.pid: |
There was a problem hiding this comment.
From what I understand, this should not be a problem since we install the handler on the parent process (i.e., the nanny itself), so it shouldn't be triggered by a SIGTERM to the worker child process.
Closes #6373
CC @hendrikmakait