Send SIGKILL after SIGTERM when passing 95% memory by crusaderky · Pull Request #6419 · dask/distributed

crusaderky · 2022-05-23T14:19:07Z

fjetter · 2022-05-23T14:32:47Z

distributed/worker_memory.py

-        self._last_terminated_pid = -1

-        if memory / self.memory_limit > self.memory_terminate_fraction:
+        if self._last_terminated_pid != process.pid:


The default interval is 100ms. Isn't this a bit short for us to escalate to kill?

Not really - that's 100ms worth of extra leaking from any task that is currently running.

Note that, to the best of my understanding, no cleanup code whatsoever runs on SIGTERM by default. So it should be immediate, unless the user tampered with the signal handlers - which they could legitimately do for their own cleanup code. A realistic use case is a database library that installs a SIGTERM handler to cleanly shut down its sockets.

In this case, I argue that prompt termination of a process that is close to exceeding 100% of its memory limit is more important than a clean teardown.
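The escalation argued for here can be sketched roughly as follows. This is a minimal illustration, not the code in `distributed/worker_memory.py`: `TerminateEscalator`, `maybe_terminate`, and `demo` are hypothetical names, the 0.95 default mirrors the 95% in the PR title, and `subprocess.Popen` stands in for the worker process handle. It assumes a POSIX system, where `Popen.terminate()` sends SIGTERM and `Popen.kill()` sends SIGKILL.

```python
import signal
import subprocess
import sys
import time


class TerminateEscalator:
    """Hypothetical helper: SIGTERM the first time a PID is over the
    threshold, SIGKILL if the same PID is still alive on a later check."""

    def __init__(self):
        self._last_terminated_pid = -1

    def maybe_terminate(self, process, memory, memory_limit, terminate_fraction=0.95):
        if memory / memory_limit <= terminate_fraction:
            return "ok"
        if self._last_terminated_pid != process.pid:
            # First offence for this PID: ask politely.
            self._last_terminated_pid = process.pid
            process.terminate()  # SIGTERM on POSIX
            return "terminate"
        if process.poll() is None:
            # Same PID, already sent SIGTERM, still running: escalate.
            process.kill()  # SIGKILL on POSIX
            return "kill"
        return "exited"


def demo():
    # The child ignores SIGTERM, like a process with a slow or stuck handler.
    child = subprocess.Popen(
        [
            sys.executable,
            "-c",
            "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); "
            "print('ready', flush=True); time.sleep(60)",
        ],
        stdout=subprocess.PIPE,
    )
    child.stdout.readline()  # wait until the child has installed SIG_IGN
    escalator = TerminateEscalator()
    first = escalator.maybe_terminate(child, memory=96, memory_limit=100)
    time.sleep(0.1)  # the 100ms monitor interval discussed above
    second = escalator.maybe_terminate(child, memory=96, memory_limit=100)
    returncode = child.wait(timeout=10)
    child.stdout.close()
    return first, second, returncode == -signal.SIGKILL
```

Keying the escalation on the PID means a freshly restarted worker gets a SIGTERM first, rather than inheriting the previous process's strike.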

We do install our own signal handlers and close workers/nannies, e.g.

distributed/distributed/_signals.py

Line 11 in 7665eaa

async def wait_for_signals(signals: list[signal.Signals]) -> None:

distributed/distributed/cli/dask_worker.py

Lines 479 to 487 in 7665eaa

    async def wait_for_signals_and_close():
        """Wait for SIGINT or SIGTERM and close all nannies upon receiving one of those signals"""
        nonlocal signal_fired
        await wait_for_signals([signal.SIGINT, signal.SIGTERM])

        signal_fired = True
        if nanny:
            # Unregister all workers from scheduler
            await asyncio.gather(*(n.close(timeout=10) for n in nannies))
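For readers unfamiliar with the pattern, a coroutine that waits for the first of several signals could look roughly like the sketch below. This is not the implementation in `distributed/_signals.py`; `wait_for_signals_sketch` is a hypothetical name, and the sketch assumes a Unix event loop running in the main thread, where `loop.add_signal_handler` is available.

```python
import asyncio
import signal


async def wait_for_signals_sketch(signals):
    """Block until one of `signals` is received and return its number."""
    loop = asyncio.get_running_loop()
    event = asyncio.Event()
    received = []

    def on_signal(signum):
        # Runs inside the event loop, so touching the Event is safe here.
        received.append(signum)
        event.set()

    for sig in signals:
        loop.add_signal_handler(sig, on_signal, sig)
    try:
        await event.wait()
    finally:
        # Always unregister, so later signals fall back to the prior behaviour.
        for sig in signals:
            loop.remove_signal_handler(sig)
    return received[0]
```

A caller would typically `await wait_for_signals_sketch([signal.SIGINT, signal.SIGTERM])` and then run its shutdown code, as `wait_for_signals_and_close` does above.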

I know. This doesn't exclude third party handlers.
Typical pattern:

    import signal

    def my_handler(signo, frame):
        # TODO do your thing
        prev(signo, frame)

    # signal.signal() returns the previously installed handler
    prev = signal.signal(signal.SIGTERM, my_handler)
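One caveat with this pattern: `signal.signal()` may return `signal.SIG_DFL` or `signal.SIG_IGN` instead of a function, and neither is callable. A more defensive variant might look like the sketch below; `chain_handler` and `cleanup` are hypothetical names, and re-raising the signal under the default disposition is one possible choice, not something the thread prescribes.

```python
import os
import signal


def chain_handler(signum, cleanup):
    """Install `cleanup` for `signum`, then defer to whatever was installed before."""
    prev = signal.getsignal(signum)

    def handler(signo, frame):
        cleanup(signo, frame)
        if callable(prev):
            # A previous Python-level handler exists: chain to it.
            prev(signo, frame)
        elif prev == signal.SIG_DFL:
            # Restore the default action and re-deliver the signal to ourselves.
            signal.signal(signo, signal.SIG_DFL)
            os.kill(os.getpid(), signo)
        # SIG_IGN or None: nothing further to do.

    signal.signal(signum, handler)
    return prev
```

However long `cleanup` takes is time during which SIGTERM has not yet stopped the process, which is the window the SIGKILL escalation is meant to close.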

From what I understand, this should not be a problem since we install the handler on the parent process (i.e., the nanny itself), so it shouldn't be triggered by a SIGTERM to the worker child process.
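A quick way to check this claim on a POSIX system is sketched below. It is only an approximation of the nanny/worker setup: `subprocess.Popen` stands in for the worker child process, and `demo_parent_handler_not_triggered` is a hypothetical name. A child created with a fork start method and no `exec` would inherit the parent's signal dispositions, so the result depends on how the child is started.

```python
import signal
import subprocess
import sys


def demo_parent_handler_not_triggered():
    parent_hits = []
    # Install a SIGTERM handler in the parent, as the nanny process would.
    prev = signal.signal(signal.SIGTERM, lambda signo, frame: parent_hits.append(signo))
    try:
        # The child is a fresh interpreter with the default SIGTERM disposition.
        child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        child.terminate()  # SIGTERM goes to the child's PID only
        returncode = child.wait(timeout=10)
    finally:
        # Put the parent's original handler back.
        signal.signal(signal.SIGTERM, prev)
    return parent_hits, returncode
```

On POSIX, a negative return code from `Popen.wait()` is the number of the signal that ended the child, so a result of `-signal.SIGTERM` with an empty `parent_hits` means the signal reached the child and not the parent's handler.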

github-actions · 2022-05-23T15:20:02Z

Unit Test Results

    15 files ±0        15 suites ±0    6h 58m 30s ⏱️ +14m 13s
 2 808 tests +2     2 728 ✔️ +3      79 💤 -1    1 ❌ ±0
20 818 runs  +15   19 884 ✔️ +10    933 💤 +5    1 ❌ ±0

For more details on these failures, see this check.

Results for commit 1042953. ± Comparison against base commit 97a7eb6.

♻️ This comment has been updated with latest results.

crusaderky · 2022-05-23T18:46:49Z

Ready for review and merge

hendrikmakait

LGTM!

hendrikmakait · 2022-05-24T15:21:10Z


From what I understand, this should not be a problem since we install the handler on the parent process (i.e., the nanny itself), so it shouldn't be triggered by a SIGTERM to the worker child process.

Send SIGKILL if the worker ignores SIGTERM

bbd0e6b

Closes dask#6373

crusaderky self-assigned this May 23, 2022

crusaderky added the memory label May 23, 2022

crusaderky changed the title ~~Send SIGKILL if the worker ignores SIGTERM~~ Send SIGKILL after SIGTERM when passing 95% memory May 23, 2022

fjetter reviewed May 23, 2022


crusaderky marked this pull request as draft May 23, 2022 15:03

fix regression

1042953

crusaderky marked this pull request as ready for review May 23, 2022 18:46

hendrikmakait approved these changes May 24, 2022


crusaderky merged commit ba39915 into dask:main May 25, 2022

crusaderky deleted the slow_terminate branch May 25, 2022 09:27

crusaderky merged 2 commits into dask:main from crusaderky:slow_terminate


3 participants
