Don't connect to cluster subprocesses at shutdown by gjoseph92 · Pull Request #6829 · dask/distributed

gjoseph92 · 2022-08-04T16:24:07Z

Closes #6828. See that issue for explanation.

The one thing we lose with this over the RPC is a clean close call; kill is more abrupt. If we'd like to keep the clean close, I can easily do that by sending SIGINT, waiting, and then sending SIGKILL only if necessary.

Tests added / passed
Passes pre-commit run --all-files

gjoseph92 · 2022-08-04T16:24:32Z

distributed/utils_test.py

    worker_kwargs=None,
    active_rpc_timeout=10,
-    disconnect_timeout=20,
+    shutdown_timeout=20,


The disconnect_timeout argument name was unused across the codebase, so renaming it seems fine.

gjoseph92 · 2022-08-04T16:25:31Z

distributed/utils_test.py

-                async def close():
-                    logger.debug("Closing out test cluster")
-                    alive_workers = [
-                        w["address"]
-                        for w in workers_by_pid.values()
-                        if w["proc"].is_alive()
-                    ]
-                    await disconnect_all(
-                        alive_workers,
-                        timeout=disconnect_timeout,
-                        rpc_kwargs=rpc_kwargs,
-                    )
-                    if scheduler.is_alive():
-                        await disconnect(
-                            saddr, timeout=disconnect_timeout, rpc_kwargs=rpc_kwargs
-                        )


The change here is that I deleted this entire finally section (and therefore removed the try block). The stack.callback(_kill_join, scheduler, shutdown_timeout) call accomplishes the equivalent without RPCs.
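As a rough sketch of the pattern described in this comment, cleanup can be registered on a `contextlib.ExitStack` as each process is started, so no `try`/`finally` is needed. The `_kill_join` helper has the same shape as the one in this PR; the `managed_processes` context manager is a hypothetical stand-in for the real test-cluster fixture, which does considerably more.

```python
import contextlib


def _kill_join(proc, timeout):
    # Same shape as the PR's helper: kill, then wait up to `timeout` seconds.
    proc.kill()
    proc.join(timeout)


@contextlib.contextmanager
def managed_processes(procs, shutdown_timeout=20):
    """Hypothetical stand-in for the test-cluster context manager."""
    with contextlib.ExitStack() as stack:
        for proc in procs:
            proc.start()
            # Callbacks run in reverse registration order when the stack
            # exits, including when the body of the `with` block raises.
            stack.callback(_kill_join, proc, shutdown_timeout)
        yield procs
```

Because the callback is registered immediately after `start()`, a failure while starting a later process still cleans up the ones already running.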

github-actions · 2022-08-04T18:13:08Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    15 files ±0       15 suites ±0    6h 53m 55s ⏱️ +13m 1s
 2 989 tests ±0    2 895 ✔️ -3       89 💤 ±0    4 ❌ +3    1 🔥 ±0
22 165 runs  -1   21 111 ✔️ -5    1 047 💤 -1    6 ❌ +5    1 🔥 ±0

For more details on these failures and errors, see this check.

Results for commit 993f420. ± Comparison against base commit 4af2d0a.

gjoseph92 · 2022-08-04T18:43:26Z

Failures look unrelated:

All "Flaky tests: OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s" (#6731).
test_repeated_restarts seems like a new flaky test, but it wouldn't be affected by this. I suspect "Ensure Nanny doesn't restart workers that fail to start, and joins subprocess" (#6427) would fix it.

graingert · 2022-08-05T17:08:14Z

distributed/utils_test.py

-    proc.join()
+def _kill_join(proc, timeout):
+    proc.kill()
+    proc.join(timeout)


Maybe rely on the pytest timeout?

Also maybe kill them all at the same time then join them all

Don't connect to cluster subprocesses at shutdown

07803f0

gjoseph92 requested review from fjetter and graingert August 4, 2022 16:24

gjoseph92 commented Aug 4, 2022

gjoseph92 mentioned this pull request Aug 4, 2022

Only set 5s connect timeout in gen_cluster tests #6822

Merged

1 task

remove unused disconnect functions

993f420

gjoseph92 self-assigned this Aug 4, 2022

gjoseph92 mentioned this pull request Aug 5, 2022

Flaky distributed/tests/test_nanny.py::test_repeated_restarts #6838

Open

graingert approved these changes Aug 5, 2022

graingert reviewed Aug 5, 2022

gjoseph92 merged commit e1f3779 into dask:main Aug 5, 2022

gjoseph92 deleted the cluster-contextmanager-no-rpc-at-end branch August 5, 2022 17:10

gjoseph92 mentioned this pull request Aug 5, 2022

Clean up cluster process reaping #6840

Merged

2 tasks

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022

Don't connect to cluster subprocesses at shutdown (dask#6829)

e99129f

gjoseph92 merged 2 commits into dask:main from gjoseph92:cluster-contextmanager-no-rpc-at-end