Fix resource allocation for tasks with dependencies by hendrikmakait · Pull Request #6676 · dask/distributed

hendrikmakait · 2022-07-06T08:46:08Z

Closes #6663
Blocked by

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2022-07-06T09:36:53Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    15 files ±0          15 suites ±0     5h 48m 12s ⏱️ - 47m 16s
 2 950 tests +11      2 861 ✔️ +10      85 💤 +1    3 ❌ -1    1 🔥 +1
21 047 runs  -734    20 057 ✔️ -727    986 💤 -7    3 ❌ -1    1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 086178b. ± Comparison against base commit 40867c7.

♻️ This comment has been updated with latest results.

hendrikmakait · 2022-07-06T10:19:54Z

distributed/worker_state_machine.py

            assert next_state in {"waiting", "fetch"}, next_state
            assert ts._previous in {"executing", "flight"}, ts._previous

+            if ts._previous == "executing":


TODO: Write test that ensures that we do in fact release IFF the task is finished.

This and line 1888 are missing the use case of 'long-running'. I'll open a follow-up PR about it, no point adding scope creep here.

I'd like to have the test you mention within this PR please.

ts._previous should never happen to begin with in this method. I'll clean up the code in here in a separate PR.

xref: #6693

crusaderky · 2022-07-06T17:17:44Z

Please merge from main. Also, there are failing tests.

hendrikmakait

Rebased onto the latest master to reduce the diff. Two tests have been XFAILed to be tackled in independent PRs. The corresponding issues are #6565 and #6682.

hendrikmakait · 2022-07-06T18:51:07Z

distributed/tests/test_worker_state_machine.py

+    assert ws.available_resources == {"R": 1}
+
+
+@pytest.mark.xfail(reason="distributed#6565")


XFAIL to be tackled in a follow-up PR. #6565 is already tracking this issue.

This is already green

Good catch, somehow missed that.

hendrikmakait · 2022-07-06T18:51:33Z

distributed/tests/test_worker_state_machine.py

+    assert ws.available_resources == {"R": 1}
+
+
+@pytest.mark.xfail(reason="distributed#6682")


XFAIL to be tackled in a follow-up PR. #6682 has been created to tackle this issue.

hendrikmakait · 2022-07-07T09:47:14Z

Failing tests are known flakes.

distributed/worker_state_machine.py

distributed/tests/test_worker_state_machine.py

crusaderky · 2022-07-07T12:41:11Z

distributed/tests/test_worker_state_machine.py

+            stimulus_id="compute",
+        )
+    )
+    assert ws.tasks["x"].state == "resumed"


Could you review this?
From reading the code (see _transition_from_resumed), "resumed" should be exclusively on one of the following loops:

executing or long-running -> cancelled -> fetch

flight -> cancelled -> waiting

In the executing -> cancelled -> waiting loop that you implemented here, I expect ts.state to be 'executing'.

This test does not create an executing -> cancelled -> waiting loop, but an executing -> cancelled -> fetch loop by cancelling x and then gathering it as a dependency to y.
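The loops discussed in this thread can be sketched as a small lookup. This is only a toy model of the rule as the reviewer describes it, not the code in `_transition_from_resumed`; the function name and the `RESUMED_LOOPS` table are hypothetical, and the fallback of reverting to the previous state for every other combination is an assumption generalised from the one case the reviewer states explicitly (executing -> cancelled -> waiting).

```python
from __future__ import annotations

# Toy model only: (previous state, newly recommended state) pairs that,
# per the review comment above, put a cancelled task into "resumed".
RESUMED_LOOPS = {
    ("executing", "fetch"),
    ("long-running", "fetch"),
    ("flight", "waiting"),
}


def state_after_cancel_then_recommend(previous: str, recommended: str) -> str:
    """Return the state a cancelled task would land in (hypothetical helper).

    `previous` is what the task was doing before it was cancelled,
    `recommended` is what the scheduler now asks for.
    """
    if (previous, recommended) in RESUMED_LOOPS:
        # The new request contradicts the work still in progress.
        return "resumed"
    # Assumption: the request matches the work in progress, so the task
    # simply goes back to its previous state.
    return previous


# The test under review cancels x and then gathers it as a dependency of y.
assert state_after_cancel_then_recommend("executing", "fetch") == "resumed"
# The loop the reviewer initially had in mind.
assert state_after_cancel_then_recommend("executing", "waiting") == "executing"
```

Read this way, both comments agree: the test asserts "resumed" because it exercises the fetch loop, not the waiting loop.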

crusaderky

.

crusaderky · 2022-07-07T12:43:57Z

It's probably a good idea to park this momentarily and write two separate PRs:

all_running_tasks plus unit tests
dummy methods for the two events

hendrikmakait · 2022-07-08T09:34:45Z

distributed/utils_test.py

            SecedeEvent(key="x", compute_duration=1.0, stimulus_id="secede")
        )
    assert ws.tasks["x"].state == request.param
-    assert ws.available_resources == {"R": 0}


Moved this into the dedicated test_ws_with_running_task

hendrikmakait · 2022-07-08T09:37:28Z

distributed/tests/test_worker_state_machine.py



+@pytest.mark.parametrize("state", ["executing", "long-running"])
+def test_running_constrained_task_acquires_resources(state, ws):


Duplicating logic from ws_with_running_task and test_ws_with_running_task to have an explicit test focused on resource restrictions that is resilient to changes to those functions.

hendrikmakait · 2022-07-08T10:00:48Z

CI issues are being caused by #6692. I'd move _validate_resources to a separate PR that will be blocked by #6692 in order to unblock this PR. Thoughts, @crusaderky?

…date_state

…sk to _transition_waiting_ready

Co-authored-by: crusaderky <crusaderky@gmail.com>

crusaderky · 2022-07-09T11:00:07Z

distributed/tests/test_worker_state_machine.py

+    assert ws.available_resources == {"R": 1}
+
+    instructions = ws.handle_stimulus(
+        GatherDepSuccessEvent("gather-dep-done", "127.0.0.1:1235", 8, {"x": 1.0})


This is unreadable, please never go full-positional
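A keyword-only version of the call might look like the sketch below. The dataclass is a hypothetical stand-in, not the real `GatherDepSuccessEvent`; its field names and their order (`stimulus_id`, `worker`, `total_nbytes`, `data`) are assumptions inferred from the positional values in the diff and should be checked against `worker_state_machine.py`.

```python
from dataclasses import dataclass, field


@dataclass
class GatherDepSuccessEventSketch:
    # Hypothetical stand-in: field names and order are assumed, not
    # copied from distributed.
    stimulus_id: str
    worker: str
    total_nbytes: int
    data: dict = field(repr=False)


# Full-positional, as in the diff: the reader has to know the field order.
positional = GatherDepSuccessEventSketch(
    "gather-dep-done", "127.0.0.1:1235", 8, {"x": 1.0}
)

# Keyword arguments: each value is labelled at the call site.
keyword = GatherDepSuccessEventSketch(
    worker="127.0.0.1:1235",
    total_nbytes=8,
    data={"x": 1.0},
    stimulus_id="gather-dep-done",
)

assert positional == keyword
```

With keywords, the bare `8` is visibly a byte count rather than, say, a port or a task count, and the call keeps working if fields are ever reordered.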

crusaderky · 2022-07-09T11:44:31Z

distributed/worker_state_machine.py

            if ts._previous in ("executing", "long-running"):
                self._release_resources(ts)
+                self.executing.discard(ts)
+                self.long_running.discard(ts)


This is actually superfluous as we're invoking purge_state afterwards. But I think it's a good idea to have it for cleanliness' sake.
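The bookkeeping in this hunk can be illustrated with a toy ledger. This is a minimal sketch under stated assumptions, not the `WorkerState` implementation: `ToyTask`, `ToyWorkerState`, and the methods `start`, `cancel`, and `finish_cancelled` are hypothetical names, and only the guard on the previous state plus the two `discard` calls mirror the diff.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps instances hashable, so they fit in sets
class ToyTask:
    key: str
    resource_restrictions: dict[str, float] = field(default_factory=dict)
    state: str = "waiting"
    previous: str | None = None


class ToyWorkerState:
    """Hypothetical, heavily simplified model of resource accounting."""

    def __init__(self, resources: dict[str, float]) -> None:
        self.available_resources = dict(resources)
        self.executing: set[ToyTask] = set()
        self.long_running: set[ToyTask] = set()

    def start(self, ts: ToyTask) -> None:
        # Acquire resources when the task starts executing.
        for name, amount in ts.resource_restrictions.items():
            self.available_resources[name] -= amount
        ts.state = "executing"
        self.executing.add(ts)

    def cancel(self, ts: ToyTask) -> None:
        # The task keeps running in the background, so nothing is released.
        ts.previous = ts.state
        ts.state = "cancelled"

    def finish_cancelled(self, ts: ToyTask) -> None:
        # Release only if the task was actually holding resources.
        if ts.previous in ("executing", "long-running"):
            for name, amount in ts.resource_restrictions.items():
                self.available_resources[name] += amount
            self.executing.discard(ts)
            self.long_running.discard(ts)
        ts.state = "released"
        ts.previous = None


ws = ToyWorkerState({"R": 1})
x = ToyTask("x", resource_restrictions={"R": 1})
ws.start(x)
assert ws.available_resources == {"R": 0}
ws.cancel(x)
assert ws.available_resources == {"R": 0}  # still held while cancelled
ws.finish_cancelled(x)
assert ws.available_resources == {"R": 1}
```

Using `discard` rather than `remove` means the cleanup is safe whichever of the two sets the task was in, which fits the point that the calls are there for tidiness.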

crusaderky · 2022-07-09T11:47:40Z

See code review: 74defe4
There's an xfail in test_resumed_task_releases_resources:

E       AssertionError: assert {'R': 0} == {'R': 1}
E         Differing items:
E         {'R': 0} != {'R': 1}
E         Full diff:
E         - {'R': 1}
E         ?       ^
E         + {'R': 0}
E         ?       ^

The culprits are, again, #6689 + #6693.
_transition_from_resumed is not hit in case of error, so the change in this PR does not have the effect you hoped.

I'm going to merge the PR as-is, but that means it doesn't close #6663.
I'll resolve #6689, #6693, and #6663 together in a single PR before I leave.

hendrikmakait · 2022-07-09T12:06:01Z

A separate issue, #6682, is already open for resumed tasks that fail. This PR should fix the reproducer in #6663 (it works just fine on my machine), so I'd close #6663 on merge and keep #6682 open.

hendrikmakait · 2022-07-09T12:07:08Z

@crusaderky: Thanks for the thorough review!

hendrikmakait force-pushed the fix-resource-over-allocation branch from 6a589a5 to e238af1 Compare July 6, 2022 10:16

hendrikmakait force-pushed the fix-resource-over-allocation branch from 472f62b to 60b0fbb Compare July 6, 2022 14:15

hendrikmakait requested a review from crusaderky July 6, 2022 14:16

hendrikmakait self-assigned this Jul 6, 2022

hendrikmakait force-pushed the fix-resource-over-allocation branch from 2796a56 to 25e2e01 Compare July 6, 2022 18:37

hendrikmakait mentioned this pull request Jul 6, 2022

Resumed tasks don't release resources if they fail #6682

Closed

hendrikmakait marked this pull request as ready for review July 6, 2022 18:53

hendrikmakait changed the title ~~Fix resource allocation to tasks~~ Fix resource allocation for tasks with dependencies Jul 6, 2022