Ensure resumed tasks are not accidentally forgotten by fjetter · Pull Request #6217 · dask/distributed

fjetter · 2022-04-27T11:29:40Z

closes #6194

Tests added / passed
Passes pre-commit run --all-files

Three changes in this PR

Fix an erroneous resumed->released transition. Previously it released the key outright; it should instead transition to cancelled if the task is still being processed.
Remove a redundant cancelled->resumed transition, which makes the logs much less verbose and easier to read. This transition was only an indirection, so I inlined the code at the two places where it matters. The transition log now reads as expected in these situations.
Ensure that ts._next is not set for cancelled tasks. Cancelled tasks should always transition to released once they are done.
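
To make the first and third points concrete, here is a minimal toy sketch, not the worker code itself. `ToyTask` and the free function below are hypothetical stand-ins. The branch for a task that is already done is an assumption, based on the statement above that the task should only go to cancelled while it is still being processed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTask:
    # Hypothetical stand-in for the worker's TaskState
    key: str
    state: str = "resumed"
    done: bool = False           # True once execute/gather has finished
    _next: Optional[str] = None  # state to enter after the in-flight work ends


def transition_resumed_released(ts: ToyTask) -> dict:
    """Toy model of the corrected resumed->released transition."""
    if not ts.done:
        # Work is still in flight: park the task in "cancelled" and make
        # sure no follow-up state is remembered.
        ts.state = "cancelled"
        ts._next = None
        return {}
    # Assumption: nothing is in flight anymore, so the key can be released.
    ts.state = "released"
    ts._next = None
    return {}


in_flight = ToyTask("x", state="resumed", _next="fetch")
transition_resumed_released(in_flight)
print(in_flight.state, in_flight._next)  # cancelled None
```

The point of the sketch is only that a task with in-flight work is never dropped directly and never keeps a stale `_next`.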

distributed/utils_test.py

fjetter · 2022-04-27T12:08:03Z

The windows test failure is a known issue

#6147

File "D:\a\distributed\distributed\distributed\worker.py", line 4053, in validate_task

    self.validate_task_fetch(ts)

  File "D:\a\distributed\distributed\distributed\worker.py", line 3995, in validate_task_fetch

    assert ts.who_has

AssertionError

2022-04-27 11:54:02,400 - distributed.core - ERROR - Invalid TaskState encountered for <TaskState "('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)" fetch>.

Story:

[("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'compute-task', 'compute-task-1651060440.6118023', 1651060440.640618), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'released', 'waiting', 'waiting', {"('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)": 'ready'}, 'compute-task-1651060440.6118023', 1651060440.6406472), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'waiting', 'ready', 'ready', {}, 'compute-task-1651060440.6118023', 1651060440.6406634), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'ready', 'executing', 'executing', {}, 'task-finished-1651060440.648988', 1651060440.6492662), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'put-in-memory', 'task-finished-1651060440.6506114', 1651060440.6508203), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'executing', 'memory', 'memory', {"('arange-sum-59ee542fb6a8f7c418bd1ceadfe68d11', 8)": 'executing'}, 'task-finished-1651060440.6506114', 1651060440.65087), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'ensure-task-exists', 'memory', 'compute-task-1651060440.6795359', 1651060440.705263), ('free-keys', ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)",), 'task-finished-1651060440.7063837', 1651060440.743776), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'release-key', 'task-finished-1651060440.7063837', 1651060440.7437923), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'memory', 'released', 'released', {}, 'task-finished-1651060440.7063837', 1651060440.7438428), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'released', 'forgotten', 'forgotten', {}, 'client-releases-keys-1651060442.0950112', 1651060442.1340976), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 
8)", 'compute-task', 'compute-task-1651060442.1040328', 1651060442.1347113), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'released', 'waiting', 'waiting', {"('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)": 'ready'}, 'compute-task-1651060442.1040328', 1651060442.1347415), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'waiting', 'ready', 'ready', {}, 'compute-task-1651060442.1040328', 1651060442.1347587), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'ready', 'executing', 'executing', {}, 'task-finished-1651060442.1677341', 1651060442.1682491), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'put-in-memory', 'task-finished-1651060442.169614', 1651060442.1697793), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'executing', 'memory', 'memory', {"('arange-sum-b267f04bd650ea06d6abf5234529c789', 8)": 'executing'}, 'task-finished-1651060442.169614', 1651060442.1698287), ('free-keys', ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)",), 'client-releases-keys-1651060442.290155', 1651060442.3110044), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'release-key', 'client-releases-keys-1651060442.290155', 1651060442.3110235), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'memory', 'released', 'released', {"('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)": 'forgotten'}, 'client-releases-keys-1651060442.290155', 1651060442.311108), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'released', 'forgotten', 'forgotten', {}, 'client-releases-keys-1651060442.290155', 1651060442.3111255), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'compute-task', 'compute-task-1651060442.3011189', 1651060442.3126717), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'released', 'waiting', 'waiting', {"('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)": 'ready'}, 'compute-task-1651060442.3011189', 1651060442.3127015), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'waiting', 'ready', 'ready', {}, 
'compute-task-1651060442.3011189', 1651060442.3127184), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'ensure-task-exists', 'ready', 'compute-task-1651060442.3024514', 1651060442.3128352), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'release-key', 'compute-task-1651060442.3024514', 1651060442.3128846), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'ready', 'released', 'released', {}, 'compute-task-1651060442.3024514', 1651060442.3129065), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'released', 'fetch', 'fetch', {}, 'compute-task-1651060442.3024514', 1651060442.312925), ("('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)", 'ready', 'fetch', 'fetch', {}, 'compute-task-1651060442.3024514', 1651060442.31293)]
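
A story like the one above is easier to read with a small helper. The following is only a sketch under an assumption about the log layout: entries with seven fields are taken to be transition records of the form (key, start, requested finish, actual finish, recommendations, stimulus_id, timestamp), and everything else is skipped. `find_discontinuities` is a hypothetical helper, not part of distributed. The sample story is shortened and uses a made-up key and timestamps, but its last three records mirror the end of the story above.

```python
def find_discontinuities(story):
    """Return (index, expected_start, record) for every transition record
    whose start state differs from the state the previous record ended in."""
    problems = []
    last_state = {}  # key -> state after the most recent transition
    for i, entry in enumerate(story):
        if len(entry) != 7:
            continue  # not a transition record, e.g. 'compute-task' or 'free-keys'
        key, start, _requested, actual, _recs, _stimulus_id, _timestamp = entry
        previous = last_state.get(key)
        # A forgotten task is recreated from scratch, so any start is fine.
        if previous not in (None, "forgotten") and previous != start:
            problems.append((i, previous, entry))
        last_state[key] = actual
    return problems


story = [
    ("x", "compute-task", "compute-task-1", 1.0),
    ("x", "released", "waiting", "waiting", {"x": "ready"}, "compute-task-1", 1.1),
    ("x", "waiting", "ready", "ready", {}, "compute-task-1", 1.2),
    ("x", "ready", "released", "released", {}, "compute-task-2", 1.3),
    ("x", "released", "fetch", "fetch", {}, "compute-task-2", 1.4),
    ("x", "ready", "fetch", "fetch", {}, "compute-task-2", 1.5),
]

problems = find_discontinuities(story)
for index, expected, record in problems:
    print(f"entry {index}: starts in {record[1]!r}, expected {expected!r}")
# entry 5: starts in 'ready', expected 'fetch'
```

Applied to the full story, the same check would flag the final record, which starts from 'ready' although the preceding record left the task in 'fetch'.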

github-actions · 2022-04-27T13:29:59Z

Unit Test Results

      16 files ±  0       16 suites ±0 7h 23m 52s ⏱️ - 9m 40s
  2 760 tests +  3   2 680 ✔️ +  5     78 💤 - 1 2 ❌ - 1
22 042 runs +24 21 023 ✔️ +25 1 017 💤 +1 2 ❌ - 2

For more details on these failures, see this check.

Results for commit 1e23e1e. ± Comparison against base commit 2286896.

♻️ This comment has been updated with latest results.

mrocklin · 2022-04-29T14:28:35Z

I took a brief look at this. No objection from me, but I didn't dive deeply into the logic. If tests pass and you're feeling confident about the added value @fjetter I think that it's ok to merge. If you can get someone like @crusaderky or @gjoseph92 to take a look that would be better of course.

gjoseph92

I've never understood the flow for cancelled and released tasks, so I don't think I'm in a good position to review this. I find these states incredibly confusing. I'm looking forward to this hopefully making more sense with #5895.

distributed/worker.py

gjoseph92 · 2022-04-29T22:10:55Z

distributed/worker.py

+        if not ts.done:
+            ts.state = "cancelled"
+            ts._next = None
+            return {}, []


What's going to eventually pick this up and move it out of cancelled if there are no recommendations and no next?

TL;DR: The task moves out of cancelled once ts.done = True is set, i.e. once execute/flight is done.

ts._next should never have been set for cancelled. When I implemented cancelled/resumed states I made a few mistakes. The only relevant next state for cancelled is released. That's the entire point of the state. The worker was instructed to release a key but it can't because it is "stuck" waiting for something to finish, i.e. either the execution thread or the gather_data coroutine.
Once execution/gather finishes, they'll recommend a transition, e.g. upon success they'll recommend a transition to memory. For example cancelled->memory will ensure the key is released.

Why is this logic not directly implemented as part of the gather_dep/execute result parser? Well, have a look at the code there. Particularly the gather_dep result parser/finally clause is the most frequent source of deadlocks because the logic just blows up.
The design philosophy behind this is to break a big, complex decision up into many small decisions that can each be made using local context information.

Consider the following example

T1 was instructed to be computed

T1 is dispatched to the threadpool

T1 is requested to be released

T1 finishes

The result handling, i.e. what happens once it finishes, could be implemented as

    if result == "success":
        if ts.not_cancelled:
            put_key_in_memory()
        else:
            release_key()
    else:
        assert result == "failed"
        if ts.not_cancelled:
            if ts.asked_to_be_fetched_instead:
                # Whether this is a valid thing for the scheduler to ask is out
                # of scope for this comment. It happens/happened
                reschedule_to_fetch_key()
            else:
                put_key_in_memory()
        else:
            release_key()

With this transition system, it instead becomes

    # executing result parser
    # This only requires local context, the decision should be simple, straight forward
    if result == "success":
        recommend_memory()
    else:
        assert result == "failed"
        recommend_error()

    def transition_cancelled_error(...):
        assert stuff
        release_task()

    def transition_cancelled_memory(...):
        assert stuff
        release_task()

This decision tree is a bit more complex for gathering keys. I'm not 100% convinced anymore if this is the right approach but here we are right now. The recent refactoring will allow us mid-term to move away from this if we choose to do so.
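
For readers who want to run the T1 example, here is a self-contained toy version of the transition-table approach. It is a sketch under stated assumptions: `ToyWorker`, `ToyTask`, and all method names are hypothetical, the table covers only the states needed for the example, and recommendations are reduced to direct calls.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    key: str
    state: str = "executing"


class ToyWorker:
    def __init__(self):
        self.data = {}
        # (start, finish) -> handler; each handler decides with local context
        self.table = {
            ("executing", "memory"): self.executing_memory,
            ("executing", "error"): self.executing_error,
            ("executing", "released"): self.executing_released,
            ("cancelled", "memory"): self.cancelled_done,
            ("cancelled", "error"): self.cancelled_done,
        }

    def transition(self, ts, finish, value=None):
        self.table[(ts.state, finish)](ts, value)

    def on_execute_done(self, ts, result, value=None):
        # Result parser: knows nothing about cancellation
        finish = "memory" if result == "success" else "error"
        self.transition(ts, finish, value)

    def executing_memory(self, ts, value):
        self.data[ts.key] = value
        ts.state = "memory"

    def executing_error(self, ts, value):
        ts.state = "error"

    def executing_released(self, ts, value):
        # The thread is still running, so the task can only be parked
        ts.state = "cancelled"

    def cancelled_done(self, ts, value):
        # Whatever the outcome, a cancelled task ends up released
        ts.state = "released"


worker = ToyWorker()
t1 = ToyTask("T1")                               # T1 is dispatched
worker.transition(t1, "released")                # T1 is requested to be released
worker.on_execute_done(t1, "success", value=42)  # T1 finishes
print(t1.state, "T1" in worker.data)  # released False
```

The result parser stays a two-way branch, and the cancellation handling lives entirely in the handlers keyed on the cancelled state.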

distributed/tests/test_cancelled_state.py

distributed/worker.py

fjetter · 2022-05-05T15:56:06Z

I believe at least some of the test failures relate to #5910

distributed/worker.py

crusaderky · 2022-05-05T22:32:48Z

distributed/worker.py

+        # We'll ignore instructions, i.e. we choose to not submit the failure
+        # message to the scheduler since from the schedulers POV it already
+        # released this task
+        recs, _ = self.transition_executing_error(


All transitions from executing call ensure_computing.
This should deadlock the worker if there are any tasks in ready state.

I added a test for this. This does not deadlock since the transition generates a recommendation. Only after acting on that recommendation will we get an instruction.

crusaderky

transition_cancelled_error is accidentally dropping Execute instructions

distributed/worker.py

fjetter · 2022-05-05T22:48:07Z

distributed/tests/test_cancelled_state.py

+    # Queue up another task to ensure this is not affected by our error handling
+    fut2 = c.submit(inc, 1)
+    await wait_for_state(fut2.key, "ready", w)


@crusaderky this triggers the condition you are concerned about in the transition function about dropped instructions.

The task is queued up and we'll receive a recommendation. The only instruction at that point is the TaskErred message.
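
The argument can be modeled with a toy loop. This is a sketch of the reasoning in this reply only, not a demonstration of how the real worker behaves. All function names and the tuple-based instruction format are hypothetical, and the assumption is that every transition returns a pair of (recommendations, instructions).

```python
from collections import deque


def transition_executing_error(key, ready):
    """Toy transition: report the error and recommend the next ready task."""
    recommendations = {ready[0]: "executing"} if ready else {}
    instructions = [("task-erred", key)]
    return recommendations, instructions


def transition_cancelled_error(key, ready):
    # Keep the recommendations, drop the instructions: in this model the
    # only instruction at this point is the task-erred message.
    recommendations, _ = transition_executing_error(key, ready)
    return recommendations, []


def transition_ready_executing(key, ready):
    ready.remove(key)
    return {}, [("execute", key)]


def run(cancelled_key, ready):
    """Apply the first transition, then act on recommendations until none remain."""
    instructions = []
    recs, instr = transition_cancelled_error(cancelled_key, ready)
    instructions += instr
    pending = deque(recs.items())
    while pending:
        key, finish = pending.popleft()
        assert finish == "executing"
        recs, instr = transition_ready_executing(key, ready)
        instructions += instr
        pending.extend(recs.items())
    return instructions


print(run("fut1", ["fut2"]))  # [('execute', 'fut2')]
```

In this model the queued task still receives its execute instruction, because that instruction is only produced when the recommendation is acted upon.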

crusaderky · 2022-05-13T11:41:56Z

I think there may be a regression here: #6305 (comment)

fjetter force-pushed the in_flight_released branch 2 times, most recently from 768a3a6 to b2fcca3 · April 27, 2022 11:39

fjetter requested review from crusaderky and gjoseph92 April 27, 2022 11:51

fjetter mentioned this pull request Apr 27, 2022

Pass on in-flight transfers if they are already released #6199

Closed

fjetter mentioned this pull request Apr 29, 2022

Release 2022.4.2 dask/community#240

Closed
