Skip to content

WorkerState unit tests for resumed state#6688

Merged
crusaderky merged 12 commits intodask:mainfrom
crusaderky:resumed_state
Jul 8, 2022
Merged

WorkerState unit tests for resumed state#6688
crusaderky merged 12 commits intodask:mainfrom
crusaderky:resumed_state

Conversation

@crusaderky
Copy link
Copy Markdown
Collaborator

@crusaderky crusaderky commented Jul 7, 2022

Partially closes #6689

@crusaderky crusaderky self-assigned this Jul 7, 2022
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jul 7, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±  0         15 suites  ±0   6h 26m 2s ⏱️ - 5m 57s
  2 938 tests +  7    2 846 ✔️ +  1       86 💤 +  1  4 +3  2 🔥 +2 
21 774 runs  +63  20 748 ✔️ +29  1 018 💤 +27  6 +5  2 🔥 +2 

For more details on these failures and errors, see this check.

Results for commit 33d846e. ± Comparison against base commit ade4266.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky marked this pull request as ready for review July 8, 2022 09:21
@crusaderky
Copy link
Copy Markdown
Collaborator Author

crusaderky commented Jul 8, 2022

This is ready for merge; remediation of the issue will follow in a separate PR.
@hendrikmakait please review.

- executing -> cancelled -> resumed -> executing
- executing -> long-running -> cancelled -> resumed -> long-running
- executing -> cancelled -> executing
- executing -> long-running -> cancelled -> long-running
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we actually send the SecedeEvent?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in ws_with_running_task

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, the question was directed toward when we "decide" that the second version of the task should also secede(), but I think this makes sense to me after another look at the codebase.

ComputeTaskEvent.dummy("z", who_has={"x": [ws2]}, stimulus_id="s3"),
)
assert instructions == [
GatherDep(worker=ws2, to_gather={"x"}, total_nbytes=1, stimulus_id="s1")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to fire off a second GatherDep for x if the previous one is already ongoing?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no second GatherDep in the instructions.
In flight tasks can be in flight from a single worker at a time.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked that we're firing all instructions in a single call to ws.handle_stimulus. I thought that this was just the last ComputeTaskEvent creating the instruction.

FreeKeysEvent(keys=["y", "x"], stimulus_id="s2"),
ComputeTaskEvent.dummy("x", stimulus_id="s3"),
# Peer worker does not have the data
GatherDepSuccessEvent(worker=ws2, total_nbytes=1, data={}, stimulus_id="s4"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a GatherDepFailureEvent?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A task can get out of flight in one of the following ways:

  • GatherDepSuccessEvent, and the task is in data
  • GatherDepSuccessEvent, but the task is not in data (the remote worker says it doesn't have it)
  • GatherDepNetworkFailureEvent (the remote worker is dead)
  • GatherDepBusyEvent (the remote worker is busy)
  • GatherDepFailureEvent (edge case, generic error e.g. failure to deserialize)

All but the first are a failure to gather the key. That's what I was explaining in the commend on the previous line.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docstring gather_dep later terminates with a failure I was under the impression that we were testing an actual GatherDepFailureEvent. Is there a way we can make this clearer? gather_dep terminates successfully but does not have the data, resulting in a failure might be an alternative to stress the point and the rather intricate difference between the successful event and the failing outcome.

Alternatively, I'd prefer just using anything but GatherDepSuccessEvent to avoid further confusion.

crusaderky and others added 5 commits July 8, 2022 13:26
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Copy link
Copy Markdown
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, feel free to ignore my comment about a clearer docstring for test_workerstate_flight_failure_to_executing

@crusaderky crusaderky merged commit 887b442 into dask:main Jul 8, 2022
@crusaderky crusaderky deleted the resumed_state branch July 8, 2022 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deadlock when a flaky task is stolen

2 participants