WorkerState unit tests for resumed state (#6688)
Unit Test Results: 15 files ±0, 15 suites ±0, 6h 26m 2s ⏱️ (−5m 57s). Results for commit 33d846e; comparison against base commit ade4266. See the test report for an extended history of previous test failures, which is useful for diagnosing flaky tests.
This is ready for merge; remediation of the issue will follow in a separate PR.
- executing -> cancelled -> resumed -> executing
- executing -> long-running -> cancelled -> resumed -> long-running
- executing -> cancelled -> executing
- executing -> long-running -> cancelled -> long-running
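The resumed chains above hinge on a cancelled task remembering which run state it came from. A minimal, self-contained sketch of that idea (hypothetical `Task` class and method names for illustration only, not the actual `distributed` WorkerState API):

```python
class Task:
    """Toy model of the cancelled/resumed transition chains."""

    def __init__(self):
        self.state = "executing"
        self._prev_run_state = None
        self.history = ["executing"]

    def _set(self, state):
        self.state = state
        self.history.append(state)

    def secede(self):
        # executing -> long-running
        assert self.state == "executing"
        self._set("long-running")

    def cancel(self):
        # Remember the run state so resume() can restore it.
        assert self.state in ("executing", "long-running")
        self._prev_run_state = self.state
        self._set("cancelled")

    def resume(self):
        # cancelled -> resumed -> back to the prior run state
        assert self.state == "cancelled"
        self._set("resumed")
        self._set(self._prev_run_state)


t = Task()
t.cancel()
t.resume()
# -> ["executing", "cancelled", "resumed", "executing"]

t2 = Task()
t2.secede()
t2.cancel()
t2.resume()
# -> ["executing", "long-running", "cancelled", "resumed", "long-running"]
```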
When do we actually send the `SecedeEvent`?
In `ws_with_running_task`.
Never mind, the question was directed toward when we "decide" that the second version of the task should also secede(), but I think this makes sense to me after another look at the codebase.
```python
    ComputeTaskEvent.dummy("z", who_has={"x": [ws2]}, stimulus_id="s3"),
)
assert instructions == [
    GatherDep(worker=ws2, to_gather={"x"}, total_nbytes=1, stimulus_id="s1")
```
Does it make sense to fire off a second GatherDep for x if the previous one is already ongoing?
There is no second GatherDep in the instructions.
A task can be in flight from only a single worker at a time.
I overlooked that we're firing all instructions in a single call to `ws.handle_stimulus`; I thought only the last `ComputeTaskEvent` created the instruction.
```python
    FreeKeysEvent(keys=["y", "x"], stimulus_id="s2"),
    ComputeTaskEvent.dummy("x", stimulus_id="s3"),
    # Peer worker does not have the data
    GatherDepSuccessEvent(worker=ws2, total_nbytes=1, data={}, stimulus_id="s4"),
```
Shouldn't this be a `GatherDepFailureEvent`?
A task can get out of flight in one of the following ways:
- `GatherDepSuccessEvent`, and the task is in the data
- `GatherDepSuccessEvent`, but the task is not in the data (the remote worker says it doesn't have it)
- `GatherDepNetworkFailureEvent` (the remote worker is dead)
- `GatherDepBusyEvent` (the remote worker is busy)
- `GatherDepFailureEvent` (edge case; a generic error, e.g. failure to deserialize)
All but the first are a failure to gather the key. That's what I was explaining in the comment on the previous line.
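The distinction above can be sketched as a tiny classifier (hypothetical helper written for illustration, not part of the `distributed` API): only a success event that actually carries the key's data counts as a successful gather.

```python
def gather_outcome(event: str, key: str, data: dict) -> str:
    """Classify how a task leaves the 'flight' state.

    A GatherDepSuccessEvent is only a real success if the key is
    present in the returned data; every other case (success event
    without the key, network failure, busy worker, generic failure)
    fails the gather.
    """
    if event == "GatherDepSuccessEvent" and key in data:
        return "success"
    return "failure"


gather_outcome("GatherDepSuccessEvent", "x", {"x": 1})   # "success"
gather_outcome("GatherDepSuccessEvent", "x", {})         # "failure"
gather_outcome("GatherDepNetworkFailureEvent", "x", {})  # "failure"
```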
From the docstring *gather_dep later terminates with a failure*, I was under the impression that we were testing an actual `GatherDepFailureEvent`. Is there a way we can make this clearer? *gather_dep terminates successfully but does not have the data, resulting in a failure* might be an alternative that stresses the rather intricate difference between the successful event and the failing outcome.
Alternatively, I'd prefer just using anything but `GatherDepSuccessEvent` to avoid further confusion.
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
hendrikmakait left a comment:
LGTM, feel free to ignore my comment about a clearer docstring for `test_workerstate_flight_failure_to_executing`.
Partially closes #6689