Alternative handling of killed/unexpected worker in transition_processing_erred#6939
Alternative handling of killed/unexpected worker in transition_processing_erred#6939
transition_processing_erred#6939Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ±0 15 suites ±0 6h 34m 21s ⏱️ + 7m 20s For more details on these failures, see this check. Results for commit 3ca7e35. ± Comparison against base commit c15a10e. |
|
|
||
|
|
||
| def _remove_from_processing(state: SchedulerState, ts: TaskState) -> str: | ||
| def _remove_from_processing(state: SchedulerState, ts: TaskState) -> str | None: |
There was a problem hiding this comment.
After putting some more thought into it, I think that _remove_from_processing should always return None. Its job is to
Remove ts from the set of processing tasks.
At either return statement, it has successfully completed its task.
In your PR, transition_processing_released is the only place where we care about the undocumented feature that it makes a distinction for us between returning None if it encounters a stale worker state and ws.address if it encounters a current worker state. IMO, this muddles up mutation and querying logic and introduces additional complexity that would at least need to be documented. Why not let transition_processing_released worry about it?
From #6392 we also know that returning an address is most likely not a good idea since it's not a unique identifier.
There was a problem hiding this comment.
I'd be fine with that. The one argument for it not always returning None is that it's kinda like a pop—it does ts.processing_on = None, so after it's run, callers can't find out what worker the task was processing on.
It could be reasonable to say that callers need to pull out ts.processing_on first, but then they also need to check if that WorkerState is stale, which is more places to get things wrong. Since _remove_from_processing has to do the stale check anyway, I kinda like having it return the result of that check (in that it returns None if the worker is stale). Returning a WorkerState instead of an address in the non-stale case might be more practical though.
|
|
||
| ts.erred_on.add(w) | ||
| # NOTE: `worker` is None if it died, i.e. `KilledWorker` | ||
| ts.erred_on.add(worker or maybe_stale_ws.address) |
There was a problem hiding this comment.
Having two possible sources for the worker address we use here still feels like a consistency issue. Why not just use maybe_stale_ws.address?
There was a problem hiding this comment.
Because at the moment, it's possible that the task ran (and failed) on a worker we didn't expect it to. Since this PR was making the erred case consistent with memory, it's possible worker != maybe_stale_ws.address, in which case worker is the correct address since that's where the task actually ran and failed.
|
Closed by #6946 |
Alternative approach to #6884, just for discussion—easier than trying to write out what I was thinking.
_remove_from_processingreturns None again if the WorkerState is stale Always returnws.addressfrom_remove_from_processing#6884 (comment)ws.addressfrom_remove_from_processing#6884 (comment)I feel like we also need need a test triggering whatever caused #6874 to happen?
cc @hendrikmakait @fjetter
Closes #6874
pre-commit run --all-files