Conversation
There was a problem hiding this comment.
This works, but the timings can be very slow on a busy cluster:
- a task which loses its last worker from who_has will not be included in the find_missing query until it reaches the very top of data_needed. On a busy node, this can take tens of seconds.
- a task which gains a worker in who_has while all other workers are in flight or busy will not transition from fetch to flight until something else kicks off ensure_communicating. The realistic worst case is that nothing will until a worker goes out of flight, which may take tens of seconds.
| if not ts.who_has: | ||
| recommendations[ts] = "missing" | ||
| continue | ||
|
|
There was a problem hiding this comment.
Must remove assertion (and add a comment why the assertion isn't there) in validate_task_fetch
| "which is not true.", | ||
| self.address, | ||
| ts, | ||
| ) |
There was a problem hiding this comment.
Must indent one more level
| for worker in del_workers: | ||
| self.has_what[worker].discard(key) | ||
| # Can't remove from self.data_needed_per_worker; there is logic | ||
| # in _select_keys_for_gather to deal with this |
There was a problem hiding this comment.
missing the mentioned logic from my PR
diff --git a/distributed/worker.py b/distributed/worker.py
index 1f086df6e..1b1c412ab 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2679,7 +2679,7 @@ class Worker(ServerNode):
assert not args
finish, *args = finish # type: ignore
- if ts is None or ts.state == finish:
+ if ts is None:
return {}, []makes I don't know, yet, if this has any unforeseen side effects, though |
It doesn't work properly - the task is transitioned to missing and is "rescued" 1 second later by find_missing. |
Alternative to #6342
This includes all the logic of #6342 in
update_who_hasthat reduces the information in has_whas/who_has, i.e. ensure that the state aligns with the most recent information the scheduler provided but it does not include the transitions as part of update_who_has. This allows for an overall much less invasive change.It also involved the tests proposed over there. The only modifications are that I removed the update-who-has log and reduced the story events in the log a bit.
@crusaderky You still have a few other cleanup things in your PR. I was merely curious to get your thoughts on this because I would prefer having a smaller footprint of this change and it seems to be doable