AssertionError in decide_worker_non_rootish #8019
Closed
Labels
bug (Something is broken), deadlock (The cluster appears to not make any progress)
Description
I hit an intermittent CI failure in test_submit_after_failed_worker_async:
https://github.com/dask/distributed/actions/runs/5610919845/jobs/10266547694?pr=8013
I believe this is separate from #6311, since the error is different:
2023-07-20 12:29:03,679 - distributed.scheduler - ERROR - Error transitioning 'sum-1662e8752439e04be62efae3b3703604' from 'waiting' to 'processing'
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 1918, in _transition
    recommendations, client_msgs, worker_msgs = func(
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2292, in transition_waiting_processing
    if not (ws := self.decide_worker_non_rootish(ts)):
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2268, in decide_worker_non_rootish
    assert ws in self.running, (ws, self.running)
AssertionError: (<WorkerState 'tcp://127.0.0.1:34037', status: closing, memory: 4, processing: 0>, {<WorkerState 'tcp://127.0.0.1:45709', name: 1, status: running, memory: 3, processing: 0>, <WorkerState 'tcp://127.0.0.1:34071', name: 0, status: running, memory: 3, processing: 0>})
2023-07-20 12:29:03,681 - distributed.nanny - INFO - Worker closed
2023-07-20 12:29:03,687 - distributed.scheduler - INFO - Remove client Client-026cc191-26f9-11ee-8805-000d3ae2e24a
2023-07-20 12:29:03,688 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:34037', status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1689856143.6881266')
2023-07-20 12:29:03,689 - distributed.scheduler - INFO - Close client connection: Client-026cc191-26f9-11ee-8805-000d3ae2e24a
2023-07-20 12:29:03,689 - distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 930, in _handle_comm
    result = await result
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 5478, in add_client
    await self.handle_stream(comm=comm, extra={"client": client})
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 1014, in handle_stream
    handler(**merge(extra, msg))
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 4488, in update_graph
    self.transitions(recommendations, stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 7566, in transitions
    self._transitions(recommendations, client_msgs, worker_msgs, stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2024, in _transitions
    new_recs, new_cmsgs, new_wmsgs = self._transition(key, finish, stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 1918, in _transition
    recommendations, client_msgs, worker_msgs = func(
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2292, in transition_waiting_processing
    if not (ws := self.decide_worker_non_rootish(ts)):
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2268, in decide_worker_non_rootish
    assert ws in self.running, (ws, self.running)
AssertionError: (<WorkerState 'tcp://127.0.0.1:34037', status: closed, memory: 0, processing: 0>, {<WorkerState 'tcp://127.0.0.1:45709', name: 1, status: running, memory: 0, processing: 0>, <WorkerState 'tcp://127.0.0.1:34071', name: 0, status: running, memory: 0, processing: 0>})
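To make the failing invariant concrete, here is a toy model of what the assertion checks. It is only a sketch and not the scheduler's code: `ToyWorker`, `pick_worker`, and `pick_worker_filtered` are hypothetical names, and the assumption is simply that the chosen worker came from a candidate collection that still contained a worker no longer in the running set, as the `(ws, self.running)` payload in the logs suggests. The filtered variant shows one possible guard; it is not necessarily how this should be fixed upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyWorker:
    # Hypothetical stand-in for a scheduler-side worker record.
    address: str
    status: str  # e.g. "running", "closing", "closed"


def pick_worker(candidates, running):
    """Take the first candidate and assert it is running.

    Mirrors the shape of `assert ws in self.running, (ws, self.running)`.
    """
    ws = candidates[0]
    assert ws in running, (ws, running)
    return ws


def pick_worker_filtered(candidates, running):
    """Return the first candidate that is in the running set, else None."""
    for ws in candidates:
        if ws in running:
            return ws
    return None


# Addresses copied from the log above; the statuses match the first error.
closing = ToyWorker("tcp://127.0.0.1:34037", "closing")
w0 = ToyWorker("tcp://127.0.0.1:34071", "running")
w1 = ToyWorker("tcp://127.0.0.1:45709", "running")
running = {w0, w1}

# A stale candidate list that still holds the closing worker.
stale_candidates = [closing, w0, w1]
```

With these definitions, `pick_worker(stale_candidates, running)` raises `AssertionError` with the same kind of payload as in the traceback, while `pick_worker_filtered(stale_candidates, running)` skips the closing worker and returns `w0`.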