AssertionError in decide_worker_non_rootish #8019

@crusaderky

I hit an intermittent CI failure in test_submit_after_failed_worker_async:
https://github.com/dask/distributed/actions/runs/5610919845/jobs/10266547694?pr=8013

I understand this is separate from #6311 as the error is different:

2023-07-20 12:29:03,679 - distributed.scheduler - ERROR - Error transitioning 'sum-1662e8752439e04be62efae3b3703604' from 'waiting' to 'processing'
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 1918, in _transition
    recommendations, client_msgs, worker_msgs = func(
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2292, in transition_waiting_processing
    if not (ws := self.decide_worker_non_rootish(ts)):
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2268, in decide_worker_non_rootish
    assert ws in self.running, (ws, self.running)
AssertionError: (<WorkerState 'tcp://127.0.0.1:34037', status: closing, memory: 4, processing: 0>, {<WorkerState 'tcp://127.0.0.1:45709', name: 1, status: running, memory: 3, processing: 0>, <WorkerState 'tcp://127.0.0.1:34071', name: 0, status: running, memory: 3, processing: 0>})
2023-07-20 12:29:03,681 - distributed.nanny - INFO - Worker closed
2023-07-20 12:29:03,687 - distributed.scheduler - INFO - Remove client Client-026cc191-26f9-11ee-8805-000d3ae2e24a
2023-07-20 12:29:03,688 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:34037', status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1689856143.6881266')
2023-07-20 12:29:03,689 - distributed.scheduler - INFO - Close client connection: Client-026cc191-26f9-11ee-8805-000d3ae2e24a
2023-07-20 12:29:03,689 - distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 930, in _handle_comm
    result = await result
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 5478, in add_client
    await self.handle_stream(comm=comm, extra={"client": client})
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 1014, in handle_stream
    handler(**merge(extra, msg))
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 4488, in update_graph
    self.transitions(recommendations, stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 7566, in transitions
    self._transitions(recommendations, client_msgs, worker_msgs, stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2024, in _transitions
    new_recs, new_cmsgs, new_wmsgs = self._transition(key, finish, stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 1918, in _transition
    recommendations, client_msgs, worker_msgs = func(
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2292, in transition_waiting_processing
    if not (ws := self.decide_worker_non_rootish(ts)):
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 2268, in decide_worker_non_rootish
    assert ws in self.running, (ws, self.running)
AssertionError: (<WorkerState 'tcp://127.0.0.1:34037', status: closed, memory: 0, processing: 0>, {<WorkerState 'tcp://127.0.0.1:45709', name: 1, status: running, memory: 0, processing: 0>, <WorkerState 'tcp://127.0.0.1:34071', name: 0, status: running, memory: 0, processing: 0>})
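
To make the failing invariant concrete, here is a minimal sketch of the check that trips in both tracebacks above: the worker chosen for a task is expected to be in the scheduler's set of running workers, but here it was a worker in `closing`/`closed` status. `FakeWorker`, `FakeScheduler`, and the `candidates` argument are hypothetical stand-ins for illustration only, not distributed's actual `WorkerState`, `Scheduler`, or worker-selection logic. How the closing worker became a candidate is not shown in the logs, so this sketch simply assumes it did.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FakeWorker:
    # Hypothetical stand-in for a worker record; frozen so it is hashable.
    address: str
    status: str = "running"


@dataclass
class FakeScheduler:
    # Hypothetical stand-in that models only the two collections involved.
    workers: dict = field(default_factory=dict)  # every known worker
    running: set = field(default_factory=set)    # workers eligible for new tasks

    def add(self, ws: FakeWorker) -> None:
        self.workers[ws.address] = ws
        if ws.status == "running":
            self.running.add(ws)

    def decide_worker(self, candidates: list) -> FakeWorker:
        # Assumption: candidates were computed elsewhere and may include
        # a worker that has since left the running set.
        ws = min(candidates, key=lambda w: w.address)
        # Same shape as the assertion in the traceback.
        assert ws in self.running, (ws, self.running)
        return ws


sched = FakeScheduler()
healthy = FakeWorker("tcp://127.0.0.1:1")
closing = FakeWorker("tcp://127.0.0.1:2", status="closing")
sched.add(healthy)
sched.add(closing)

# Restricting candidates to running workers satisfies the invariant.
picked = sched.decide_worker([w for w in sched.workers.values() if w in sched.running])

# Passing the closing worker as the only candidate trips the assertion.
try:
    sched.decide_worker([closing])
    tripped = False
except AssertionError:
    tripped = True
```

In this sketch the assertion fires only because the candidate list was not intersected with `running`; whether the real failure has the same cause would need to be confirmed against the scheduler code.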

Labels: bug (Something is broken), deadlock (The cluster appears to not make any progress)
