Seems similar to #3256, which was eventually fixed by #3321: we're now seeing the following scheduler logs:
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealin
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 391, in balance
    level, ts, sat, thief, duration, cost_multiplier
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 291, in maybe_move_task
    self.move_task_request(ts, sat, idl)
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 167, in move_task_request
    self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://10.24.81.35:39991'
Looks possibly related to #3069 based on the lines that were changed there (our tasks do not use resources though). I believe the error occurs when a worker goes down while stealing is underway, but it's not easy to reproduce w/o a very large job.
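To make the suspected race concrete, here is a toy sketch of what I think is happening. It does not use distributed at all: `FakeComm`, `ToyScheduler`, and both `move_task_request_*` functions are hypothetical stand-ins. Only the `stream_comms[victim.address]` lookup is taken from the traceback. The guarded variant is just one possible defensive approach, not a claim about how the actual fix should look.

```python
# Toy model only: hypothetical stand-ins, not distributed's real classes.
class FakeComm:
    def __init__(self):
        self.sent = []

    def send(self, msg):
        self.sent.append(msg)


class ToyScheduler:
    def __init__(self, addresses):
        # Maps worker address -> comm, like the stream_comms dict in the traceback.
        self.stream_comms = {addr: FakeComm() for addr in addresses}

    def remove_worker(self, address):
        # Assumption: a worker going down drops its comm entry.
        self.stream_comms.pop(address, None)


def move_task_request_unguarded(scheduler, victim_address, key):
    # Mirrors the failing lookup: raises KeyError if the victim is gone.
    # The message contents are illustrative.
    scheduler.stream_comms[victim_address].send({"op": "steal-request", "key": key})


def move_task_request_guarded(scheduler, victim_address, key):
    # One possible defensive variant: skip the steal if the victim has left.
    comm = scheduler.stream_comms.get(victim_address)
    if comm is None:
        return False
    comm.send({"op": "steal-request", "key": key})
    return True


if __name__ == "__main__":
    victim = "tcp://10.24.81.35:39991"
    scheduler = ToyScheduler([victim, "tcp://10.0.0.2:40000"])
    # The victim was chosen earlier, then the worker disappeared.
    scheduler.remove_worker(victim)
    try:
        move_task_request_unguarded(scheduler, victim, "task-1")
    except KeyError as exc:
        print("unguarded lookup failed with KeyError:", exc)
    print("guarded steal sent:", move_task_request_guarded(scheduler, victim, "task-1"))
```

If this is the failure mode, the exception would abort the whole `balance` callback partway through, which might also explain tasks being left in an inconsistent state.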
Weirdly I'm actually seeing two different symptoms, which might mean there are actually two bugs here:
- sometimes the tasks show up in the worker info page as `processing`, but the worker call stacks are empty and nothing ever happens
- sometimes the tasks simply show as `waiting` indefinitely and never reach the `processing` phase at all
cc @seibert from that PR and also @fjetter from #3619 just in case either of y'all have any theories 🙂