Tasks lost by cluster rescale during stealing #3892

@bnaul

Seems similar to #3256, which was eventually fixed by #3321. We're now seeing the following scheduler logs:

tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealin
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 391, in balance
    level, ts, sat, thief, duration, cost_multiplier
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 291, in maybe_move_task
    self.move_task_request(ts, sat, idl)
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 167, in move_task_request
    self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://10.24.81.35:39991'

This looks possibly related to #3069, judging by the lines that were changed there (though our tasks do not use resources). I believe the error occurs when a worker goes down while stealing is underway, but it is hard to reproduce without a very large job.
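
To make the suspected race concrete, here is a minimal sketch, assuming the failure is a plain dict lookup on an address that was dropped when the worker left. `FakeComm` and this standalone `move_task_request` are hypothetical stand-ins rather than the actual `distributed` code, and the message shape is illustrative only.

```python
# Toy model of the suspected race; NOT the real distributed.stealing code.
class FakeComm:
    """Hypothetical stand-in for a scheduler-to-worker stream."""

    def __init__(self):
        self.sent = []

    def send(self, msg):
        self.sent.append(msg)


def move_task_request(stream_comms, victim_address, key):
    """Ask the victim to release `key`; return False if the victim is gone."""
    comm = stream_comms.get(victim_address)
    if comm is None:
        # The worker was removed after it was chosen as a victim but
        # before the request was sent. An unguarded stream_comms[addr]
        # lookup would raise KeyError at this point.
        return False
    comm.send({"op": "steal-request", "key": key})  # illustrative message
    return True


stream_comms = {"tcp://10.24.81.35:39991": FakeComm()}
del stream_comms["tcp://10.24.81.35:39991"]  # simulate the worker leaving
sent = move_task_request(stream_comms, "tcp://10.24.81.35:39991", "task-1")
print(sent)
```

Whether skipping the request is enough is an open question: if the scheduler has already recorded the task as in flight for stealing, that bookkeeping presumably also needs to be undone, which might explain tasks that stay stuck.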

Weirdly, I'm seeing two different symptoms, which might mean there are two bugs here:

  • sometimes the tasks show up in the worker info page as processing, but the worker call stacks are empty and nothing ever happens
  • sometimes the tasks simply show as waiting indefinitely and never reach the processing phase at all

cc @seibert from that PR and also @fjetter from #3619 just in case either of y'all have any theories 🙂
