Consider connection-failure worker closures as safe? #6386

@gjoseph92

With #6361, any temporary network disconnect will shut down the worker.

Currently, that won't be considered a safe closure, so any tasks running on the worker will be marked as suspicious. In an unreliable network environment, that could lead to tasks erroring (with a KilledWorker exception) purely because of network disconnects.
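
To make the failure mode concrete, here is a toy model of the bookkeeping as I understand it. It is not the scheduler's actual code: `ToyTask` and `remove_worker` are made-up names, and the threshold of 3 is an assumption based on the default of the `distributed.scheduler.allowed-failures` setting.

```python
from dataclasses import dataclass

# Assumed threshold, mirroring the default of the
# `distributed.scheduler.allowed-failures` config value.
ALLOWED_FAILURES = 3


@dataclass
class ToyTask:
    """Hypothetical stand-in for the scheduler's per-task state."""
    key: str
    suspicious: int = 0
    state: str = "processing"


def remove_worker(tasks, safe, allowed_failures=ALLOWED_FAILURES):
    """Toy version of what happens to a departing worker's running tasks."""
    for task in tasks:
        if task.state == "erred":
            continue  # already failed; nothing left to reschedule
        if not safe:
            # An unsafe closure counts against every task that was running.
            task.suspicious += 1
            if task.suspicious > allowed_failures:
                # In the real scheduler this surfaces as KilledWorker.
                task.state = "erred"
                continue
        # Otherwise the task simply becomes eligible to run elsewhere.
        task.state = "released"
    return tasks


# A task that is unlucky enough to be running during four disconnects
# is errored, even though the task itself never misbehaved.
flaky = ToyTask("x")
for _ in range(4):
    remove_worker([flaky], safe=False)

# The same four disconnects treated as safe closures leave it untouched.
fine = ToyTask("y")
for _ in range(4):
    remove_worker([fine], safe=True)
```

In this sketch the only difference between `flaky` and `fine` is the `safe` flag, which is the flag this issue is asking about.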

Differentiating between a transient network failure and a worker crashing and disconnecting isn't possible on the scheduler side (until we re-implement reconnection). So there may be nothing we can do here.

But perhaps we could at least try to signal this from the Nanny? For example, if the worker is shutting down due to a network interruption, it could tell the Nanny, which could then try to tell the scheduler. There are some race conditions here, though, around when the worker<->scheduler comm breaks, since the scheduler immediately removes the worker state and marks the tasks as suspicious.
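
A rough sketch of why the ordering matters. Nothing here exists in distributed: the event names and `handle_events` are hypothetical, and the model assumes the scheduler decides safe vs. unsafe at the moment it sees the worker comm close.

```python
def handle_events(events):
    """Toy scheduler loop.

    Returns True if the worker's removal was treated as safe, False if it
    was treated as unsafe, and None if the worker comm never closed.
    """
    announced_safe = False
    for event in events:
        if event == "nanny-reports-network-shutdown":
            # Hypothetical message from the Nanny vouching for the worker.
            announced_safe = True
        elif event == "worker-comm-closed":
            # The worker is removed immediately, using whatever is known now.
            return announced_safe
    return None


# The Nanny's message wins the race: the closure is treated as safe.
nanny_first = handle_events(
    ["nanny-reports-network-shutdown", "worker-comm-closed"]
)

# The broken comm is noticed first: tasks are already marked suspicious
# by the time the Nanny's message arrives.
comm_first = handle_events(
    ["worker-comm-closed", "nanny-reports-network-shutdown"]
)
```

So any Nanny-based signal would either have to arrive before the scheduler notices the broken comm, or the scheduler would need a way to retroactively undo the suspicious marking.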

Labels: discussion, networking