Description
With #6361, any temporary network disconnect will shut down the worker.
Currently, that won't be considered a safe closure. Any tasks running on the worker will be marked as suspicious. In an unreliable network environment, that could lead to tasks being errored (with a KilledWorker exception) just due to network disconnects.
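To make the failure mode concrete, here is a toy model of the "suspicious" counter. It is a sketch, not the scheduler's actual code: `TaskSketch`, `KilledWorkerSketch`, `on_worker_removed` and `disconnects_until_error` are hypothetical names, and the threshold is assumed to match the default of the `distributed.scheduler.allowed-failures` config value.

```python
from dataclasses import dataclass

# Assumed to mirror the default of `distributed.scheduler.allowed-failures`.
ALLOWED_FAILURES = 3


class KilledWorkerSketch(Exception):
    """Hypothetical stand-in for distributed's KilledWorker exception."""


@dataclass
class TaskSketch:
    """Hypothetical stand-in for the scheduler's per-task state."""
    key: str
    suspicious: int = 0


def on_worker_removed(task: TaskSketch, safe: bool, allowed: int) -> None:
    """Toy handling of a task that was running on a worker that went away."""
    if safe:
        return  # a safe closure does not count against the task
    task.suspicious += 1
    if task.suspicious > allowed:
        raise KilledWorkerSketch(task.key)


def disconnects_until_error(allowed: int = ALLOWED_FAILURES) -> int:
    """Count how many unsafe worker removals one task survives before erroring."""
    task = TaskSketch("x")
    n = 0
    while True:
        n += 1
        try:
            on_worker_removed(task, safe=False, allowed=allowed)
        except KilledWorkerSketch:
            return n


if __name__ == "__main__":
    print(disconnects_until_error())
```

Under these assumptions, a long-running task that is unlucky enough to be on a worker during a handful of network blips would be errored even though no worker ever crashed.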
Differentiating between a transient network failure and a worker crashing and disconnecting isn't possible on the scheduler side (until we re-implement reconnection). So there may be nothing we can do here.
But perhaps we could at least try to signal this from the Nanny? For example, if the worker is shutting down because of a network interruption, it could tell the Nanny, which could in turn try to tell the scheduler. There are race conditions here, though: as soon as the worker<->scheduler comm breaks, the scheduler removes the worker state and marks its tasks as suspicious, so the Nanny's message may arrive too late.
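The race can be illustrated with a small, purely hypothetical model. Nothing here exists in distributed: `SchedulerSketch`, `handle_nanny_notice` and `handle_comm_closed` are invented names for the idea of a Nanny vouching that a shutdown was network-related.

```python
class SchedulerSketch:
    """Hypothetical scheduler model; not distributed's Scheduler."""

    def __init__(self):
        self.benign = set()    # workers the Nanny has vouched for
        self.suspicious = {}   # task key -> suspicious count
        self.running = {}      # worker address -> keys of running tasks

    def handle_nanny_notice(self, worker):
        # Hypothetical message: "this worker is closing due to a network interruption".
        self.benign.add(worker)

    def handle_comm_closed(self, worker):
        # The worker is removed immediately when its comm breaks.
        safe = worker in self.benign
        for key in self.running.pop(worker, []):
            if not safe:
                self.suspicious[key] = self.suspicious.get(key, 0) + 1
        self.benign.discard(worker)


def run(order):
    """Deliver the two events in the given order; return the task's suspicious count."""
    s = SchedulerSketch()
    s.running["w1"] = ["x"]
    for event in order:
        getattr(s, event)("w1")
    return s.suspicious.get("x", 0)


if __name__ == "__main__":
    print(run(["handle_nanny_notice", "handle_comm_closed"]))
    print(run(["handle_comm_closed", "handle_nanny_notice"]))
```

In this model the notice only helps when it reaches the scheduler before the broken comm is noticed. In the other order the task has already been marked suspicious, so a real design would need some way to retract that mark or to delay it.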